Fast multi-source nanophotonic simulations using augmented partial factorization

Numerical solutions of Maxwell’s equations are indispensable for nanophotonics and electromagnetics but are constrained when it comes to large systems, especially multi-channel ones such as disordered media, aperiodic metasurfaces and densely packed photonic circuits where the many inputs require many large-scale simulations. Conventionally, before extracting the quantities of interest, Maxwell’s equations are first solved on every element of a discretization basis set that contains much more information than is typically needed. Furthermore, such simulations are often performed one input at a time, which can be slow and repetitive. Here we propose to bypass the full-basis solutions and directly compute the quantities of interest while also eliminating the repetition over inputs. We do so by augmenting the Maxwell operator with all the input source profiles and all the output projection profiles, followed by a single partial factorization that yields the entire generalized scattering matrix via the Schur complement, with no approximation beyond discretization. This method applies to any linear partial differential equation. Benchmarks show that this approach is 1,000–30,000,000 times faster than existing methods for two-dimensional systems with about 10,000,000 variables. As examples, we demonstrate simulations of entangled photon backscattering from disorder and high-numerical-aperture metalenses that are thousands of wavelengths wide.

The interaction between light and nanostructured materials leads to rich properties. For small systems such as individual nano-/micro-structures and optical components, or for periodic systems such as photonic crystals and metamaterials that can be reduced to unit cells, one can readily solve Maxwell's equations numerically to obtain predictions that agree quantitatively with experiments. But the computation costs are typically too heavy for more complex systems such as disordered ones 1,2 which not only are large but also couple many incoming channels to many outgoing ones, requiring numerous simulations. The alternatives all have limitations: Born approximation does not describe multiple scattering, radiative transport and diagrammatic methods can only compute certain ensemble-averaged properties, 1,2 and coupled-mode theory is restricted to systems with few isolated resonances. 3,4 For metasurfaces, 5,6 the widely used locally periodic approximation 6,7 is inaccurate whenever the cell-to-cell variation is large [8][9][10][11][12] and cannot describe nonlocal responses 13,14 and metasurfaces that are not based on unit cells. [15][16][17]19 and quantum 20,21 photonic circuits build on individual components that couple very few channels at a time, limiting the number of inputs and outputs. Examples beyond photonics also abound. A wide range of studies across different disciplines are currently prohibited by computational limitations.
Regardless of the complexity of a system, its linear response is described exactly by a generalized M × M scattering matrix S that relates an arbitrary input vector v to the resulting output vector u via [22][23][24][25] The M columns of S correspond to M distinct inputs, * cwhsu@usc.edu illustrated in Fig. 1a-b. The elements of the input vector v can be the amplitudes of incoming channels in the far field or in waveguides, or point-dipole excitations in the near field, or any other input of interest; similarly, vector u can contain any output of interest.
Computing such a multi-input response typically requires M distinct solutions of Maxwell's equations with the same structure but with different source profiles. Time-domain methods 26 are easy to parallelize but cannot leverage the multi-input property. Frequency-domain methods allow strategies for handling many inputs. After volume discretization through finite element 27 or finite difference, 28,29 Maxwell's equations in frequency domain becomes a system of linear equations Ax m = b m ; sparse matrix A is the Maxwell differential operator, the right-hand-side column vector b m specifies the m-th input, and the solution is contained in column vector x m . When solving for x m = A −1 b m with direct methods, the sparsity can be utilized via graph partitioning, and the resulting LU factors can be reused among different inputs. 30,31 But M forward and backward substitutions are still needed, and the LU factors take up substantial memory. Iterative methods compute x m = A −1 b m by minimizing the residual, 32 avoiding the LU factors. One can iterate multiple inputs together 33 or construct preconditioners to be reused among different inputs, 34,35 but the iterations still take O(M ) time.
The boundary element method (BEM) 36 discretizes the interface between materials to reduce the size and the condition number of matrix A, though its matrix A is no longer sparse; it is efficient for homogeneous structures with a small surface-to-volume ratio. Instead of a surface mesh, the T -matrix method 37 uses vector spherical harmonics as basis functions, also resulting in a dense matrix A. The hierarchical structure of the dense matrix A can be utilized through the fast multipole method [38][39][40] within iterative solvers or through the H-matrix method 41  Generalized scattering matrix and augmented partial factorization (APF). a-b, Inputs from two different angles, corresponding to two columns of the generalized scattering matrix S. c, Illustration of Eq. (2), which relates S to the Maxwell operator A, source profiles B that generate the incident waves, projection profiles C that extract the outputs of interest, and matrix D that subtracts the baseline. d, The system used for the illustration in c, with color coding for different parts of the system. e, The augmented sparse matrix K of Eq. (3), whose partial factorization gives the scattering matrix S.
For systems with a closed boundary on the sides and inputs/outputs placed on the front and back surfaces, the recursive Green's function (RGF) method 42 can obtain the full scattering matrix without looping over the inputs, which is useful for disordered systems. 43,44 However, RGF works with dense Green's function matrices, so it scales unfavorably with the system width W as O(W 3(d−1) ) for computing time and O(W 2(d−1) ) for memory usage in d dimensions. For layered geometries, the rigorous coupled-wave analysis (RCWA) 45,46 and the eigenmode expansion (EME) 47 methods use local eigenmodes to utilize the intralayer axial translational symmetry, which also results in dense matrices and the same scaling as RGF.
These existing methods place the emphasis on solving Maxwell's equations, typically one input at a time, after which the quantities of interest are extracted from the solutions. Doing so is intuitive but leads to repetitions and unnecessary computations. Here we propose an "augmented partial factorization" (APF) method that directly computes the entire generalized scattering matrix of interest, without dividing among the inputs and bypassing the unnecessary solution steps. APF is general (applicable to any structure with any type of inputs and outputs, including to other linear partial differential equations), exact (no approximation beyond discretization), does not loop over the inputs, does not store large LU factors, scales well with the system size, and fully utilizes the sparsities of the Maxwell operator, of the inputs, and also of the outputs. These advantages lead to reduced memory usage and many-orders-of-magnitude speed-up compared to all existing methods (even the ones that specialize in a certain geometry), enabling full-wave simulations of massively multi-channel systems that were impossible in the past. . This dense and large matrix X is the conventional solution of Maxwell's equations, but its full content is rarely needed. The needed quantities are encapsulated in the generalized scattering matrix S, which we can write as Matrix C projects the collective solution X = A −1 B onto the M outputs of interest (e.g., sampling at the locations of interest, a conversion to propagating channels, or a near-to-far-field transformation); it is sparse since the projections only use part of the solution, and it is very fat since the number M of outputs of interest is generally far less than the number of elements Maxwell's equations are discretized onto. Matrix D = CA −1 0 B−S 0 optionally subtracts the baseline contribution, where A 0 is the Maxwell operator of a reference system (e.g., vacuum) for which the generalized scattering matrix S 0 is known. Eq. (2) has the same appearance as scattering matrices in coupled-mode theory, 48,49 but it is exact and parameter-free.
Time-dependent responses are described by a timeresolved scattering matrix, which is the Fourier transform of the frequency-domain scattering matrix. 50,51 Figure 1c-d illustrates Eq. (2) with a concrete example. Consider the transverse-magnetic fields in 2D for a system periodic in y with relative permittivity profile ε r (r) = ε r (x, y). The Maxwell differential operator on the out-of-plane electric field E z (r) at wavelength λ is −∇ 2 − (2π/λ) 2 ε r (r), which becomes matrix A when volume discretized with an outgoing boundary in x direction. Then matrix A −1 is the retarded Green's function G(r, r ) of this system. An incident plane wave e i(k in x x+k in y y) from the left can be generated with a source proportional to δ(x)e ik in y y on the front surface x = 0, and incident waves from the right can be similarly generated; these source profiles become the columns of matrix B when discretized. The coefficients of different outgoing plane waves to the left can be obtained from projections proportional to δ(x)e −ik out y y , and similarly with outgoing waves to the right; they become the rows of matrix C when discretized. In this particular example, Eq. (2) reduces to the discrete form of the Fisher-Lee relation in quantum transport 52 (see Supplementary Secs. 1-2). We only show a few discretized pixels and a few angles in Fig. 1c-d to simplify the schematic; in reality these numbers can readily exceed millions and thousands respectively. Note that matrices A, B, and C are all sparse here.
Instead of solving for X = A −1 B as is conventionally done, we directly compute the generalized scattering matrix S = CA −1 B − D, which is orders-of-magnitude smaller. To do so, we build an augmented sparse matrix K as illustrated in Fig. 1e and then perform a partial factorization, The factorization is partial as it stops after factorizing the upper-left block of K into A = LU. Such partial factorization can be carried out using established sparse linear-solver packages like MUMPS 53 and PARDISO. 54 Notably, we do not use the LU factors, and L and U here do not even need to be triangular. By equating the middle and the right-hand side of Eq. (3) block by block, we see that matrix H, called the Schur complement, 55 satisfies H = D−CA −1 B. Thus, we obtain the generalized scattering matrix via S = −H. In this way, a single factorization yields what conventional methods obtain from M separate simulations. We refer to this approach as "augmented partial factorization" (APF). APF is as general as Eq. (2), applicable to any linear partial differential equation, in any dimension, under any discretization scheme, with any boundary condition, for any type of inputs generated using any scheme (such as equivalent source for arbitrary incident waves like waveguide modes, 29,56 line sources, and point-dipole sources), and for any type of output projections. It allows arbitrary material dispersion. It is a full-wave method as precise as the underlying discretization.
APF avoids a slow loop over the M inputs or a slow evaluation of the dense Green's function. The sparsity patterns of A, B, and C are maintained in K and can all be utilized during the partial factorization. Matrices L and U are not as sparse as A, so their evaluation is slow, and their storage is the memory bottleneck for typical direct methods. Since APF does not compute the solution X, such LU factors are not needed and can be dropped during the factorization. This means that APF is significantly better than conventional direct methods even when only one input (M = 1) is considered.
APF is more efficient than computing selected entries of the Green's function A −1 , 57 which does not utilize the structure of Eq. (2). While advanced algorithms have been developed to exploit the sparsity of the input and the output during forward and backward substitutions 58 or through domain decomposition, 59 they still require an O(M ) substitution stage, with a modest speed-up (a factor of 3 when M is several thousands) and no memory  Fig. 3. Two-photon coherent backscattering from disorder. a, Schematic sketch: maximally entangled photon pairs are reflected from a dynamic disordered medium, and the photon-number correlation function Γ ba = ψ| :n bna : |ψ is measured for pairs of reflected angles. b-c, Normalized Γ ba for a single realization (b) and averaged over 4,000 realizations (c).
usage reduction. APF is simpler yet much more efficient as it completely obviates the forward and backward substitution steps and the need for LU factors. Typically, matrix A contains more nonzero elements than matrices B and C, and we find in such scenarios that the computing time and memory usage of APF scale as O(N 1.3 ) and O(N ) respectively in 2D ( Supplementary  Fig. S1), where N is the number of nonzero elements in matrix K and is almost independent of M . When B and/or C contain more nonzero elements than A, we can compress matrices B and C through a datasparse representation to reduce their numbers of nonzero elements to below that of A. For example, a plane-wave source spans a large area, but a superposition of plane waves can be spatially localized and truncated with negligible error (Supplementary Sec. 5 and Figs. S2-S3).
We implement APF under finite-difference discretization on the Yee grid (Supplementary Secs. 2-3), with outgoing boundaries realized by perfectly matched layer (PML). 60 Pseudocodes are shown in Supplementary Sec. 6. Our code supports all common boundary conditions and allows arbitrary permittivity profiles and arbitrary inputs and outputs (user-specified or automatically built); we have made it open-source. 61 Below, we consider two classes of massively multichannel systems, comparing the computing time, memory usage, and accuracy of APF to widely-used opensource electromagnetic solvers including a conventional finite-difference frequency-domain (FDFD) code MaxwellFDFD using either (1) direct 62 or (2) iterative 63 methods, (3) an RGF code, 64 and (4) an RCWA code S4; 65 see the Methods section for details. Time-domain methods typically require more iterations than their iterative frequency-domain counterpart 35 since they iterate by time stepping; we do not include time-domain methods here since the comparisons below are made on monochromatic responses. All computations are done in serial on identical processors. We consider transversemagnetic fields in 2D, starting with systems small enough for these solvers, then with larger problems that only APF can tackle.

II. LARGE-SCALE DISORDERED SYSTEMS
Disordered systems are difficult to simulate given their large size-to-wavelength ratio, large number of channels, strong scattering, and lack of symmetry. Here we consider one W = 500λ wide and L = 100λ thick, consisting of 30,000 cylindrical scatterers with refractive index of 2.0 in air (Fig. 2a), discretized into over 11 million pixels with a periodic boundary condition in y to mimic an infinitely-wide system. On each of the −x and +x sides, 2W/λ = 1,000 channels (plane waves with different angles) are necessary to specify the propagating components of an incident wavefront or outgoing wavefront, which is the minimal requirement at the Nyquist sampling rate (Supplementary Sec. 1A). So, we compute the scattering matrix with M = 2,000 outputs and up to M = 2,000 inputs (including both sides). It takes APF 3.3 minutes and 10 GiB of memory to compute the full scattering matrix; the other methods take between 3,300 and 110,000,000 minutes using between 7.0 to 1,200 GiB of memory for the same computation ( Fig. 2b-c). The computing times of APF (with breakdown shown in Fig. 2d), RGF, and RCWA are all independent of M , though APF is orders-of-magnitude faster. MaxwellFDFD takes O(M ) time due to its loop over the inputs; reusing the LU factors helps, but the M forward and backward substitutions become the bottleneck when M 20. Note that APF offers substantial advantage even in the single-input (M = 1) case.
The speed and memory advantage of APF further grows with the system size ( Supplementary Fig. S4). Some of these solvers require more computing resources than we have access to, so their usage data (open symbols and gray-edged bars in Fig. 2b-c) are extrapolated based on smaller systems (Fig. S4).
The relative 2 -norm error of APF due to numerical round-off is 10 −12 here and grows slowly with an O(N 1/2 ) scaling ( Supplementary Fig. S5), while the iterative MaxwellFDFD here has a relative 2 error of 10 −6 .
In Ref. [66], it is predicted that entangled photon pairs remain partially correlated even after multiple scattering from a dynamic disordered medium. We demonstrate such "two-photon coherent backscattering" as an example here. Given a maximally entangled input state, the correlation between two photons reflected into directions θ a and θ b is where |ψ is the two-photon wave function,n a is the photon number operator in reflected direction θ a , : (. . . ) : stands for normal ordering, r 2 is the square of the medium's reflection matrix (i.e., the scattering matrix with inputs and outputs on the same side), and . . . indicates ensemble average over disorder realizations. This requires the full reflection matrix with all incident angles and all outgoing angles, for many realizations, and the disordered medium must be wide (for angular resolution) and thick (to reach diffusive transport). Figure 3 shows the two-photon correlation function Γ ba computed with APF before and after averaging over 4,000 disorder realizations, for a system W = 700λ wide and L = 400λ thick consisting of 56,000 cylindrical scatterers with a transport mean free path of t = 9.5λ. We find the correlation between photons reflected toward similar directions (|θ b − θ a | 0.1λ/ t ) to be enhanced by a factor of 2. This finding, described in detail in Ref. [66], is the first evidence of two-photon coherent backscattering in disordered media. These APF simulations use modest resources on a computing cluster but would have taken prohibitively long using any other method.

III. LARGE-AREA METASURFACES
Metalenses are lenses made with metasurfaces. 67,68 When the numerical aperture (NA) is high, metalenses need to generate large phase gradients, so the variation from one unit cell to the next must be large, and the locally periodic approximation (LPA) 6,7 fails. Full-wave simulation remains the gold standard. Here we consider metalenses with height L = 0.6 µm and width W ≈ 1 mm, consisting of 4,178 unit cells of titanium dioxide ridges on a silica substrate (Fig. 4a), for a hyperbolic 69 phase profile with NA = 0.86 and a quadratic 70-72 phase profile with NA = 0.71 operating at wavelength λ = 532 nm; see Supplementary Sec. 9 and Fig. S6 for design and simulation details. PMLs are placed on all sides, and the system is discretized with grid size ∆x = λ/40 into over 11 million pixels. We compute the transmission matrix at the minimal Nyquist sampling rate, with up to M = 2W/λ = 3,761 plane-wave inputs from the left (substrate side) truncated within the width W of the metalens, and sampling the transmitted field across width W out = W + 40λ (to ensure all transmitted light is captured) projected onto M = 2W out /λ = 3,841 propagating plane waves on the right. Due to the large (1 mm)/(0.6 µm) aspect ratio, the number of nonzero elements in matrices B and C is larger than that of A, so we compress B and C and denote this as APF-c (Supplementary Sec. 5). It takes APF-c 1.3 minute and 6.9 GiB of memory to compute this transmission matrix, while the other methods take 6,300 to 6,000,000 minutes using 22 to 600 GiB ( Fig. 4b-c); some numbers are extrapolated from smaller systems ( Supplementary Fig. S7). It is remarkable that even though RCWA is specialized for layered structures like the metasurface here, the general-purpose APF-c still outperforms RCWA by 10,000 times in speed and 87 times in memory. The second-best solver here is MaxwellFDFD with the LU factors stored and reused, which takes 4,800 times longer while using 18 times more memory compared to APF-c.
The transmission matrix fully characterizes the metasurface's response to any input. Here we use it with angular spectrum propagation (Supplementary Sec. 11) to obtain the angle dependence of the exact transmitted profile (two profiles each shown in Fig. 5a-b; more shown in Supplementary Movies S1-S2), the Strehl ratio, and To quantify the accuracy of an approximation, we compute the relative 2 -norm error I − I 0 2 / I 0 2 , with I 0 being the intensity at the focal plane within |y| < W/2 calculated from APF without compression, and I from an approximation. We consider two LPA formalisms, a standard one using the unit cells' propagating fields (LPA I) and one with the unit cells' evanescent fields included (LPA II); see Supplementary Sec. 13. LPA leads to errors ranging between 8% and 366% depending on the incident angle, with the angle-averaged error between 18% and 37% ( Fig. 5e-f). Meanwhile, the compression errors of APF-c here average below 0.01% ( Fig. 5e-f) and can be made arbitrarily small ( Supplementary Fig. S8).

IV. DISCUSSION
The APF method opens the door to a wide range of studies. The correlation functions of multi-photon scattering from disorder are challenging to measure experimentally but can now be computed exactly with APF. With volumetric imaging inside strongly scattering media such as biological tissue, [73][74][75][76] it is difficult to experimentally validate against a ground truth and to separate factors like aberrations and dispersion from scattering, but large-scale simulations enabling such studies are now possible with APF. 77 Inverse design using the ad-joint method used to require 2M simulations given M inputs, 17 limiting the number of inputs that can be considered; with a suitable formulation, APF can consolidate the 2M simulations into a single computation. It may now be possible, for example, to design broad-angle nonlocal metasurfaces that reach fundamental bounds. 78 Predicting the thermal emission from nanostructures 79 requires a large number of frequency-domain simulations with different point-dipole sources, which can be vastly accelerated using APF. Classical 18,19 and quantum 20,21 photonic circuits are currently built from small components that couple very few channels at a time, and APF can enlarge the design space to large multi-channel elements with more functionality while occupying a compact footprint.
Beyond photonics, APF can be used for mapping the angle dependence of radar cross-sections, for microwave imaging, 80 for full waveform inversion 81 and controlledsource electromagnetic surveys 58 in geophysics, and for quantum transport simulations. 82 More generally, APF can efficiently evaluate matrices of the form CA −1 B for other applications of numerical linear algebra, not limited to wave physics or to partial differential equations.
As the number of channels and the LU factor size are both much larger in 3D, the advantage of APF over existing methods will be even more significant in 3D than in 2D. With the low-rank property of the finite-difference matrix A utilized, APF should take O(N 4/3 ) time in 3D following the known factorization cost, 83 which is comparable to the 2D scaling ( Supplementary Fig. S1). APF-c can naturally be used with overlapping-domain distri-bution strategies 10-12 when modeling systems with large aspect ratios. Multi-frontal parallelization can be used through existing packages such as MUMPS, 53 and one may employ hardware accelerations with GPUs. 12,84,85 For systems with a small surface-to-volume ratio, it is also possible to apply APF to BEM or T -matrix method, using the H-matrix technique 41 for fast factorization. There is room to further reduce the computing cost of CA −1 B below that of existing factorization algorithms since the LU factors are not needed and do not need to be triangular. This elimination of LU factors makes APF favorable over iterative methods even for single-input scenarios. research; all authors contributed to designing the systems, discussing the results, and preparing the manuscript. Competing interests: The authors declare no competing interests. Data availability: All data needed to evaluate the conclusions in this study are presented in the paper and in the supplementary materials. Code availability: Source codes are available at Ref. [61].

METHODS
We use the MUMPS package 53 with the default AMD ordering 86 to compute the Schur complement with partial factorization. We order the output channels and/or pad additional channels so that matrix K is symmetric (Supplementary Sec. 3).
We use the same second-order finite-difference discretization scheme (Supplementary Sec. 2), same grid size, and same subpixel smoothing 87 for the APF, MaxwellFDFD, and RGF benchmarks. Discretization error is not important for the disordered media example in Fig. 2 In RGF, 64 the outgoing boundary in the longitudinal direction is implemented exactly through the retarded Green's function of a semi-infinite discrete space. 42 For APF and MaxwellFDFD, one λ of homogeneous space and 10 pixels of PML 60 are used to achieve an outgoing boundary with a sufficiently small discretization-induced reflection. 88 Uniaxial PML is used in APF so that matrix A is symmetric. Stretched-coordinate PML is used in MaxwellFDFD to lower the condition number. 89 For MaxwellFDFD with an iterative solver, 63 we use its default biconjugate gradient method with its default convergence criterion of relative 2 residual below 10 −6 . For MaxwellFDFD with a direct solver, 62 we consider an unmodified version where the LU factors are not reused and a version modified to have the LU factors stored in the memory and reused for the different inputs.
For the RCWA simulations, we use its default closed-form Fourier transform formalism implemented in S4. 65 For the example in Fig. 4, we use a single layer with 5 Fourier components per unit cell where the cell width is 239 nm (i.e., 11 Fourier components per λ), which gives comparable accuracy as APF, MaxwellFDFD, and RGF ( Supplementary Fig. S9). For the example in Fig. 2, we use 15 layers per λ axially (same as the discretization grid size used in the other methods) with 4.1 Fourier components per λ laterally (by scaling it down in proportion with the reduced spatial resolution in APF, MaxwellFDFD, and RGF).
Note that the RGF 64 and S4 65 codes do not support an outgoing boundary in the transverse y direction. The RGF and S4 computing time and memory usage in Fig. 4 are extrapolated from simulations on smaller systems adopting a periodic transverse boundary (Supplementary Fig. S7). To simulate the example in Fig. 4 using RGF or S4, one needs to additionally implement PML in the y direction (as in Refs. [90,91]) and to further increase the system width; doing so will slightly increase their computing time and memory usage, which we do not account for.
All timing and memory usage numbers are obtained from serial computations on identical Intel Xeon Gold 6130 nodes on the USC Center for Advanced Research Computing's Discovery cluster with 191 GiB of memory per node.
We start with time-harmonic electromagnetic waves at frequency ω with e −iωt dependence on time t, governed by the frequency-domain Maxwell's equations, where E(r) = (E x , E y , E z )(r) and H(r) = (H x , H y , H z )(r) are the electric and magnetic fields at position r = (x, y, z), ε r (r) and µ r (r) the scalar relative permittivity and permeability profiles that define the structure, J(r) and ρ(r) the electric current density and charge density satisfying continuity equation ∇ · J(r) = iωρ(r) that act as the sources, and ε 0 and µ 0 the vacuum permittivity and permeability constants. Consider a nonmagnetic (µ r = 1) system where ε r (r), E(r), H(r), J(r), and ρ(r) are independent of z. Then, the transverse-magnetic (TM) field components H x , H y , and E z are decoupled from the transverse-electric (TE) components E x , E y , and H z . Inserting the H(r) in Eq. (S1a) into Eq. (S1b) and taking the z component, we obtain the governing equation for the TM waves, where ∇ 2 xy = ∂ 2 /∂x 2 + ∂ 2 /∂y 2 , and c = 1/ √ ε 0 µ 0 is the vacuum speed of light. Once E z is solved for, the magneticfield components of the TM wave follow from Eq. (S1a) as (H x , H y ) = 1 iωµ0 ∂Ez ∂y , − ∂Ez ∂x . Eqs. (S1c)-(S1d) are automatically satisfied. The vacuum wavelength is λ = 2πc/ω.
For the scattering problems considered below, there is no physical current source, so the right-hand side J z is zero. We will put effective sources on the right-hand side of Eq. (S2) to generate the incident field and to solve for the resulting scattered field, but the total field E z still satisfies Eq. (S2) with J z = 0; see Sec. 1 C.
Consider a structure where ε r (x, y) is periodic in y with periodicity W , namely ε r (x, y + W ) = ε r (x, y). Given the periodicity, any solution E z (x, y) of Eq. (S2) can be written as a superposition of Bloch states with different Bloch wave numbers k B , with each Bloch state satisfying E z (x, y + W ) = E z (x, y)e ikBW . Bloch states with distinct k B (up to modulo 2π/W ) are decoupled from each other. Therefore, we can fix k B and consider only a finite transverse size within 0 ≤ y ≤ W , with boundary condition E z (x, W ) = E z (x, 0)e ikBW at y = 0 and y = W ; this is called Bloch periodic boundary condition, which reduces to a periodic boundary when k B = 0.
We further consider the structure to be homogeneous on the left and right sides, Light incident from either side is scattered by the inhomogeneous structure within 0 < x < L, which we refer to as the scattering region. The scattering matrix fully characterizes such response. To define the scattering matrix, we first define the input and output "channels" as follows.

A. Channels in homogeneous space
Consider a homogeneous region (either x ≤ 0 or x ≥ L), within which ε r (x, y) = ε bg is a real-valued constant (either ε L or ε R ). Given the translational symmetry in x, the field profile within the homogeneous region can be written as a superposition of propagating and evanescent fields of the form which we refer to as "channels." The integer index a labels the channel number, ± indicates the propagation direction, x is the longitudinal wave number, the prefactor 1/ k (a) x normalizes the longitudinal flux (to be discussed in the next paragraph), and u a (y) is the transverse mode profile. The a-th transverse profile is with the transverse wave number k (a) y = k B + a 2π W chosen to satisfy the Bloch periodic boundary condition; note that neighboring k (a) y are separated by 2π/W . Here, y 0 is an arbitrary real constant specifying the reference position of the basis. The set {u a (y)} a of transverse modes makes up a complete and orthonormal basis, with There is an infinite number of evanescent channels where k x ). Therefore, 2 √ ε bg W/λ complex-valued coefficients are necessary to fully specify the propagating components of an arbitrary incident wavefront or outgoing wavefront; this is consistent with the Nyquist-Shannon sampling theorem [1] that when the highest spatial frequency (i.e., wave number k y ) in a wavefront is (ω/c) √ ε bg , we need at least one spatial sampling point per λ/(2 √ ε bg ) to uniquely specify such a wavefront.

B. Scattering matrix
We now define the scattering matrix S. In a scattering problem, the total field E z satisfies Eq. (S2) with no source and can be written as Here, E in z is the incident field, and E sca z is the scattered field that satisfies an outgoing boundary condition at infinity (i.e., when |x| → ∞). To be concrete, we consider light incident from the left at angle θ a , with for some constant E 0 . Outside of the scattering region, the total field E z can be written as a superposition of the propagating and evanescent channels consistent with the outgoing boundary condition, as ba are the reflection and transmission coefficients with input in channel a from the left and output in channel b on the left or right, normalized by the longitudinal flux, with reference planes on x = 0 and x = L respectively. In the absence of absorption or gain (i.e., when ε r (x, y) is real-valued everywhere), flux conservation requires that b |r ba | 2 + b |t ba | 2 = 1 for any propagating channel a. Meanwhile,r ba andt ba characterize the evanescent response in the near field.
While it is possible to include the evanescent response in the scattering matrix [2], in most scenarios only the propagating ones are of interest. We define reflection matrix r L and transmission matrix t L with incidence from the left by their matrix elements r ba ; together they form the scattering matrix S L = [r L ; t L ] to include output to both sides. In general, we can consider incident light from either left or right, with the full scattering matrix being This full scattering matrix has size (N L + N R )-by-(N L + N R ) and is unitary in the absence of absorption or gain.

C. Fisher-Lee relation
Here we introduce line sources to the right-hand side of Eq. (S2) to solve the scattering problem. We first define the retarded Green's function G(x, y; x , y ) of the scattering medium, which is the solution of Eq. (S2) with a point source at (x , y ), The infinitesimal absorption η imposes an outgoing boundary condition. We can then use the retarded Green's function to express the total field E z arising from the incident field in Eq. (S6), as which is known as the Fisher-Lee relation, initially derived for the single-particle Schrödinger equation in the context of quantum transport [3][4][5]. Eq. (S11) can be understood intuitively: for each incident angle θ a , −2i k b (y) are the output projections for reflection and transmission coefficients into different outgoing angles θ b , and the δ ba in Eq. (S11a) is the projection of the incident field E in z on the incident side. Eq. (S11) is a special case of Eq. (2) in the main text, written for the specific geometry considered here prior to discretization.

FINITE-DIFFERENCE DISCRETIZATION
We need to discretize the system in order to compute its scattering matrix numerically. Here we consider finitedifference discretization on the Yee grid [6]. On the Yee grid, the first derivatives are approximated with center difference with second-order accuracy, and the grid points of different field components are staggered in space. For 2D TM waves, we put the discretized E z component at with (n, m) being integer indices and ∆x the discretization grid size. The H x component is located at (x n , y m+1/2 ), and the H y component is located at (x n+1/2 , y m ). This gives the discretized second derivative of E z in x as and similarly for ∂ 2 E z /∂y 2 . We discretize the relative permittivity profile ε r (x, y) through subpixel smoothing [7]; for TM waves, this corresponds simply to averaging ε r (x, y) within the ∆x 2 area of each cell centered at (x n , y m ), as Note that different from the notation of Ref. [6], in Eq. (S13) we introduced the half-pixel offset so that the first pixel (n, m) = (1, 1) of E z will have its lower corner at (x, y) = (0, 0); this is more convenient when we only deal with E z . Then, the differential operator −∇ 2 xy − (ω/c) 2 ε r (x, y) of Eq. (S2) is discretized into A (n,m),(n ,m ) /∆x 2 with A (n,m),(n ,m ) = −(δ n+1,n + δ n−1,n )δ m,m + (4 − β 2 ε n,m )δ n,n δ m,m − (δ m+1,m + δ m−1,m )δ n,n (S16) away from the boundaries, with β ≡ (ω/c)∆x. This is the finite-difference frequency-domain (FDFD) formulation. In practice, the system size is made finite, and a single index is used to go through both indices (n, m). With that single index, E z(n,m) becomes an N × 1 column vector E z , and A (n,m),(n ,m ) becomes an N × N square matrix A. Then, Eq. (S2) with no source becomes a finite-sized matrix-vector multiplication, (S17) Now we consider the two-sided geometry in Sec. 1. We restrict the integer index m to 1 ≤ m ≤ n y with n y ≡ W/∆x being the transverse number of grid points (discretizing the region 0 ≤ y ≤ W ); ∆x can be chosen such that n y is an integer. The Bloch periodic boundary condition E z,(n,ny+1) = E z,(n,1) e ikBny∆x and E z,(n,0) = E z,(n,ny) e −ikBny∆x manifests in the boundary elements of matrix A. The system size in x is infinite, but we must truncate it to a finite size with an effective open boundary with no reflection at the interface of truncation; we do so with the perfectly matched layer (PML) [8], which attenuates the outgoing waves with minimal reflection. In this way, the outgoing boundary condition is also built into matrix A. Also, matrix A and vector E z both have finite sizes.
The retarded Green's function G(x, y; x , y ) defined in Eq. (S9) is the inverse of the differential operator. When discretized, it becomes a matrix and is simply given by the matrix inverse (S18) The outgoing boundary condition is automatic since it is already built into matrix A. Under subpixel smoothing, Eq. (S3) becomes where n x ≡ L/∆x is the number of grid points for the scattering region. The rest follows the same steps as in Sec. 1, which we summarize below.

A. Channels in discrete homogeneous space
Given the discrete translational symmetry in n when n ≤ 0 and n ≥ n x + 1, Eq. (S4) still holds in the form of with the longitudinal-flux normalization factor ν a to be determined. The transverse modes of Eq. (S5) become which make up matrix u, with y 0 = (m 0 −1/2)∆x. The transverse wave number is still k (a) y = k B +a 2π ny∆x , but channel a is now equivalent to channel a + n y due to aliasing, so there are only n y distinct channels in total (propagating ones plus evanescent ones) instead of an infinite number of them. Completeness and orthonormality of the transverse modes means that the n y × n y matrix u is unitary when all n y channels are included. Inserting Eqs. (S20)-(S21) into Eq. (S17) yields the finite-difference dispersion relation Flux orthogonality continues to hold in the discretized system. To preserve flux conservation (i.e., Poynting's theorem), the Poynting vector in the discrete system should be defined using the product of an E field component and an H field component at ∆x/2 away [9]. For 2D TM felds, this means that the x-directional flux is proportional to Im E * z(n,m) E z(n+1,m) . For each channel in Eq. (S20), such flux is proportional to ±1 |νa| sin k (a) x ∆x for propagating channels (where k is imaginary or complex-valued). Therefore, we choose the the flux normalization factor in Eq. (S20) to be ν a = sin k (a) x ∆x (S23) so that all propagating channels carry equal longitudinal flux.

B. Finite-difference Fisher-Lee relation
The definition of the scattering matrix is the same as the continuous case in Eq. (S7), simply with x and y replaced through Eq. (S13).
The derivation of the scattering matrix in terms of the Green's function (namely, the Fisher-Lee relation) is the same as the continuous case except for one difference: the sources and the projections were placed at x = 0 and x = L in the continuous case, but the corresponding spatial indices n = (x/∆x) + (1/2) would lie on n = 1/2 and n = (L/∆x) + (1/2), which are generally not integer points. Therefore, for the discretized system, we put the sources and the projections at integer indices n = 0 and n = n x + 1 (recall that n x ≡ L/∆x ). This gives the discrete version of Eq. (S11) as which is unitary in the absence of absorption or gain if all propagating channels are included. A schematic illustration of Eq. (S25) is given in Fig. 1c of the main text. Matrix A is the discrete version of the differential operator −∇ 2 xy − (ω/c) 2 ε r (x, y) in Eq. (S2), given by Eq. (S16), PML, and the boundary conditions. The input matrix B is The top (bottom) block row of zeros correspond to indices n < 0 (n > n x + 1) with PML and homogeneous space on the left (right); they are shown in green in Fig. 1c-d of the main text. The second (fourth) block row corresponds to index n = 0 (n = n x + 1), which is the left (right) surface where input sources are placed; they are shown in red in Fig. 1c-d. The third block row corresponds to the scattering region with indices 1 ≤ n ≤ n x , shown in blue in Fig. 1c-d. Matrices B L and B R have sizes n y × M L and n y × M R ; they are line sources on the surface, given by are M L × M L diagonal matrices for flux normalization and phase shift respectively, and similarly with B R . Similarly, the output matrix C performs the projection onto the output channels, with where † stands for conjugate transpose. Note that the list of the M = M L + M R output channels do not have to be the same as the list of the M = M L + M R input channels, but for simplicity we do not introduce a separate notation. Matrix D here is the δ ba in the Fisher-Lee relation, with elements that equal δ when the input channel is the same as the output channel, 0 otherwise.
For the numerical computations and implementation, it is faster and simpler to take the prefactor −2i in Eq. (S27) and the √ ν L/R δ (L/R) prefactors in Eq. (S27) and Eq. (S29) out of B L/R and C L/R , and multiply such prefactors after CA −1 B is computed. This means we can use instead for the numerical computations. We omit D from matrix K [as in Eq. (S31) below] and subtract D from CA −1 B after the prefactors are put back. While the details are system dependent, the concepts above are general, and scattering matrices can always be written in the form of Eq. (S25) regardless of the discretization scheme, the geometry, the type of sources, and the type of outputs of interest. In general, D = CA −1 0 B − S 0 , where A 0 is the Maxwell operator of a reference system (e.g., a homogeneous one) for which the scattering matrix S 0 is known; in the specific case above, CA −1 0 B = I + S 0 where the identity matrix I comes from projection of E in z on the incident side (assuming the full scattering matrix and ignoring the phase-shift factors to simplify notation).

UTILIZING SYMMETRY IN AUGMENTED PARTIAL FACTORIZATION (APF)
In APF, a partial factorization is performed on the augmented sparse matrix For most linear solvers such as the MUMPS package [10] we use, the computing time and memory usage of such partial factorization can be reduced when matrix K is symmetric. Thanks to reciprocity, the bulk of matrix A as in Eq. (S16) is symmetric; periodic boundary condition and the use of uniaxial PML does not break such symmetry. Therefore, matrix K will be symmetric if we can make C = B T , or equivalently C L = B T L and C R = B T R . From Eq. (S30), we see that when the list of input channels equals the list of output channels, we would have the desired C L = B T L if the transverse mode profiles were real-valued. The transverse mode profiles in Eq. (S21) are not real-valued, but we can see that taking the complex conjugate of the profile is equivalent to flipping the sign of k (a) y ; with a periodic boundary condition in y (where k B = 0), this corresponds to flipping the sign of the channel index a. Therefore, we can achieve C L = B T L simply by making the list of output channels having the opposite channel index as the list of input channels. When the full scattering matrix or reflection matrix is computed, all we need to do is to choose a particular ordering of the channels (which can be reversed after CA −1 B is computed). When only a subset of the scattering matrix is needed, we can further pad input and/or output channels to achieve C L = B T L to make matrix K symmetric. Note this is applicable to any structure and does not require any symmetry in ε r (x, y).

COMPUTING TIME AND MEMORY SCALING OF APF
We implement APF under finite-difference discretization for 2D TM fields as described above, and here we map out its computing time and memory usage scaling as a function of the system size. We consider the same disordered system as in Fig. 2  Here, nnz(K) is dominated by nnz(A), with nnz(A)/nnz(B) in between 1 and 100 among these systems. Therefore, nnz(K) ≈ nnz(A), which is proportional to the number of pixels in the discretization and independent of the number M of input channels.

APF WITH COMPRESSED INPUT/OUTPUT MATRICES (APF-C)
When nnz(B) and/or nnz(C) exceeds nnz(A), the computing time and memory usage of APF can grow with the number of inputs M , which is not desirable. But there is a simple solution: we can "compress" matrices B and C to reduce their number of nonzero elements, perform the partial factorization, and then "decompress." Conceptually, this is similar to other forms of data compression. We call this APF-c with c standing for compression.
In this section, we consider a simple compression scheme based on the Fourier transform. This Fourier scheme is more than sufficient for the metasurface examples considered in this paper; note when the accuracy is sufficient and when nnz(B) and nnz(C) goes below nnz(A), further compression is no longer necessary. More advanced compression strategies [11] may be used if there is such need. Figure S2a illustrates the expression of the scattering matrix S in Eq. (S25), highlighting the nonzero blocks B L , B R of the input matrix B and the nonzero blocks C L , C R of the output matrix C. These nonzero blocks are placed on the front (x = 0) and back (x = L) surfaces of the scattering region, as indicated in red in Fig. 1d of the main text. Even though matrices B and C are nonzero only on these two surfaces, these nonzero blocks are dense and can still contain many elements when the system is wide. Each column of B L is the cross section of a plane wave given in Eq. (S21), which covers the full range of y as shown in Fig. S3a.
While a single plane wave is extended in space, a superposition of plane waves can form a sharp focus, with the focus location determined by the relative phases of the constituting plane waves. And such a conversion between angular basis (plane waves) and spatial basis (focused waves) is invertible. With this idea in mind, we will take a discrete Fourier transform of the dense block B L along its channel index a, and similarly with B R , C L , and C R . For Fig. S2. Concept of APF with compression (APF-c). a, Schematic illustration of Eq. (S25) that highlights the dense blocks BL, BR and CL, CR of the input and output matrices. b, With APF-c, the input and output matrices are transformed such thatBL,BR,CL,CR are spatially localized (see Fig. S3) and can be truncated with minimal or no loss of accuracy. The transformations are reversed afterCA −1B is computed.
concreteness, below we consider having M L input channels from the left with consecutive channel indices a = − ML−1 2 , · · · , 0, · · · , ML−1 2 , under periodic boundary condition with m 0 = 0. We define an M L ×M L discrete Fourier transform (DFT) matrix F ML by its elements, The inverse DFT matrix is then F −1 ML = F † ML . Note that the indices b and a here are centered around zero; the DFT matrix is commonly defined with indices starting at zero instead, which is equivalent to the definition here after an index shift.
When taking a superposition of plane waves to form a focus, the weight of the constituting plane waves can determine how fast the intensity decays away from the focal spot. Therefore, we also introduce a diagonal matrix Q ML , defined as where q ML a a are nonzero real numbers (the weights). Then, matrix B L can be written as Similarly, The transformed input and output matricesB andC can be defined just like Eq. (S26) and Eq. (S28). Instead of computing CA −1 B, we can computeCA −1B with APF using the transformed input/output matrices, after which we undo the transformations using E L , E R , E L , E R . The procedure is illustrated in Fig. S2b. So far, we have not introduced compression yet; the transformed matrices take up the same sizes as the original matrices, and the transformations can be reversed with no approximation.
While the columns of the original matrix B L are spatially extended, those of the transformed matrixB L can be made spatially localized. Without the weights (i.e. Q ML = I), the matrix elements ofB L can be analytically derived from Eq. (S21) and Eq. (S32) as which is the discrete form of the sinc function coming from Fourier transforming a rectangular window (i.e., |k y | < ω/c) in momentum space. The a-th column ofB L in Eq. (S36) is peaked around a center point y Given the spatial localization ofB L , we can specify a truncation window width w t and set the matrix elements ofB L with |y − y (a) c | > w t /2 to zero to make matrixB L sparse-this is the compression step. Such truncation significantly reduces nnz(B L ), by a factor of w t /W . There is no need to build the full matrix u L ; we only need to build the elements ofB L within the truncation window w t using Eq. (S36). There is no need to build the DFT matrix F ML either, since the transformations can be reversed efficiently with fast Fourier transforms [12].
The truncation does introduce small errors since the elements dropped were not exactly zero before. Fig. S3c plots one column ofB L in log scale, which makes the nonzero nature of the 1/ y − y (a) c tail more obvious. To reduce the compression error, we can increase the window size w t and/or make the columns ofB L decay faster. The relatively slow 1/|y − y (a) c | decay in real space comes from the sharp edges of |k y | < ω/c in momentum space. So, we can makẽ B L more localized using weights that smoothen the sharp edges. Here, we use the Hann window function [13]: As Eq. (S37) is a superposition of exponential functions, the matrix elements ofB L = u L Q ML F ML can be readily derived as where each term is given by Eq. (S36). Fig. S3d-e illustrate this process. With the Hann window, the columns ofB L now decays significantly faster as 1/|y − y (a) c | 3 , shown in Fig. S3f. Additionally, we can also reduce the compression error by padding extra channels. The sin (πmM L /n y ) term in the numerator of Eq. (S36) has a width of (n y /M L )∆x in y. Therefore, we can pad extra channels to increase M L , which reduces the width of the dominant peak, makingB L more localized and reducing compression error for the same window size w t . Note that the extra channels we pad do not need to be propagating channels; evanescent ones works equally well since only the transverse profile is involved. The padded channels can be removed after computing CA −1B . In the limit where the padded M L reaches n y , each column ofB L will be zero everywhere except at the pixel of m = m (a) c , so the compression error completely vanishes even when the truncation window w t is a single-pixel (∆x) wide; this is equivalent to computing the scattering matrix in an overcomplete spatial basis.
In Sec. 14, we will give a detailed analysis of the compression error for the mm-wide metasurface examples considered in the main text; by choosing the truncation window size w t and the number of channels to pad, we can make the compression error arbitrarily small.
Lastly, we note thatB L in Eq. (S36) and Eq. (S38) is already real-valued, soC L =B T L whenever M L = M L . The channel-index flipping described in Sec. 3 no longer necessary for making matrix K symmetric.

APF PSEUDOCODE
The pseudocodes of APF and APF-c are shown below, which is also the structure of our implementation made opensource at [14]. One can specify an arbitrary system contained in input argument syst (including the permittivity profile, wavelength, discretization grid size, boundary conditions, and PML parameters), arbitrary lists of source profiles given by matrix B, and arbitrary output projections given by matrix C. The algorithm returns the scattering matrix S.  Figure S4 shows how the computing time and memory usage scale with the system size using different methods, for the disordered system in Fig. 2 of the main text with the aspect ratio W/L = 5 fixed while the overall system size (W and L) varies. Some methods require more computing resources than we have access to, so their data points (open symbols) are extrapolated from smaller number of input angles M and/or smaller systems. We note that among all of these methods, APF exhibits the best scaling both in terms of computing time and in terms of memory usage, and is many orders of magnitude faster for large systems.

ROUND-OFF ERROR OF APF
While APF is in theory exact aside from discretization error (and compression error if APF-c is used), numerical round off is also present, and such round-off errors often grow with the system size. Therefore, it is necessary to check if the round-off error of APF is within the acceptable range. To characterize the round-off error, we compare results from APF to those obtained from a standard direct solver but with additional iterative refinement [15] steps that iterate until the entire solution (with all input channels) reaches machine-precision accuracy. Double precision is used through out.
From that, we evaluate the relative 2 -norm error, S APF − S 0 2 / S 0 2 , for the systems in Sec. 7, where S 0 is the scattering matrix (reshaped into a vector) computed with iterative refinement, and S APF is that from APF. The difference between the two is the round-off error of APF. The relative round-off error with respect to N = nnz(K) scales as O(N 0.5 ) (Fig. S5); it is only 10 −12 for the system considered in Fig. 2 of the main text where N ≈ 10 8 , and is estimated to be only 10 −10 even for an extremely large system with a trillion matrix elements. We therefore conclude that the round-off error of APF is negligible even for the largest system one would possibly simulate.

METALENS DESIGN AND SIMULATION
We start by describing how the hyperbolic and quadratic metalenses in Figs. 4-5 of the main text are designed. The metalens operates at wavelength λ = 532 nm and consists of 4,178 unit cells. Each unit cell (i.e, a meta-atom) has a titanium dioxide (TiO 2 ) ridge (refractive index n = 2.43) sitting on a silica substrate (n = 1.46) in air (n = 1), as shown in Fig. S6a. The ridge height in x is fixed at L = 600 nm, and the width Λ of an unit cell is fixed at 239.4 nm. We consider ridge widths between 45 nm and 200 nm. We use a grid size of ∆x = λ/40 for the finite-difference discretization, which ensures that the transmission phase shifts are accurate to within 0.1 radian (see Sec. 15). We then map out the phase and amplitude of the zeroth-order (i.e., a = b = 0) transmission coefficient of the unit cell with Bloch periodic boundary condition in y, for different ridge widths and different incident angles, as shown in Fig. S6b-c. From these results, we pick eight ridge widths as indicated by the green arrows in Fig. S6b-c and summarized in Table S1, which provide eight equally-spaced transmission phase shifts covering 0 to 2π at normal incidence.
For a hyperbolic metalens, the transmission phase shift at normal incidence should be space-dependent with a hyperbolic profile [16] Table S1. Ridge widths used for the metalens design. Eight ridge widths and the corresponding relative transmission phase shifts at normal incidence.
achieve diffraction-limited focusing at normal incidence but comes with off-axis aberrations at oblique incidence [17]. For a quadratic metalens, the transmission phase shift should follow a quadratic profile [18] Φ quadratic (y) = − 2π λ We construct the metalenses from the eight unit cells in Table S1, with the ridge width of each unit cell chosen based on the desired normal-incidence transmission phase shift at the center of that unit cell. The metalenses consists of 4,178 unit cells, with an overall width of W = 1000.2 µm. For the hyperbolic metalens, we use a focal length of f = 300 µm (corresponding to numerical aperture NA = 0.86). For quadratic metalenses, there is a maximal effective numerical aperture of NA eff = n t / √ 2 where n t is the refractive index of the medium on the transmitted side [19], so we use a larger focal length of f = 500 µm (corresponding to NA = 0.71).
Such design, although standard, is quite simplistic as it only considers normal incidence and assumes that the unit-cell simulations (which are carried out for fully periodic systems) are sufficient for capturing the response of the aperiodic metalens. The purpose of this design here is not to realize a high-performance metalens, but simply to provide a test system to benchmark the APF method.
After building ε r (x, y) of the mm-wide metalens, we carry out full-wave simulations with APF-c to compute its transmission matrix. The simulation domain is schematically illustrated in Fig. 4a of the main text, with PML on all four sides to describe a fully open system; also see the Methods section of the main text. The pre-compression inputs are line sources in the silica substrate immediately behind the TiO 2 ridges, which generate incident plane waves truncated with a rectangular window over the width W of the metalens; this models the effect of having an aperture that blocks incident light beyond width W in = W . We use the set of propagating input channels for a periodic boundary in y with width W in , and restrict to the 2W in /λ = 3,761 incident angles with |θ substrate in | ≤ asin(1/n substrate ) = 43 • (namely, |θ in | ≤ 90 • in air); as described in Sec. 1 A, this is the minimal number of channels we need to specify wavefronts incident from air prior to entering the substrate. For the output projections, we take the total field E z (x = L, y) immediately after the TiO 2 ridges across a width W out = W + 40λ in y that is sufficiently large to contain all of the transmitted light, and project it onto the 2W out /λ = 3,841 propagating plane waves in air with |θ out | ≤ 90 • ; this is the minimal number of output projections we need to specify the propagating components of the transmitted light. The input and output matrices are compressed for the computations, following the steps in Sec. 5.

SYSTEM-SIZE SCALING FOR METASURFACE SIMULATIONS WITH RCWA AND RGF
As RCWA and RGF work with dense matrices, they do not scale well with the system size, and simulating the mm-wide metasurface requires more computing resources than we have access to. Therefore, we extrapolate from smaller systems. We fix the thickness of the metasurface, and run RCWA and RGF simulations with increasing metasurface width W ; the computing time and memory usage are shown as blue circles in Fig. S7. For both methods, the computing time scales as O(W 3 ), which is the time scaling to invert and to multiply size-O(W ) dense matrices; we fit the data with O(W 3 ) curves (green dashed lines) and extrapolate to a mm-wide metasurface (red circle). Similarly, the memory usages scale as O(W 2 ), which we use for extrapolation.
As described in the Methods section of the main text, the RCWA and RGF simulations here adopt a periodic boundary condition in y with width W (not W out ), without free space and without PML. So, the numbers here are a lower bound of the would-be numbers when an open boundary in y is implemented.

ANGULAR SPECTRUM PROPAGATION
We use angular spectrum propagation (ASP) to obtain field profile in the free space after the metalens. We express E z (x, y) in terms of its Fourier componentsẼ z (x, k y ) as Plugging Eq. (S41) into the source-free Eq. (S2) with x ≥ L [where ε r (x, y) = 1] gives ∂ 2 ∂x 2Ẽz (x, k y ) = −k 2 x with k x = (ω/c) 2 − k 2 y . As there is no light incident from the right, light must propagate or decay to the right, sõ  Therefore, given the total field E z (x = L, y) immediately after the metasurface, we can take its Fourier transform to obtainẼ z (x = L, k y ), propagate it forward with Eq. (S42), and obtain the field profile anywhere in the free space after the metalens with Eq. (S41). This method is called "angular spectrum propagation" (ASP) [20]. The evanescent components (for which |k y | > ω/c) decay exponentially. Since we are not interested in the near-field close to the metasurface, we can ignore the evanescent components, and replace the ∞ −∞ dk y integration range with ω/c −ω/c dk y . Therefore, to perform ASP, we only need the propagating Fourier components of E z (x = L, y), which are precisely what's contained in the transmission matrix we computed with APF-c.
There is one subtlety. In practice, the continuous integration over k y must be approximated with a discrete summation. The transmission matrix elements have the transverse wave number k y of the transmitted light evenly spaced by 2π/W out as described below Eq. (S5). If we perform ASP and the approximated integration directly with such 2π/W out momentum spacing, it will introduce an artificial periodic boundary with periodicity W out ; light that reaches the boundary will wrap around and reenter from the other side instead of leaving the domain of interest. Therefore, we need a transverse momentum spacing finer than 2π/W out . To achieve so, we first evaluate E z (x = L, y) using Eq. (S7), restricting the summation over b to the propagating components contained in the computed transmission matrix. Then, we evaluate its Fourier components on a finer grid of transverse momentum k y spaced by 2π/W ASP . Here, we use W ASP ≈ 2W , which is sufficient for eliminating the periodic wrapping artifacts within the domain of interest (x f , |y| < W/2). The integration range ∞ −∞ dy in Eq. (S43) can be truncated to the width W out where we perform the output projections, since the field beyond such window is negligibly small. A high resolution of ∆x = λ/40 is not necessary for this spatial integration since the refractive index is lower (n = 1 in air) and since only the propagating components (which vary slowly in y) are of interest, so we use a coarser resolution of ∆x = λ/8 for the integration in Eq. (S43). The evaluations of Eq. (S7), Eq. (S43), and then Eq. (S41) can all be done efficiently with fast Fourier transforms [12].
ASP is an exact method; when the continuous integrals are not replaced by discrete summations, ASP is mathematically equivalent to the Rayleigh-Sommerfeld diffraction integral [21] used in Ref. [22] where it was referred to as the equivalent-current formulation.

METALENS TRANSMISSION EFFICIENCY AND STREHL RATIO
The transmission efficiency and the Strehl ratio are important metrics for assessing the performance of a metalens. Here we evaluate the incident-angle dependence of these quantities using the transmission matrix computed with APF-c. We define the transmission efficiency as the total transmitted flux in x direction divided by the total incident flux in x direction. Since our definition of the transmission matrix [as in Eq. (S7)] is already flux-normalized, the transmission efficiency of a truncated incident plane wave with angle θ (a) in is simply The Strehl ratio is defined as the actual intensity at the focal spot |E z (x = L + f, y = f f | 2 divided by the would-be intensity |E (ideal) z (r f )| 2 at the focus if perfect diffraction-limited focusing were achieved with the given transmission efficiency. Since the focus location y f depends on the incident angle and is not clearly defined at large angles where aberrations are significant, we evaluate |E z (x = L + f, y = f f | 2 as max y |E z (x = L + f, y| 2 . With the transmission efficiency adjusted, the Strehl ratio is therefore SR(θ Oftentimes, only the zeroth-order (i.e., a = b = 0) transmission coefficient of the unit cell is considered. Therefore, LPA I uses z (x = L, y), we then use the same angular spectrum propagation procedure to obtain the approximate E 14. APF-C COMPRESSION ERROR As described in Sec. 5, the APF-c compression error can be reduced to arbitrarily small by increasing the truncation window w t and/or by padding M pad additional channels. Here, we characterize the APF-c compression error of the metalens systems. As described in the main text, we compute the relative 2 -norm error I − I 0 2 / I 0 2 , with I 0 being a vector containing the intensity |E z (x = L + f, y)| 2 at the focal plane within |y| < W/2 calculated from APF without compression, and I from APF-c. Fig. S8 plots the error as a function of the incident angle for varying w t and M pad . The Hann window is used, and the same w t and M pad are used on both the left (incident) and the right (transmitted) sides.

DISCRETIZATION ERROR
The grid size ∆x in the finite-difference discretization and the number of Fourier components in RCWA are chosen to ensure sufficient accuracy, which we describe here.
For metalenses, the most important property is the transmission phase shift. Figure S9a shows the discretization error of the zeroth-order transmission phase shift, Φ ∆x=λ/40 FDFD − Φ ∆x=λ/240 FDFD , of the unit cells described in Sec. 9. We see that the discretization error is negligible away from the resonances, and the error at the resonances arises because the angle at which the resonance exists is highly sensitive on the structure. Here, the wrapped |Φ ∆x=λ/40 FDFD − Φ ∆x=λ/240 FDFD | averaged over angles and ridge widths is 0.11 radian. So, we use ∆x = λ/40 for the metasurface simulations using APF and using MaxwellFDFD.

CAPTIONS FOR SUPPLEMENTARY MOVIES
Movie S1. Intensity profile of light transmitted through the mm-wide hyperbolic metalens as the incident angle varies. The profiles are normalized such that the incident flux is the same for all incident angles, and the colorbar is saturated near normal incidence in order to show the profiles at oblique incidence. The Strehl ratio and transmission efficiency is also shown.