Circuit complexity of quantum access models for encoding classical data

Classical data encoding is usually treated as a black box in oracle-based quantum algorithms. However, the construction of these oracles is crucial for practical algorithm implementations. Here, we open the black boxes of data encoding and study the Clifford$+T$ complexity of constructing some typical quantum access models. For general matrices, we show that both sparse-access input models and block-encoding require circuit complexities that are nearly linear in the matrix dimension, even if the matrices are sparse. We also give construction protocols achieving near-optimal gate complexities. On the other hand, the construction becomes efficient with respect to the number of data qubits when the matrix is a linear combination of polynomially many efficiently implementable unitaries. As a typical example, we propose an improved block-encoding for the case where these unitaries are Pauli strings. Our protocols are built upon improved quantum state preparation and a select oracle for Pauli strings, which hold independent value. Our access model constructions offer considerable flexibility, allowing a tunable number of ancillary qubits and offering the corresponding space-time trade-offs.


I. INTRODUCTION
Quantum computing offers speedups over its classical counterpart in various tasks, including factoring, searching, and simulation [1]. However, in many cases the speedups rely on the existence of efficient oracles or access models that encode the relevant classical data [2]. In this context, a function $f(x)$ representing the classical data of interest is encoded using a unitary operation $O_f$, which acts as an oracle in the computation. To study quantum advantages, the number of queries to $O_f$ in a quantum algorithm is compared to the number of queries to $f(x)$ in classical algorithms. Quantum computing provides a substantial reduction in query complexity for many problems of practical importance [3][4][5].
There are various access models to encode classical data. One commonly used access model is the sparse-access input model (SAIM) [4][5][6][7][8][9][10][11][12], which encodes general sparse matrices and outputs the value or position of the nonzero elements when provided with appropriate inputs. SAIM was initially introduced for Hamiltonian simulation and discrete quantum walks [4,6,7,8], and has since found broad applications in other fields such as machine learning [5,9,11] and classical oscillator simulations [12]. For example, the quantum linear system problem can be solved with $\tilde{O}(\kappa)$ queries to SAIM [5,9,10], where $\kappa$ represents the condition number of the matrix to be inverted.
Another important access model is block-encoding, which serves as a crucial subroutine for quantum signal processing [13,14] and its generalization, the quantum singular-value transformation (QSVT) [10,15]. Block-encoding enables Hamiltonian simulation with an optimal query complexity [13,14]. Furthermore, many seminal quantum algorithms, including Grover's algorithm, the quantum Fourier transform, and the HHL algorithm, can be viewed as special cases of QSVT, where the problem of interest is encoded using block-encoding [15].
* phyxmz@gmail.com
Many existing works treat access models as black boxes for convenience. However, the actual circuit complexity of an algorithm also depends on the cost of each query to these access models. Despite its importance, this problem has attracted attention only very recently, and many basic questions remain open. In particular, Ref. [16] presents a nearly time-optimal protocol for the block-encoding of general dense matrices of dimension $2^n \times 2^n$. A circuit depth of $\tilde{O}(n)$ can be achieved at the expense of exponentially many ancillary qubits. Ref. [17] examines matrices in which each data value appears multiple times, and considers examples including checkerboard matrices and tridiagonal matrices with polynomial circuit complexities. However, the cost of block-encoding more general matrices remains unexplored. Moreover, it is still unclear whether there is a fundamental limit to the resources required by data encoding.
In this work, we provide a framework for constructing quantum access models in the fault-tolerant setting using Clifford+$T$ gates. The protocol works for general classical data and takes the underlying structure of the data, such as sparsity and linear combination of unitaries (LCU), into consideration. Our results provide a direct mapping from the query complexity of quantum algorithms to their practical circuit complexity. Our protocols allow tunable ancillary qubit numbers and offer space-time trade-offs. For general sparse matrices of dimension $2^n = N$, we investigate the SAIM and block-encoding. For both access models, we first show that the gate count lower bound increases nearly linearly with $N$. We then develop construction algorithms with ancillary qubit numbers ranging from $\Omega(n)$ to $O(N)$. Across the entire range of qubit numbers, we achieve nearly optimal circuit complexity. We next study the block-encoding of LCU. Efficient block-encoding is achievable when the matrix can be represented as a linear combination of a polynomial number of unitaries, each of which can be implemented using a polynomial-size quantum circuit.
Our access model construction relies on optimized realizations of various subroutines that are independently valuable, including quantum state preparation, select oracles for Pauli strings, and sparse Boolean functions. In all the listed operations, we achieve improved or at least comparable circuit complexities compared to the best-known realizations.
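We now introduce the definitions of SAIM and block-encoding. Let $N = 2^n$. We consider a sparse matrix $H \in \mathbb{C}^{N \times N}$ with at most $s = O(1)$ nonzero elements in each row and column. Let $H_{x,y}$ be the value of the element in the $x$th row and $y$th column, where each $H_{x,y}$ is a $d$-digit integer ($d = O(1)$). Let idx denote a $2n$-qubit index register and wrd denote a $d$-qubit word register. The sparse-access input model (SAIM) corresponds to two unitaries $O_H$ and $O_F$, which satisfy
$$O_H |x, y\rangle_{\text{idx}} |z\rangle_{\text{wrd}} = |x, y\rangle_{\text{idx}} |z \oplus H_{x,y}\rangle_{\text{wrd}}, \qquad O_F |x, k\rangle_{\text{idx}} = |x, F(x,k)\rangle_{\text{idx}}. \quad (1)$$
Here, $F(x,k)$ is the column index of the $k$th nonzero element in row $x$. Due to its simplicity and generality, Eq. (1) has become one of the standard access models in quantum computing, and it is usually assumed to be available when processing classical data. We call a unitary $U$ a block-encoding of $H$ if
$$\left(\langle 0^{n_{\text{anc}}}| \otimes I_N\right) U \left(|0^{n_{\text{anc}}}\rangle \otimes I_N\right) = H/\alpha, \quad (2)$$
where $\alpha > 0$ is the normalization factor, $n_{\text{anc}}$ is the number of ancillary qubits, and $I_N$ is the $N$-dimensional identity.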
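To make the two SAIM oracles concrete, the following is a minimal classical sketch of their action on computational basis states. It assumes the common convention that the value oracle XORs $H_{x,y}$ into the word register and the position oracle returns the column of the $k$th nonzero entry of row $x$; the names `make_saim`, `value_oracle`, `position_oracle` and the example matrix `ROWS` are hypothetical and only serve the illustration.

```python
from typing import Callable, Dict, Tuple


def make_saim(rows: Dict[int, Dict[int, int]]) -> Tuple[Callable, Callable]:
    """Build classical models of the SAIM oracles for a sparse integer matrix.

    rows[x][y] holds H[x][y] and lists nonzero entries only.
    """

    def value_oracle(x: int, y: int, z: int) -> Tuple[int, int, int]:
        # Assumed action: |x, y>|z> -> |x, y>|z XOR H[x][y]>
        return x, y, z ^ rows.get(x, {}).get(y, 0)

    def position_oracle(x: int, k: int) -> Tuple[int, int]:
        # Assumed action: |x, k> -> |x, F(x, k)>, with F(x, k) the column
        # of the k-th nonzero entry in row x (columns in ascending order).
        cols = sorted(rows.get(x, {}))
        return x, cols[k]

    return value_oracle, position_oracle


# Hypothetical 4 x 4 symmetric matrix with s = 2 nonzero entries per row.
ROWS = {
    0: {1: 3, 2: 5},
    1: {0: 3, 3: 7},
    2: {0: 5, 3: 2},
    3: {1: 7, 2: 2},
}
value_oracle, position_oracle = make_saim(ROWS)
```

Because the value oracle acts by XOR, applying it twice restores the word register, which is what makes the corresponding quantum operation a permutation of basis states and hence unitary. The table `ROWS` has $O(N)$ independent entries, which is the source of the hardness discussed later.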
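In practice, we may consider an approximate construction of the block-encoding. More specifically, we call a unitary $\tilde{U}$ an $(\alpha, n_{\text{anc}}, \varepsilon)$-block-encoding of $H$ if
$$\left\| H - \alpha \left(\langle 0^{n_{\text{anc}}}| \otimes I_N\right) \tilde{U} \left(|0^{n_{\text{anc}}}\rangle \otimes I_N\right) \right\| \leqslant \varepsilon$$
for error parameter $\varepsilon \geqslant 0$. Throughout the manuscript, $\|\cdot\|$ represents either the spectral norm for matrices or the Euclidean norm for vectors. For a general $N$-dimensional matrix $H$, the construction of its block-encoding requires an $\Omega(\mathrm{Poly}(N))$ gate count. This is true even for sparse $H$, as we show in Supplementary Discussion 2.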
On the other hand, when $H$ has additional structure, the resources may be significantly reduced. In particular, we consider $H$ in the form of a linear combination of unitaries (LCU),
$$H = \sum_{p=0}^{P-1} \alpha_p U_p, \quad (3)$$
where $U_p$ are $n$-qubit unitaries that can be implemented with polynomial-size quantum circuits, and $P = O(\mathrm{poly}(n))$. The concept of LCU first appeared in [18]. The main purpose of Ref. [18] and the follow-up work Ref. [19] is to realize nonunitary transformations on quantum computers. In the context of Hamiltonian simulation, Ref. [20] showed that LCU-based methods can outperform product-formula-based methods. This has inspired many subsequent works with different applications [14,15,21,22,23].
Without loss of generality, we may assume that $\log_2 P$ is an integer and $\sum_{p=0}^{P-1} \alpha_p = 1$. This can always be satisfied by adding terms with zero amplitude and rescaling the Hamiltonian. In particular, the linear combination of Pauli strings will be studied in detail:
$$H = \sum_{p=0}^{P-1} \alpha_p H_p. \quad (4)$$
Here, $\alpha_p > 0$, $P \geqslant 1$, $H_p = \bigotimes_{l=1}^{n} \sigma_{p,l}$, and $\sigma_{p,l} \in \{\pm I, \pm X, \pm Y, \pm Z\}$ are single-qubit Pauli operators. Eq. (4) is important because it describes the Hamiltonian of almost all physical quantum systems, such as spin and molecular systems.
In our constructions, we consider the fault-tolerant quantum computing setting. More specifically, we only use two-qubit Clifford gates and the single-qubit $T$ gate, which is equivalent to the elementary gate set $\mathcal{G}_{\text{clf}+T} \equiv \{H, S, T, \text{CNOT}\}$. All gates in $\mathcal{G}_{\text{clf}+T}$ are error-correctable with the surface code [24]. We benchmark the circuit complexity of a given quantum circuit with three quantities: the total number of elementary gates, the total qubit number (including data qubits and ancillary qubits), and the circuit depth. We also discuss the space-time trade-off of our algorithms, i.e. the circuit depth under a given number of ancillary qubits. We always allow at least $O(n)$ ancillary qubits, because this does not increase the total space complexity.

Circuit complexity lower bound
Before discussing the access model construction, we first study the lower bound of the circuit complexity. We focus on the encoding of sparse matrices. The methodology here is general and can be readily applied to other related problems.
Our strategy is as follows. First, we analyze the capacity of a quantum circuit with bounded resources, i.e. how many distinct unitaries can be constructed given a fixed number of elementary gates or a fixed circuit depth. Second, we analyze the size of the access model, i.e. the number of distinct unitaries required to approximate the access model with arbitrary parameters. The circuit complexity can then be estimated by comparing the capacity of a quantum circuit with the size of the access model. All proofs of the lemmas and theorems in this section are provided in Supplementary Discussion 1.

Quantum circuit capacity.
Assume that we are given a finite two-qubit elementary gate set $\mathcal{G}_{\text{ele}}$. We define $K \equiv |\mathcal{G}_{\text{ele}}| = O(1)$, with $|\cdot|$ the number of elements in the set. Our first result is that the capacity can be bounded in terms of the number of elementary gates alone, independent of the space and time resources.
Lemma 1. Let $\mathcal{G}_C$ be the set containing all $n$-qubit unitaries that can be constructed with $C$ elementary gates in $\mathcal{G}_{\text{ele}}$. Then, we have $\log|\mathcal{G}_C| = O(C\log(C+n))$, even with an unlimited number of ancillary qubits.
Lemma 1 implies that the capacity does not always increase with the ancillary qubit number, which can be understood as follows. All ancillary qubits should be uncomputed at the end of the circuit. When $C$ is fixed, only a finite number of unitaries can satisfy this requirement while being constructible from those elementary gates. We also note that the circuit depth $D$ is bounded by $C$, so Lemma 1 also implies a relation between capacity and circuit depth.
On the other hand, when the ancillary qubit number and circuit depth are finite, the bound on the capacity can be tightened as follows.

Lemma 2. Let $\mathcal{G}'_{n_{\text{anc}},D}$ be the set containing all unitaries that can be constructed with $n_{\text{anc}}$ ancillary qubits and circuit depth $D$. Then

Circuit complexity for encoding sparse matrices.
With Lemmas 1 and 2, we now estimate the circuit complexity lower bound for accessing sparse matrices. For SAIM, it turns out that at least $\Omega(N!)$ distinct unitaries are required to cover the set of all SAIMs for 1-sparse matrices. Thus, according to Lemmas 1 and 2, we have the following result.
Theorem 1. Given an arbitrary finite two-qubit elementary gate set $\mathcal{G}_{\text{ele}}$, let $n_{\text{anc}}$, $D$ and $C$ be the number of ancillary qubits, the circuit depth and the total number of gates in $\mathcal{G}_{\text{ele}}$ required to approximate the SAIM in Eq. (1) with any accuracy $\varepsilon < 1$. Then, we have $D(n + n_{\text{anc}}) = \Omega(2^n)$ and $C = \Omega(2^n)$.
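The counting argument behind this kind of bound can be illustrated numerically. The sketch below is not the proof from Supplementary Discussion 1; it assumes a deliberately crude model in which each of the gates picks one of `gate_set_size` gate types and an ordered pair among at most $n$ plus twice the gate count qubits, and it compares the resulting capacity with $\log_2(N!)$, the number of bits needed to distinguish the 1-sparse SAIM instances. All function names are hypothetical.

```python
import math


def log2_capacity_bound(num_gates: int, n: int, gate_set_size: int = 4) -> float:
    """Crude upper bound on log2 of the number of distinct circuits.

    Assumed model: every gate chooses a gate type and an ordered pair of
    qubits among n data qubits plus at most 2 * num_gates ancillas that
    the circuit can actually touch.
    """
    qubits = n + 2 * num_gates
    return num_gates * math.log2(gate_set_size * qubits * qubits)


def log2_factorial(m: int) -> float:
    """log2(m!) computed through the log-gamma function."""
    return math.lgamma(m + 1) / math.log(2)


def min_gates_for_saim(n: int, gate_set_size: int = 4) -> int:
    """Smallest gate count whose capacity bound reaches log2((2^n)!)."""
    target = log2_factorial(2 ** n)
    gates = 1
    while log2_capacity_bound(gates, n, gate_set_size) < target:
        gates += 1
    return gates
```

In this model the capacity grows like the gate count times a logarithm, while $\log_2(N!)$ grows like $N \log_2 N$, so the minimum gate count returned by `min_gates_for_saim` grows roughly linearly in $N = 2^n$, in line with the statement of the theorem.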
A similar result is also obtained for the block-encoding of sparse matrix as follows.
Theorem 2. Given an arbitrary finite two-qubit elementary gate set $\mathcal{G}_{\text{ele}}$, let $n_{\text{anc}}$, $D$ and $C$ be the number of ancillary qubits, the circuit depth and the total number of gates in $\mathcal{G}_{\text{ele}}$ required to construct the block-encoding of $H$ with any accuracy $\varepsilon < 2$. Then, we have $D(n + n_{\text{anc}}) = \Omega(N)$ and $C = \Omega(N^{\delta})$ for arbitrary $\delta \in (0, 1)$.
Theorems 1 and 2 imply that a general sparse matrix cannot be encoded with a subexponential number of quantum gates, for both SAIM and block-encoding. It is possible to trade ancillary qubit number for circuit depth. However, the space and time complexities cannot achieve subexponential scaling simultaneously. The hardness of SAIM can be interpreted as follows.
Although $H$ is assumed to be sparse ($O(1)$ nonzero elements in each row and column), there are still $2^n \times O(1) = O(2^n)$ independent variables in total. Therefore, the quantum circuit must be large enough to contain an exponential number of elementary gates.
We note that the quantum circuit capacity for the ancilla-free case has been studied in Section 4.5.4 of [1]. Moreover, a result related to Theorem 1 has been obtained in [25], which gives a lower bound on the number of distinct quantum circuits with a fixed qubit number, and shows that there exist tables whose implementation requires a gate count that grows with the table size. Ref. [1] allows approximate implementations, but does not consider ancillary qubit usage. Ref. [25] implicitly allows ancillary qubits, but does not consider approximate implementations. In contrast, our results are more general, because both ancillary qubit usage and approximate implementations are allowed. Our results can also be generalized from unitaries to quantum channels. In Supplementary Discussion 1, we show that the circuit capacity and the circuit lower bound are similar if we consider two-qubit quantum channels as elementary quantum operations, which can include measurement and feedback control.

Here, we provide a family of improved quantum state preparation protocols with a tunable number of ancillary qubits. The result is summarized below (it follows directly from Theorem 8 in Methods).

Theorem 3 achieves linear scaling of the Clifford+$T$ count with respect to $N$, and this holds for arbitrary space complexity. When $n_{\text{anc}} = O(n)$, the circuit depth is lower than the best-known result of $O(N\log(N/\varepsilon))$. Moreover, compared to [32], which also studies the space-time trade-off of state preparation, our method improves the circuit depth scaling by a factor of $\tilde{O}(n_{\text{anc}}/\log n_{\text{anc}})$. Summaries of some representative state preparation protocols are provided in Table I and Table II.

TABLE II. Clifford+$T$ complexities of $n$-qubit state preparation protocols with fixed $\varepsilon$ and exponentially many ancillary qubits. The $\varepsilon$ scaling of the Clifford+$T$ count and depth is $O(\log(1/\varepsilon))$ for all schemes. Total qubit numbers are $O(\mathrm{poly}(N))$ for Ref. [33], $O(N\log(1/\varepsilon))$ for Ref. [38], and $O(N)$ for all other protocols. Refs. [38] and [16] also minimize $T$ complexities. Protocols labelled * only require sparse connectivity, i.e. each qubit connects to $O(1)$ other qubits.
Protocols | Count | Depth
Ref. [32,33] | $O(N)$ | $O(n^2)$
Ref. [16,38] | |
The main idea of our construction is as follows (see also Fig. 1). For $n_{\text{anc}} = O(n)$, we construct the quantum state with a set of uniformly controlled rotations (UCRs) using the method in [28]. Instead of decomposing each UCR with identical accuracy, we distribute the decomposition error in an optimized way. A UCR with $k$ control qubits, denoted $k$-UCR, is decomposed into $2^k$ $k$-qubit-controlled rotations. When performing the Clifford+$T$ decomposition, to reduce the total circuit complexity, we allow a larger decomposition error $\varepsilon_k$ when $k$ becomes larger.
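The benefit of a non-uniform error budget can be seen in a simplified cost model. The sketch below assumes that the $k$-UCR contains $2^k$ rotations, that each rotation approximated to error $\varepsilon_k$ costs about $\log_2(1/\varepsilon_k)$ gates (constants dropped), and that the errors of the $n$ UCRs add up to the total error. The geometric schedule $\varepsilon_k = \varepsilon\, 2^{-(n-k)}$ is one illustrative choice, not necessarily the allocation used in Methods; the function names are hypothetical.

```python
import math
from typing import List


def uniform_allocation(n: int, eps: float) -> List[float]:
    """Give every k-UCR (k = 0..n-1) the same error budget eps / n."""
    return [eps / n] * n


def geometric_allocation(n: int, eps: float) -> List[float]:
    """Looser budget for larger k: eps_k = eps * 2^{-(n-k)}, summing to < eps."""
    return [eps * 2.0 ** (-(n - k)) for k in range(n)]


def gate_cost(allocation: List[float]) -> float:
    """Assumed cost model: 2^k rotations in the k-UCR, log2(1/eps_k) gates each."""
    return sum((2 ** k) * math.log2(1.0 / e) for k, e in enumerate(allocation))


n, eps = 10, 1e-3
cost_uniform = gate_cost(uniform_allocation(n, eps))
cost_geometric = gate_cost(geometric_allocation(n, eps))
```

Under these assumptions the uniform schedule costs about $N\log_2(n/\varepsilon)$, whereas the geometric schedule costs $N\log_2(1/\varepsilon) + O(N)$, because the extra term $\sum_k 2^k (n-k)$ is only $O(2^n)$. This is the mechanism by which the dependence on $n$ inside the logarithm can be removed.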
For $n_{\text{anc}} = O(N)$, we improve the Clifford+$T$ decomposition of the method in [34] in a similar way. In both cases, the gate count scaling $O(N\log(1/\varepsilon))$ is achieved. For an arbitrary ancillary qubit number between the two extreme cases, we provide a scheme combining the two protocols, which allows a space-time trade-off. Details of our state preparation scheme and the corresponding complexity analysis are provided in Methods. We also note that our protocol for the few-qubit case can be combined with the depth-optimal scheme in [37]. The circuit depth can then be improved to $O(N\log(1/\varepsilon)\log(n_{\text{anc}})/n_{\text{anc}})$, at the cost of a higher gate count.
We note that when the quantum state is sparse, the circuit complexity is significantly lower. The construction of sparse state preparation is useful for sparse block-encoding. Details about sparse state preparation and sparse matrix block-encoding are provided in Supplementary Discussion 2.

Other useful subroutines
TABLE III. Summary of the Clifford+$T$ circuit complexities of the operations serving as subroutines in this work. Suppose the subroutine has $n_{\text{dat}}$ data qubits. In the last column, the first row corresponds to the circuit depth when there are $n_{\text{anc}} = O(n_{\text{dat}})$ ancillary qubits; the second row corresponds to the circuit depth without restriction on the qubit number. $\tilde{O}$ suppresses doubly logarithmic factors. SP, SOPS, SBM and SSP correspond to state preparation, select oracle for Pauli strings, sparse Boolean memory and sparse state preparation (Supplementary Discussion 2), respectively.

Before discussing the construction of the access models in Eq. (1) and Eq. (2), we introduce some other useful subroutines, including the select oracle and the quantum sparse Boolean memory. These operations may find independent applications in other scenarios. For both operations, we obtain space-time trade-off constructions, which have improved or comparable Clifford+$T$ complexities compared to the best-known realizations (see also Table III).
Select oracle for Pauli strings. We consider a function of Pauli strings $H_x = \bigotimes_{l=1}^{m} \sigma_{x,l}$, where $x \in \{0, 1, \cdots, 2^n - 1\}$ and $\sigma_{x,l} \in \{\pm I, \pm X, \pm Y, \pm Z\}$. We introduce two registers: the index register contains $n$ qubits, and the word register contains $m$ qubits. The select oracle for $H_x$ is defined as
$$\text{Select}(H_x) = \sum_{x=0}^{N-1} |x\rangle\langle x| \otimes H_x, \quad (6)$$
where $|x\rangle$ represents the computational basis of the index register, and the unitary $H_x$ is applied to the word register. In other words, the state of the index register controls the operation applied to the word register. Several proposals for implementing Eq. (6) have been introduced in the literature. For example, with $n_{\text{anc}} = n$ ancillary qubits, Ref. [39] (Appendix G.4) proposed a method achieving $O(Nm)$ circuit depth and gate count with $N = 2^n$. With $n_{\text{anc}} = O(Nm)$ ancillary qubits, Eq. (6) is a special form of the "product unitary memory" in [34], which can be constructed with $O(\log(Nm))$ depth and $O(Nm)$ count of Clifford+$T$ gates. We provide an algorithm with a tunable ancillary qubit number achieving the circuit complexity as follows.
Compared to the result in Ref. [39], our protocol reduces the circuit depth by a factor of $O(\log(n_{\text{anc}})/n_{\text{anc}})$ while maintaining the gate count scaling. The proof of Theorem 4 and details of the circuit constructions are provided in Methods.
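As an illustration of what the select oracle does (not of the circuit construction in Methods), the following sketch tracks its action on computational basis states: the index value picks a Pauli string, which then flips bits and accumulates a phase on the word register. The names `apply_pauli_string` and `select_pauli` are hypothetical, and the overall signs $\pm$ of the strings are omitted for brevity.

```python
from typing import Sequence, Tuple


def apply_pauli_string(string: str, bits: Sequence[int]) -> Tuple[complex, Tuple[int, ...]]:
    """Apply a Pauli string such as 'XZY' to the basis state |bits>.

    Returns (phase, new_bits) such that P|bits> = phase * |new_bits>.
    """
    phase: complex = 1
    out = []
    for pauli, b in zip(string, bits):
        if pauli == "I":
            out.append(b)
        elif pauli == "X":
            out.append(1 - b)
        elif pauli == "Z":
            phase *= (-1) ** b
            out.append(b)
        elif pauli == "Y":
            # Y|0> = i|1>,  Y|1> = -i|0>
            phase *= 1j * (-1) ** b
            out.append(1 - b)
        else:
            raise ValueError(f"unknown Pauli label {pauli!r}")
    return phase, tuple(out)


def select_pauli(strings: Sequence[str], x: int, bits: Sequence[int]):
    """Action of Select on |x>|bits>: the index x chooses which string is applied."""
    phase, new_bits = apply_pauli_string(strings[x], bits)
    return x, phase, new_bits


# Hypothetical example with a 2-qubit index (four strings) and a 2-qubit word.
STRINGS = ["XZ", "YY", "ZI", "IX"]
```

Each Pauli string squares to the identity, so applying `select_pauli` twice with the same index returns the original bits with total phase 1. The index register is never modified, which reflects the block-diagonal form of Eq. (6).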
Sparse Boolean memory. We consider a sparse Boolean function $f: \{0,1\}^n \rightarrow \{0,1\}^{\tilde{n}}$, which has in total $s$ inputs $x$ satisfying $f(x) \neq 0\cdots0$. Given an $n$-qubit index register (denoted as idx) and an $\tilde{n}$-qubit register (denoted as wrd), we define the sparse Boolean memory Select($f$) as a unitary satisfying
$$\text{Select}(f)\,|x\rangle_{\text{idx}}|z\rangle_{\text{wrd}} = |x\rangle_{\text{idx}}|z \oplus f(x)\rangle_{\text{wrd}}. \quad (7)$$
We have the following result (see Methods for proof).

Different from SAIM, Eq. (7) contains only a constant number of nonzero outputs, so its construction requires far fewer resources.

Construction of SAIM
With all necessary tools ready, we now discuss the construction of the SAIM in Eq. (1). We have the following result.

depth of Clifford+$T$ gates. The total gate complexity is therefore the combination of the three steps above. □

Compared to the circuit complexity lower bound obtained in Theorem 1, our protocol has nearly optimal circuit complexities with respect to the matrix dimension, up to a factor of $n$. As mentioned before, SAIM is a standard access model in many quantum algorithms, and the query complexity to SAIM has been studied extensively for various tasks. With Theorem 6, one can directly obtain the actual circuit complexity of those algorithms. Further discussions are provided in the DISCUSSION section.

Construction of LCU-based Block-encoding
The construction of LCU-based block-encoding can be realized with quantum state preparation and a select oracle [13].
Let $G_{|\alpha\rangle}$ be the state preparation unitary for $|\alpha\rangle = \sum_{p=0}^{P-1}\sqrt{\alpha_p}\,|p\rangle$, and define $G \equiv G_{|\alpha\rangle} \otimes I_{2^n}$. We then define a select oracle corresponding to Eq. (3) as $\text{Select}(U_p) = \sum_{p=0}^{P-1} |p\rangle\langle p| \otimes U_p$. It can be verified that $G^{\dagger}\,\text{Select}(U_p)\,G$ is a block-encoding of $H$ with normalization factor $\alpha = 1$ [14]. The construction of LCU-based block-encoding is thus reduced to quantum state preparation and Select($U_p$), both of which can be constructed with polynomial-size quantum circuits.
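The identity $G^{\dagger}\,\text{Select}(U_p)\,G$ can be checked numerically on a small instance. The sketch below uses only the Python standard library and dense matrices; it assumes the state preparation unitary is realized as a Householder reflection mapping $|0\rangle$ to $\sum_p \sqrt{\alpha_p}|p\rangle$, which is a convenient stand-in for illustration and not the circuit construction of this work. All helper names are hypothetical.

```python
import math

PAULI = {"I": [[1, 0], [0, 1]], "X": [[0, 1], [1, 0]],
         "Y": [[0, -1j], [1j, 0]], "Z": [[1, 0], [0, -1]]}


def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def dagger(a):
    return [[complex(a[j][i]).conjugate() for j in range(len(a))]
            for i in range(len(a[0]))]


def identity(dim):
    return [[1 if i == j else 0 for j in range(dim)] for i in range(dim)]


def pauli_string(label):
    mat = [[1]]
    for ch in label:
        mat = kron(mat, PAULI[ch])
    return mat


def householder_prep(amps):
    """Real unitary mapping |0> to sum_p amps[p]|p> (amps real, unit norm)."""
    v = [(1.0 if p == 0 else 0.0) - a for p, a in enumerate(amps)]
    norm2 = sum(x * x for x in v)
    if norm2 < 1e-15:          # target state already equals |0>
        return identity(len(amps))
    return [[(1.0 if i == j else 0.0) - 2 * v[i] * v[j] / norm2
             for j in range(len(v))] for i in range(len(v))]


def select(unitaries):
    """Block-diagonal matrix sum_p |p><p| (x) U_p."""
    dim = len(unitaries[0])
    total = dim * len(unitaries)
    out = [[0] * total for _ in range(total)]
    for p, u in enumerate(unitaries):
        for i in range(dim):
            for j in range(dim):
                out[p * dim + i][p * dim + j] = u[i][j]
    return out


def lcu_block_encoding(alphas, labels):
    """Return G^dagger Select G for H = sum_p alphas[p] * Pauli(labels[p])."""
    unitaries = [pauli_string(s) for s in labels]
    prep = householder_prep([math.sqrt(a) for a in alphas])
    g = kron(prep, identity(len(unitaries[0])))
    return matmul(dagger(g), matmul(select(unitaries), g))


ALPHAS = [0.4, 0.3, 0.2, 0.1]          # hypothetical coefficients summing to 1
LABELS = ["XI", "ZZ", "YX", "IZ"]      # hypothetical 2-qubit Pauli strings
U = lcu_block_encoding(ALPHAS, LABELS)
```

With the index register taken as the most significant tensor factor, the top-left $4 \times 4$ block of `U` equals $\sum_p \alpha_p U_p$, since the first column of the preparation unitary has entries $\sqrt{\alpha_p}$ and the select oracle is block diagonal.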
The exact circuit complexity of block-encoding depends on the specific form of $U_p$. We take the LCU of Pauli strings [Eq. (4)] as an example. Based on our improved quantum state preparation (Theorem 3) and select oracle for Pauli strings (Theorem 4), we have the following result, where $(n_{\text{anc}}, \varepsilon)$-block-encoding is the abbreviation of $(1, n_{\text{anc}}, \varepsilon)$-block-encoding (see the Methods section for the proof).

The block-encoding of LCU can be constructed with polylogarithmic circuit complexity with respect to the data dimension, as opposed to the SAIM, which requires a polynomial gate count. Therefore, for structured classical data in the form of Eq. (3), exponential quantum advantage can be expected. Below, we provide further discussion of our results.

III. DISCUSSION
As demonstrated in Theorem 1, a general SAIM cannot be implemented with an $O(\mathrm{Poly}(n))$-size quantum circuit. In the language of complexity classes, this implies that BQP$^{\text{SAIM}} \neq$ BQP, where SAIM represents the quantum oracles in the form of Eq. (1). In other words, if a problem can be solved with a polynomial number of queries to the SAIM, it is not necessarily solvable with polynomial-size quantum circuits. In fact, it is reasonable to conjecture that BQP$^{\text{SAIM}} \neq$ PSPACE when considering the scaling with $n$. The reason is that for a general matrix of dimension $2^n$, storing all of its elements requires exponentially large space, and this is true even for sparse matrices. The same argument applies to the block-encoding of sparse matrices as well.
This argument is consistent with the results on classical dequantization algorithms [40,41], which demonstrate that sublinear classical runtime can be achieved for tasks such as recommendation systems and solving linear systems. Note that these algorithms assume a classical oracle similar to SAIM.
On the other hand, our study of sparse matrix encoding remains valuable. First, it is rare to have structured classical data that can be encoded with logarithmic complexity. In many cases, a sparse matrix is the most compact representation of the classical data of interest. Second, with SAIM or block-encoding, polynomial quantum speedup with respect to the matrix dimension $N$ is still possible. Our constructions are nearly optimal, and can be used to estimate the concrete Clifford+$T$ complexities of many quantum algorithms of practical interest. Finally, techniques developed here may serve as a subroutine for encoding a larger matrix with special structure, with which exponential quantum advantage may be possible.
An open question is how to determine whether a given matrix is efficiently block-encodable. This problem can be considered a generalization of the unitary complexity problem [42][43][44][45], which is important due to the broad applications of block-encoding [15]. According to Theorem 7, LCU of efficient unitaries [Eq. (3)] is a sufficient condition for efficient block-encoding. Due to the generality and simplicity of LCU, it is reasonable to conjecture that the decomposition of a matrix in the form of Eq. (3) is closely related to the efficiency of its block-encoding. The block-encoding of $H$ is challenging when it cannot be well approximated by Eq. (3) with $P = O(\mathrm{Poly}(n))$.
In conclusion, we have studied the circuit complexities of typical quantum access models, such as SAIM and block-encoding. We show that the circuit complexity lower bound for encoding sparse matrices is polynomial with respect to the matrix dimension. We provide nearly optimal construction protocols that approach the lower bound. For LCU-based block-encoding, we develop a construction protocol based on improved implementations of quantum state preparation and the select oracle for Pauli strings. Our protocols are based on Clifford+$T$ gates and allow a tunable ancillary qubit number. We expect that our results are useful for processing classical data with quantum devices [46][47][48]. Future work may include the study of the circuit complexity lower bound for block-encoding, and how to further improve our protocols to achieve the lower bounds. Another interesting topic is the power of quantum circuits with global quantum channels, for example, when the feedback controls depend on the outcomes of many measurements. In this case, the elementary operations may no longer be described by local operations, and the computational power of the circuit is expected to be enhanced. In the direction of applications, it is interesting to find practical classical problems whose data structure can be represented in the form of LCU. In those scenarios, exponential quantum advantage can be expected.

Quantum state preparation
We first consider the preparation with $n$ ancillary qubits. There are some state preparation protocols with optimal single- and two-qubit gate count, such as Ref. [28]. However, with a direct Clifford+$T$ decomposition, the gate complexity becomes suboptimal. We achieve gate count and circuit depth linear in the state dimension with an optimized Clifford+$T$ decomposition. The result is as follows.
Lemma 3. With $n$ ancillary qubits, an arbitrary quantum state can be prepared to precision $\varepsilon$ with $O(N\log(1/\varepsilon))$ depth and $O(N\log(1/\varepsilon))$ count of Clifford+$T$ gates.
Proof. According to [28], with single- and two-qubit gates, an arbitrary quantum state $|\psi_{\text{targ}}\rangle$ can be expressed as a product of uniformly controlled Z- and Y-rotations applied to the initial state, built from the single-qubit rotation gates $R_z(\theta) = e^{-i\theta Z/2}$ and $R_y(\theta) = e^{-i\theta Y/2}$. The rotation angles are real numbers, the exact values of which are not important for our analysis.
Single-qubit rotations can be approximated with Clifford+$T$ gates. Note that $R_z^{\dagger}$ can be realized by the Hermitian conjugate of the Clifford+$T$ gate sequence of $R_z$, i.e. by reversing the sequence and inverting each gate. A similar argument applies to $R_y$. Then, according to Lemma 6, which is introduced in the next section, one can construct the approximate $k$-UCRs with $n$ ancillary qubits, $O(2^k\log(1/\varepsilon_k))$ depth and $O(2^k\log(1/\varepsilon_k))$ count of Clifford+$T$ gates. We therefore approximate the target state by applying these approximate UCRs in sequence. The total gate count and, in a similar way, the total circuit depth are obtained by summing these costs over $k$.

□
We then consider quantum state preparation with exponentially many ancillary qubits. Our protocol follows the idea in [34] with improvements.
Lemma 4. An arbitrary $n$-qubit quantum state can be prepared with $O(N)$ ancillary qubits, $O(n\log(n/\varepsilon))$ depth and $O(N\log(1/\varepsilon))$ count of Clifford+$T$ gates.
Proof. Our construction is based on the protocol in [34] with revisions and an improved Clifford+$T$ decomposition.
General procedure. The hardware layout of our method contains a binary tree of qubits with $n + 1$ layers. For $1 \leqslant l \leqslant n$, the $l$th layer of this tree connects to another binary tree of qubits, whose root serves as the $l$th data qubit, denoted $\text{d}_l$.
Our protocol for preparing the target state $|\psi_{\text{targ}}\rangle$, written in the computational basis $|j\rangle_{\text{d}}$ of the data qubits with $0 \leqslant j \leqslant 2^n - 1$, works as follows. We initialize the root of the binary tree in $|1\rangle$ while all other qubits are in state $|0\rangle$. In the first stage, the binary tree is prepared in the quantum state of Eq. (21) (qubits in state $|0\rangle$ are not shown), a superposition of computational basis states of the tree that are defined later. In the second stage [Eq. (22)], the data qubits are transferred to the $n$-qubit computational basis $|j\rangle_{\text{d}}$ conditioned on the basis state of the tree.
Finally, the binary tree is uncomputed [Eq. (23)]. The target state is then obtained after tracing out the tree. The reader is referred to [34] for more details. The transformations in Eq. (22) and Eq. (23) can be realized exactly using Clifford circuits with $O(n)$ depth and $O(2^n)$ gate count. On the other hand, the first stage for obtaining Eq. (21) contains rotations that have to be approximated with $T$ gates and is hence more complicated. So we focus on Eq. (21) below.
Realization of Eq. (21). We first show how Eq. (21) can be realized with single-qubit and CNOT gates using a method slightly different from [34], and then introduce its Clifford+$T$ decomposition.
We define the single-qubit rotations
$$R_y(\theta) = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}, \qquad R_z(\theta) = \begin{pmatrix} e^{-i\theta} & 0 \\ 0 & e^{i\theta} \end{pmatrix},$$
and a three-qubit controlled operation as follows.

Circuit complexity.
Each layer operation can be realized with $O(2^l\log(1/\varepsilon_l))$ count and $O(\log(1/\varepsilon_l))$ depth of Clifford+$T$ gates, where $\varepsilon_l$ is the accuracy assigned to layer $l$. The total gate count and the total circuit depth of stage 1 [Eq. (21)] are obtained by summing these costs over the layers. Recall that Eq. (22) and Eq. (23) have $O(N)$ count and $O(n)$ depth of Clifford gates. So the total gate count and circuit depth are $O(N\log(1/\varepsilon))$ and $O(n\log(n/\varepsilon))$, respectively. □

We also consider controlled quantum state preparation. In our preparation scheme, the root of the binary tree is initialized as $|1\rangle$. If we set the root as $|0\rangle$ instead, it can be verified that the output state is $|0\cdots0\rangle_{\text{d}}$. Therefore, to implement controlled state preparation, one can simply replace the root qubit by the control qubit, and the circuit complexity remains unchanged. In other words, we have the following result.

Proof. We separate all data qubits into two registers. For some $M$ with $n \leqslant M \leqslant 2^n$, one register contains the first $\lfloor\log_2 M\rfloor$ data qubits, and the other register contains the last $n - \lfloor\log_2 M\rfloor$ data qubits. The target state can then be rewritten as a superposition, with normalized amplitudes, of basis states of the first register, each tensored with a normalized quantum state of the second register.
In the second step, we implement the select oracle of the state preparation unitaries for these normalized states, which determines the ancillary qubit number, gate count, and depth of Clifford+$T$ gates. By setting the ancillary qubit number $n_{\text{anc}} \geqslant n$ accordingly, we complete the proof. □

Select oracle for general unitary functions. Suppose $x$ is an $n$-bit bitstring, and $U_x$ are general unitaries. We consider the unitary
$$\text{Select}(U_x) = \sum_{x=0}^{N-1} |x\rangle\langle x| \otimes U_x, \quad (40)$$
where $N = 2^n$. Below, we discuss how to construct Select($U_x$) based on implementations of the single-qubit-controlled $U_x$, and the corresponding circuit complexity upper bound. We define $C_{\text{ctrl}}(U_x, m)$ and $D_{\text{ctrl}}(U_x, m)$ as the count and depth of Clifford+$T$ gates required to construct the controlled-$U_x$, given $m$ ancillary qubits. The following result corresponds to the case with $n + m$ ancillary qubits.

Proof. We introduce an ancillary register with $n$ qubits, whose qubits are indexed in the same way as the qubits of the index register (encoding $|x\rangle$). One ancillary qubit is initialized as $|1\rangle$ while all other ancillary qubits are initialized as $|0\rangle$.
Eq. (40) can be realized by a recursively defined Select routine, given in Algorithm 1. In Algorithm 1, Toffoli($a$, $b$; $c$) is the Toffoli gate with qubits $a$ and $b$ as the control qubits and $c$ as the target qubit; C-$U_x$($a$) is the controlled-$U_x$ with qubit $a$ as the control qubit and the corresponding word register as the target qubits; dim($\cdot$) represents the dimension of a vector.

In our implementation, the controlled-$U_x$ are queried $N$ times in total, with $x \in \{0, \cdots, N - 1\}$ sequentially. Moreover, there are in total $O(N)$ Toffoli gates acting sequentially. Therefore, the total gate count and circuit depth are $O(N C_{\text{ctrl}}(U_x, m))$ and $O(N D_{\text{ctrl}}(U_x, m))$, respectively. □

We note that Algorithm 1 can be further simplified by combining some concatenated gates [39], but the asymptotic scaling here is already optimal.
We then consider the construction with exponentially many ancillary qubits. In Algorithms 4 and 5 of [34], based on the bucket-brigade architecture for quantum random access memory [50][51][52], it has been shown that any Select($U_x$) can be constructed with $4N - 1$ ancillary qubits, $O(N)$ Clifford+$T$ gates arranged in $O(n)$ circuit depth, and queries to all single-qubit-controlled $U_x$ for $x \in \{0, \cdots, N - 1\}$ in parallel. If each controlled-$U_x$ uses $m$ ancillary qubits, we require $(4 + m)N - 1$ ancillary qubits in total, because they are implemented in parallel. To sum up, we have the following result.
Given  ancillary qubits, controlled-  can be constructed with the following circuit.
The actual circuit complexity depends on the form of   .We now proof Theorem.7 For any 1 We also define Select(  ) = Select( : ).For each   , we further define Boolean functions  , min : max () =   () ∧ ( min ⩽  ⩽  max ) for  min ⩽  max .For any 0 We first consider the construction with ancillary qubit number  () ⩽  anc ⩽  ( ).In this case, we decompose Select(()) with Eq. (57).We let  = ⌊ anc /()⌋ and  ′ = ⌈ ñ/⌉, and We then consider the construction with ancillary qubit number  () ⩽  anc ⩽  ().In this case, we first perform the decomposition Select() = ñ =1 Select(  ).Then, we decompose each Select(  ) with Eq. (58).We let  = ⌊/⌋ and  ′ = ⌈/⌉, and  We then study the upper bound of the total number of unique quantum circuits with circuit depth  and ancillary qubit number  anc .With a fixed qubit number  +  anc , the quantum circuit may be constructed in the following way.In the first stage, at each layer, we partition ( +  anc ) qubits into ( +  anc )/2 subsets (assuming ( +  anc ) is even), each of which contains two qubits.There are totally  (( +  anc ) 2 ) possible ways of partitions.In the second stage, we fill each subset with an elementary gate in G ele .There are totally  choices for each subset, so there are totally  ( (+ anc ) /2 ) possible choices in this stage.G ′  anc , is a subset of quantum circuits constructed in the process above.So we have G ′  anc , may be tighten with a restriction similar to Lemma. 1, i.e. acting trivially at the ancillary subspace.But the current result is sufficient for our analysis.

C. Circuit depth lower bound for SAIM: proof of Theorem 1
To begin with, we introduce the following lemma about the number of distinct matrices required to "cover" another set of matrices.

Equivalently, we have $D(n + n_{\text{anc}}) = \Omega(N)$ and $C = \Omega(N^{\delta})$ for arbitrary $\delta \in (0, 1)$.

□

F. Generalization to quantum channels
Our result can be readily generalized to quantum channels. More specifically, we are given an elementary set of quantum channels $\mathcal{G}^{(\text{channel})}_{\text{ele}}$, each element of which can be an arbitrary operation applied to a two-qubit system. This includes unitaries, measurements, and the corresponding feedback control. We also require that $|\mathcal{G}^{(\text{channel})}_{\text{ele}}| = O(1)$, i.e. there is a constant number of elementary quantum channels. With this definition, we have the following results.

Lemma 11. Let $\mathcal{G}^{(\text{channel})}_C$ be the set containing all $n$-qubit quantum channels that can be constructed with $C$ elementary channels in $\mathcal{G}^{(\text{channel})}_{\text{ele}}$ defined in Eq. (S-21) above. Then, we have $\log|\mathcal{G}^{(\text{channel})}_C| = O(C\log(C + n))$, even with an unlimited number of ancillary qubits.

FIG. 1. State preparation achieving $O(N)$ Clifford+$T$ count for the few-qubit case. The operation is decomposed into uniformly controlled Z- and Y-rotations, whose control and rotation parts are denoted with red and blue colors, respectively. Each $k$-UCR is decomposed into $2^k$ multi-qubit-controlled single-qubit rotations, and $k$ increases with the opacity of the control part (red). Each Z- or Y-rotation is decomposed into Clifford+$T$ gates, and the decomposition accuracy increases with the opacity of the rotation part (blue).

TABLE I. Clifford+$T$ complexities of $n$-qubit state preparation protocols with fixed accuracy $\varepsilon$ and total qubit (data qubit + ancillary qubit) number $O(n)$. The $\varepsilon$ scaling of the Clifford+$T$ count and depth is $O(\log(1/\varepsilon))$ for all protocols. The $\varepsilon$ scaling of the qubit number is $O(\log(1/\varepsilon))$ for Ref