Quantum circuit architecture search for variational quantum algorithms

Variational quantum algorithms (VQAs) are expected to be a path to quantum advantages on noisy intermediate-scale quantum devices. However, both empirical and theoretical results show that the deployed ansatz heavily affects the performance of VQAs: an ansatz with a larger number of quantum gates has stronger expressivity, while the accumulated noise may lead to poor trainability. To maximally improve the robustness and trainability of VQAs, here we devise a resource- and runtime-efficient scheme termed quantum architecture search (QAS). In particular, given a learning task, QAS automatically seeks a near-optimal ansatz (i.e., circuit architecture) that balances the benefits and side effects of adding more noisy quantum gates to achieve good performance. We implement QAS on both a numerical simulator and real quantum hardware, via the IBM cloud, to accomplish data classification and quantum chemistry tasks. In the problems studied, numerical and experimental results show that QAS not only alleviates the influence of quantum noise and barren plateaus, but also outperforms VQAs with pre-selected ansatze.

* duyuxuan123@gmail.com † min-hsiu.hsieh@foxconn.com ‡ dacheng.tao@sydney.edu.au
Variational quantum algorithms (VQAs) [1, 2], including quantum neural networks [3][4][5] and variational quantum eigen-solvers [6][7][8][9], are a class of promising candidates for using noisy intermediate-scale quantum (NISQ) devices to solve practical tasks that are beyond the reach of classical computers [10]. Recently, the effectiveness of VQAs on small-scale learning problems such as low-dimensional synthetic data classification, image generation, and energy estimation for small molecules has been validated by experimental studies [11][12][13][14]. Despite these promising achievements, the performance of VQAs degrades significantly when the qubit number and circuit depth become large, owing to the trade-off between expressivity and trainability [15]. More precisely, under the NISQ setting, involving more quantum resources (e.g., quantum gates) to implement the ansatz has both positive and negative consequences. On the one hand, the expressivity of the ansatz, which determines whether the target concept is covered by the represented hypothesis space, is strengthened by increasing the number of trainable gates [16][17][18][19]. On the other hand, a large circuit depth implies that the gradient information received by the classical optimizer is dominated by noise and the valid information vanishes exponentially, which may lead to divergent optimization or barren plateaus [20][21][22][23][24].
In this regard, it is of great importance to design an efficient approach that dynamically controls the expressivity and trainability of VQAs to attain good performance.
Initial studies have developed two leading strategies to address the above issue. The first is quantum error mitigation. Representative methods to suppress the noise effect on NISQ machines are quasiprobability [25,26], extrapolation [27], quantum subspace expansion [28], and data-driven methods [29,30]. In parallel to quantum error mitigation, the second strategy is to construct an ansatz with a variable structure. Compared with traditional VQAs with a fixed ansatz, this approach can not only maintain a shallow depth to suppress noise and trainability issues, but also keep sufficient expressivity to contain the solution. Current literature generally adopts brute-force strategies to design such a variable ansatz [31][32][33]. The required computational overhead is therefore considerable, since the number of candidate ansatze scales exponentially with the qubit count and the circuit depth. How to efficiently seek a near-optimal ansatz remains largely unknown.
In this study, we devise a quantum architecture search (QAS) scheme to effectively generate variable-structure ansatze, which considerably improves the learning performance of VQAs. The advantage of QAS is ensured by unifying noise inhibition and trainability enhancement for VQAs as a learning problem. In doing so, QAS does not request any ancillary quantum resource and its runtime is almost the same as that of conventional VQA-based algorithms. Moreover, QAS is compatible with all quantum platforms, e.g., optical, trapped-ion, and superconducting quantum machines, since it can actively adapt to the physical restrictions and weighted noise of varied quantum gates. In addition, QAS can seamlessly integrate [...]

Figure 1: In Step 1, QAS sets up the supernet $A$, which defines the ansatze pool $S$ to be searched and parameterizes each ansatz in $S$ via the specified weight sharing strategy. All possible single-qubit gates are highlighted by hexagons and two-qubit gates are highlighted by the brown rectangle. The unitary $U_x$ refers to the data encoding layer. In Step 2, QAS optimizes the trainable parameters for all candidate ansatze. Given the specified learning task $\mathcal{L}$, QAS iteratively samples an ansatz $a^{(t)} \in S$ from $A$ and optimizes its trainable parameters to minimize $\mathcal{L}$. $A$ correlates parameters among different ansatze via the weight sharing strategy. After $T$ iterations, QAS moves to Step 3 and exploits the trained parameters $\theta^{(T)}$ and the predefined $\mathcal{L}$ to compare the performance among $K$ ansatze. The ansatz with the best performance is selected as the output, indicated by a red smiley face. Last, in Step 4, QAS utilizes the searched ansatz and the parameters $\theta^{(T)}$ to retrain the quantum solver with few iterations.

RESULTS
The mechanism of VQAs. Before presenting QAS, we first recap the mechanism of VQAs. Given an input $Z$ and an objective function $\mathcal{L}$, a VQA employs a gradient-based classical optimizer that continuously updates the parameters in an ansatz (i.e., a parameterized quantum circuit) $U(\theta)$ to find the optimal $\theta^*$, i.e.,
$$\theta^* = \arg\min_{\theta \in C} \mathcal{L}(\theta, Z), \qquad (1)$$
where $C \subseteq \mathbb{R}^d$ is a constraint set, and $\theta$ are the adjustable parameters of the quantum gates [16,18]. For instance, when the VQA is specified as an eigen-solver [6], $Z$ refers to a Hamiltonian and the objective function could be chosen as $\mathcal{L} = \mathrm{Tr}(Z |\psi(\theta)\rangle\langle\psi(\theta)|)$, where $|\psi(\theta)\rangle$ is the quantum state generated by $U(\theta)$. For compatibility, throughout the whole study, we focus on exploring how QAS enhances the trainability of one typical heuristic ansatz, the hardware-efficient ansatz [11,13]. Such an ansatz is supposed to obey a multi-layer layout,
$$U(\theta) = \prod_{l=1}^{L} U_l(\theta), \qquad (2)$$
where $U_l(\theta)$ consists of a sequence of parameterized single-qubit and two-qubit quantum gates, and $L$ denotes the layer number. Note that the arrangement of quantum gates in $U_l(\theta)$ is flexible, enabling VQAs to adequately use the available quantum resources and to accord with any physical restriction. Remarkably, the achieved results can be effectively extended to other representative ansatze.
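The update loop described above can be sketched on a one-qubit toy problem. The snippet is a minimal illustration rather than the implementation used in this study: it assumes that $Z$ is the Pauli-Z observable, that the ansatz is a single $R_Y(\theta)$ gate acting on $|0\rangle$, and that the gradient is obtained with the parameter-shift rule. The function names, learning rate, and step count are hypothetical choices.

```python
import math


def energy(theta):
    """Tr(Z |psi><psi|) for |psi> = R_Y(theta)|0>, with Z taken as Pauli-Z."""
    amp0, amp1 = math.cos(theta / 2), math.sin(theta / 2)
    return amp0 ** 2 - amp1 ** 2


def parameter_shift_grad(f, theta):
    """Exact derivative of f for a gate of the form exp(-i * theta * P / 2)."""
    return 0.5 * (f(theta + math.pi / 2) - f(theta - math.pi / 2))


def train(theta=0.5, eta=0.4, steps=100):
    """Plain gradient descent: theta <- theta - eta * dL/dtheta."""
    for _ in range(steps):
        theta -= eta * parameter_shift_grad(energy, theta)
    return theta, energy(theta)


theta_opt, e_opt = train()
print(f"theta = {theta_opt:.4f}, energy = {e_opt:.6f}")
```

In this toy setting the loss is $\cos\theta$, so the optimizer drives $\theta$ towards $\pi$, where the energy reaches the smallest eigenvalue of Pauli-Z.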
The scheme of quantum architecture search. Let us formalize noise inhibition and trainability enhancement for VQAs as a learning task. Denote the set $S$ as the ansatze pool that contains all possible ansatze (i.e., circuit architectures) to build $U(\theta)$ in Eqn. (2). The size of $S$ is determined by the qubit count $N$, the maximum circuit depth $L$, and the number of allowed types of quantum gates $Q$, i.e., $|S| = O(Q^{NL})$. Throughout the whole study, when no confusion occurs, we denote $a$ as the $a$-th ansatz $U(\theta, a)$ in $S$. Notably, the performance of VQAs heavily relies on the ansatz selected from $S$. Suppose the quantum system noise, induced by $a$, is modeled by the quantum channel $\mathcal{E}_a$. Taking into account the circuit architecture information and the related noise, the objective of VQAs can be rewritten as
$$(\theta^*, a^*) = \arg\min_{\theta \in C,\, a \in S} \mathcal{L}(\theta, a, Z, \mathcal{E}_a). \qquad (3)$$
The learning problem formulated in Eqn. (3) forces the optimizer to output the best quantum circuit architecture $a^*$ by assessing both the effect of noise and the trainability. Notably, Eqn. (3) is intractable via the two-stage optimization strategy that is broadly used in previous literature [31][32][33], i.e., individually optimizing all possible ansatze from scratch and then ranking them to obtain $(\theta^*, a^*)$. This is because the classical optimizer needs to store and update $O(dQ^{NL})$ parameters, which forbids its applicability to problems that are large in terms of $N$ and $L$.
The proposed QAS belongs to the one-stage optimization strategy. Different from the two-stage optimization strategy, which suffers from the computational bottleneck, this strategy ensures the efficiency of QAS. In particular, for the same number of iterations T, the memory cost of QAS is at most T times more than that of conventional VQAs. Meanwhile, their runtime complexity is identical. The protocol of QAS is shown in Figure 1. Two key elements of QAS are the supernet and the weight sharing strategy. Both contribute to locating a good estimation of (θ*, a*) within a reasonable runtime and memory usage. Intuitively, the weight sharing strategy in QAS refers to correlating parameters among different ansatze. In this way, the parameter space, which amounts to the total number of trainable parameters to be optimized in Eqn. (3), can be effectively reduced. As for the supernet, it plays two significant roles in QAS: 1) the supernet serves as the ansatz indicator, which defines the ansatze pool S (e.g., determined by the maximum circuit depth and the choices of quantum gates) to be searched; 2) the supernet parameterizes each ansatz in S via the specified weight sharing strategy. QAS includes four steps, i.e., initialization (supernet setup), optimization, ranking, and fine tuning. We now elucidate these four steps.
1. (Initialization.) QAS employs a supernet A as an indicator for the ansatze pool S. Concretely, the setup of the supernet A amounts to leveraging the indexing technique to track S using a linear memory cost. For instance, when N = 4, L = 1, and the choices of the quantum gates are [...] See Method for the construction of the ansatze pool S involving two-qubit gates. Meanwhile, as detailed below, A parameterizes all candidate ansatze via the weight sharing strategy to reduce the parameter space.
2. (Optimization.) QAS jointly optimizes $\{(a, \theta)\}$ in Eqn. (3). Similar to conventional VQAs, QAS optimizes the trainable parameters in an iterative manner. At the $t$-th iteration, QAS uniformly samples an ansatz $a^{(t)}$ from $S$ (i.e., an index list indicated by $A$). To minimize $\mathcal{L}$ in Eqn. (3), the parameters attached to the ansatz $a^{(t)}$ are updated to $\theta^{(t+1)} = \theta^{(t)} - \eta\, \partial \mathcal{L}(\theta^{(t)}, a^{(t)}, Z, \mathcal{E}_{a^{(t)}}) / \partial \theta^{(t)}$, with $\eta$ being the learning rate. The total number of updates is set as $T$. Note that since the optimization of VQAs is NP-hard [37], empirical studies generally restrict $T$ to be less than $O(\mathrm{poly}(QNL))$ to obtain an estimation within a reasonable runtime cost.
To avoid the computational issue encountered by the two-stage optimization method, QAS leverages the weight sharing strategy developed in deep neural architecture search [38] to parameterize the ansatze in $S$ via a specified correlation rule. Concretely, for any ansatz $a' \in S$, if the layout of the single-qubit gates of the $l$-th layer is identical between $a'$ and $a^{(t)}$ for any $l \in [L]$, then $A$ uses the training parameters $\theta^{(t)}$ assigned to $U_l(\theta^{(t)}, a^{(t)})$ to parameterize $U_l(\theta', a')$, regardless of variations in the layout of other layers. We remark that this parameterization is efficient, since it can be accomplished by comparing the generated index list with the stored index lists. In addition, the above correlated updating rule implies that the parameters of unsampled ansatze are never stored in the classical memory. To this end, even though the size of the ansatze pool scales exponentially in terms of $N$ and $L$, QAS harnesses the supernet and the weight sharing strategy to guarantee its applicability to large-scale problems.
3. (Ranking.) After $T$ iterations, QAS uniformly samples $K$ ansatze from $S$ (i.e., $K$ index lists generated by $A$), ranks their performance, and then assigns the ansatz with the best performance as the output to estimate $a^*$. Mathematically, denoting $\mathcal{K}$ as the set collecting the sampled $K$ ansatze, the output ansatz is
$$\arg\min_{a \in \mathcal{K}} \mathcal{L}(\theta^{(T)}, a, Z, \mathcal{E}_a). \qquad (4)$$
In QAS, $K$ is a hyper-parameter that balances the tradeoff between efficiency and performance. To avoid an exponential runtime complexity of QAS, the setting of $K$ should scale polynomially with $N$, $L$, and $Q$. Besides random sampling, other methods such as evolutionary algorithms can also be used to establish $\mathcal{K}$ with better performance. See Supplementary D for details.
4. (Fine tuning.) QAS employs the trained parameters $\theta^{(T)}$ to fine tune the output ansatz in Eqn. (4).
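The four steps above can be sketched end to end on a deliberately small search space. The code below is a hypothetical, noiseless toy rather than the implementation used in this study: it assumes a single qubit (N = 1), three layers, the gate choices {R_Y, R_Z}, the objective ⟨Z⟩, and a shared weight table keyed by (layer, gate index), which for one qubit coincides with sharing weights whenever the single-qubit layout of a layer matches. The noise channel is omitted and all names and hyper-parameters are illustrative assumptions.

```python
import cmath
import math
import random

GATES = ("RY", "RZ")   # Q = 2 gate choices; an ansatz is a tuple of gate indices
NUM_LAYERS = 3         # L = 3 layers acting on a single qubit (N = 1)


def apply_gate(kind, theta, state):
    """Apply R_Y(theta) or R_Z(theta) to a one-qubit state (amp0, amp1)."""
    amp0, amp1 = state
    if kind == "RY":
        c, s = math.cos(theta / 2), math.sin(theta / 2)
        return (c * amp0 - s * amp1, s * amp0 + c * amp1)
    phase = cmath.exp(-1j * theta / 2)
    return (phase * amp0, phase.conjugate() * amp1)


def loss(ansatz, shared):
    """Noiseless <Z> of the state prepared by `ansatz` using the shared weights."""
    state = (1 + 0j, 0j)
    for layer, gate in enumerate(ansatz):
        state = apply_gate(GATES[gate], shared[(layer, gate)], state)
    return abs(state[0]) ** 2 - abs(state[1]) ** 2


def gradient_step(ansatz, shared, eta):
    """One parameter-shift gradient step on the weights used by `ansatz` only."""
    grads = {}
    for key in enumerate(ansatz):          # key = (layer, gate index)
        plus, minus = dict(shared), dict(shared)
        plus[key] += math.pi / 2
        minus[key] -= math.pi / 2
        grads[key] = 0.5 * (loss(ansatz, plus) - loss(ansatz, minus))
    for key, g in grads.items():
        shared[key] -= eta * g


def sample_ansatz(rng):
    """Uniformly sample an index list from the ansatze pool."""
    return tuple(rng.randrange(len(GATES)) for _ in range(NUM_LAYERS))


def qas(T=200, K=8, fine_tune_steps=50, eta=0.2, seed=0):
    rng = random.Random(seed)
    # Step 1: the supernet is the index space plus one weight per (layer, gate).
    shared = {(l, g): rng.uniform(-0.1, 0.1)
              for l in range(NUM_LAYERS) for g in range(len(GATES))}
    # Step 2: sample an ansatz uniformly and update only the weights it uses.
    for _ in range(T):
        gradient_step(sample_ansatz(rng), shared, eta)
    # Step 3: rank K sampled ansatze with the trained shared weights.
    best = min((sample_ansatz(rng) for _ in range(K)),
               key=lambda a: loss(a, shared))
    # Step 4: fine tune the selected ansatz for a few iterations.
    for _ in range(fine_tune_steps):
        gradient_step(best, shared, eta)
    return best, loss(best, shared)


best_ansatz, best_loss = qas()
print(best_ansatz, best_loss)
```

Only six weights are stored for the eight candidate ansatze in this toy pool, and an ansatz made of R_Z gates alone can never lower ⟨Z⟩ below 1, so the ranking step has a meaningful choice to make.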
We empirically observe fierce competition among different ansatze in S when optimizing QAS (see Supplementary B for details). Namely, suppose S can be decomposed into two subsets S_good and S_bad, where the subset S_good (S_bad) collects the ansatze that attain relatively good (bad) performance when trained independently. For instance, in the classification task, an ansatz in S_good (S_bad) promises a classification accuracy above (below) 99%. However, when we apply QAS to accomplish the same classification task, some ansatze in S_bad may outperform certain ansatze in S_good. This observation hints at the hardness of accurately optimizing correlated trainable parameters among all ansatze, where the learning performance of a portion of the ansatze in S_good is no better than that obtained by training them independently.
To relieve the fierce competition among ansatze in $S$ and further boost the performance of QAS, we slightly modify the initialization and optimization steps of QAS. Specifically, instead of exploiting a single supernet, QAS involves $W$ supernets to optimize the objective function in Eqn. (3). The weight sharing strategies applied to the $W$ supernets are independent of each other, and the parameters corresponding to the $W$ supernets are separately initialized and updated. At the training and ranking stages, the $W$ supernets separately utilize the weight sharing strategy to parameterize the sampled ansatz $a^{(t)}$ and obtain $W$ values of $\mathcal{L}(\theta^{(t,w)}, a^{(t)}, Z, \mathcal{E}_{a^{(t)}})$, where $\theta^{(t,w)}$ refers to the parameters corresponding to the $w$-th supernet. Then, the ansatz $a^{(t)}$ is assigned to the $w'$-th supernet, with $w' = \arg\min_{w \in [W]} \mathcal{L}(\theta^{(t,w)}, a^{(t)}, Z, \mathcal{E}_{a^{(t)}})$.
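The assignment rule for W supernets can be sketched as follows. This is an illustrative sketch only: `toy_loss` is a hypothetical stand-in for the objective in Eqn. (3), each supernet is represented by a plain dictionary of shared weights keyed by (layer, gate index), and the numbers are made up.

```python
def route_to_supernet(ansatz, supernets, loss_fn):
    """Return (w', loss), where w' is the supernet giving `ansatz` the lowest loss."""
    losses = [loss_fn(ansatz, weights) for weights in supernets]
    w_best = min(range(len(supernets)), key=lambda w: losses[w])
    return w_best, losses[w_best]


def toy_loss(ansatz, weights):
    """Hypothetical objective: squared distance of the used weights to a target of 1."""
    return sum((weights[(layer, gate)] - 1.0) ** 2
               for layer, gate in enumerate(ansatz))


# W = 3 supernets, each holding its own independently initialized weights.
supernets = [
    {(0, 0): 0.0, (0, 1): 0.9, (1, 0): 0.1, (1, 1): 1.2},
    {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.8, (1, 1): 0.0},
    {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.5},
]

w_prime, value = route_to_supernet((0, 0), supernets, toy_loss)
print(w_prime, value)
```

Only the weights of the selected supernet would then be updated for this ansatz, so ansatze that compete under one set of shared weights can settle in different supernets.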
We last emphasize how QAS enhances the learning performance of the hardware-efficient ansatz $U(\theta)$ in Eqn. (2). Recall that the central aim of QAS is to seek a good ansatz associated with optimized parameters that minimizes $\mathcal{L}(\theta, a, Z, \mathcal{E}_a)$ in Eqn. (3). In other words, given $U = \prod_{l=1}^{L} U_l(\theta)$, a good ansatz is located by dropping some unnecessary multi-qubit gates and substituting single-qubit gates in $U_l(\theta)$ for all $l \in [L]$. Along this line, several studies have proved that removing multi-qubit gates to reduce the entanglement of the ansatz helps to alleviate barren plateaus [39,40]. In addition, a recent study [41] unveiled that the choice of the quantum circuit architecture can significantly affect the expressive power of the ansatz and the learning performance. Since the objective function of QAS implicitly evaluates the effect of different ansatze, our proposal can be employed as a powerful tool to enhance the learning performance of VQAs. Refer to Method for further explanation of the role of the supernet and weight sharing, and for the analysis of the memory cost and runtime complexity of QAS.
Simulation and experimental results. The proposed QAS is universal and facilitates a wide range of VQA-based learning tasks, e.g., machine learning [42][43][44][45], quantum chemistry [6,14], and quantum information processing [46,47]. In the following, we separately apply QAS to a classification task and a variational quantum eigen-solver (VQE) task to confirm its capability to enhance performance. All numerical simulations are implemented in Python in conjunction with the PennyLane and Qiskit packages [48,49]. Specifically, PennyLane is the backbone to implement QAS and Qiskit supports different types of noise models. We defer the explanation of basic terminologies in machine learning and quantum chemistry to Appendices B and C.
Here we first apply QAS to a binary classification task under both the noiseless and noisy scenarios. Denote $D$ as the synthetic dataset, whose construction rule follows the proposal of the quantum kernel classifier [11]. The dataset $D$ contains $n = 300$ samples. For each example $\{x^{(i)}, y^{(i)}\}$, the feature dimension of the input $x^{(i)}$ is 3 and the corresponding label $y^{(i)} \in \{0, 1\}$ is binary. Examples of $D$ are shown in Figure 2. At the data preprocessing stage, we split the dataset $D$ into the training set $D_{tr}$, validation set $D_{va}$, and test set $D_{te}$ with sizes $n_{tr} = 100$, $n_{va} = 100$, and $n_{te} = 100$. The explicit form of the objective function is [...] where $\{x^{(i)}, y^{(i)}\} \in D_{tr}$ and $\tilde{y}^{(i)}(A, x^{(i)}, \theta) \in [0, 1]$ is the output of the quantum classifier (i.e., a function taking the input $x^{(i)}$, the supernet $A$, and the trainable parameters $\theta$). The training (validation and test) accuracy is measured by $\sum_i \mathbb{1}_{g(\tilde{y}^{(i)}) = y^{(i)}} / n_{tr}$ ($\sum_i \mathbb{1}_{g(\tilde{y}^{(i)}) = y^{(i)}} / n_{va}$ and $\sum_i \mathbb{1}_{g(\tilde{y}^{(i)}) = y^{(i)}} / n_{te}$), with $g(\tilde{y}^{(i)})$ being the predicted label for $x^{(i)}$. We also apply the quantum kernel classifier proposed by [11] to learn $D$ and compare its performance with QAS, where the implementation of such a quantum classifier is shown in Figure 2 (b). See Supplementary B for more discussion about the construction of $D$ and the employed quantum kernel classifier. The hyper-parameters for QAS are as follows. The number of supernets is $W = 1$ and $W = 5$, respectively. The circuit depth for all supernets is set as $L = 3$. The search space of QAS is formed by two types of quantum gates. Specifically, at each layer $U_l(\theta)$, the parameterized gates are fixed to be the rotational quantum gate along the $Y$-axis, $R_Y$. For the two-qubit gates, denoting the indices of the three qubits as $(0, 1, 2)$, QAS explores whether or not to apply CNOT gates to the qubit pairs $(0, 1)$, $(0, 2)$, $(1, 2)$. Hence, the size of $S$ equals $|S| = 8^3$.
The number of sampled ansatze for ranking is set as K = 500. The setting K ≈ |S| enables us to understand how the number of supernets W, the number of epochs T, and the system noise affect the learning performance of different ansatze in the ranking stage.
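The accuracy measure used for the training, validation, and test sets can be written as a short helper. This is a sketch under one explicit assumption: the text does not spell out the map g from the classifier output to a predicted label, so thresholding at 0.5 is a hypothetical choice made here for illustration, and the sample values are made up.

```python
def predict_label(y_tilde, threshold=0.5):
    """Assumed form of g: map a classifier output in [0, 1] to a binary label."""
    return 1 if y_tilde >= threshold else 0


def accuracy(outputs, labels):
    """Fraction of examples whose predicted label equals the true label."""
    correct = sum(predict_label(o) == y for o, y in zip(outputs, labels))
    return correct / len(labels)


# Four made-up classifier outputs with their true labels.
print(accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0]))
```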
Under the noiseless scenario, the performance of QAS with three different settings is exhibited in Figure 2 (d).
In particular, QAS with W = 1 and T = 10 attains the worst performance, where the validation accuracy of most ansatze concentrates in the range 50%-60%, highlighted by the green bar. When the number of epochs is increased to T = 400 with W = 1 fixed, the performance is slightly improved, i.e., the number of ansatze that achieve a validation accuracy above 90% is 30, highlighted by the yellow bar. When W = 5 and T = 400, the performance of QAS is dramatically enhanced, where the validation accuracy of 151 ansatze is above 90%. The comparison between the first two settings indicates the correctness of utilizing QAS to accomplish VQA-based learning tasks, in that QAS learns useful feature information and achieves better performance as the epoch number T increases. The varied performance of the last two settings reflects the fierce competition among ansatze and validates the feasibility of adopting W > 1 to boost the performance of QAS. When we retrain the output ansatz of QAS under the setting W = 5 and T = 400, both the training and test accuracies converge to 100% within 15 epochs, which is identical to the original quantum kernel classifier.
The performance of the original quantum kernel classifier is evidently degraded when the depolarizing error for the single-qubit and two-qubit gates is set as 0.05 and 0.2, respectively. As shown in the lower plot of Figure 2 (f), the training and test accuracies of the original quantum kernel classifier drop to 50% (almost equivalent to a random guess) under the noisy setting. The degraded performance is caused by the large amount of accumulated noise, where the classical optimizer fails to receive valid optimization information. By contrast, QAS can achieve good performance under the same noise setting. As shown in Figure 2 (e), with W = 5 and T = 400, the validation accuracy of 115 ansatze is above 90% under the noisy setting. The ansatz that attains the highest validation accuracy is shown in Figure 2 (c). Notably, compared with the original quantum kernel classifier in Figure 2 (b), the searched ansatz contains fewer CNOT gates. This implies that, under the noisy setting formulated above, QAS suppresses the noise effect and improves the training performance by adopting few CNOT gates. When we retrain the obtained ansatz for 10 epochs, both the training and test accuracies achieve 100%, as shown in the upper plot of Figure 2 (f). These results indicate the feasibility of applying QAS to achieve noise inhibition and trainability enhancement.
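To give a feeling for how accumulated noise washes out the signal, the following sketch tracks a single qubit through repeated noisy gates. It is an illustration under stated assumptions and not the noise model used in the simulations above: it assumes the single-qubit depolarizing channel in the convention ρ → (1 − p)ρ + p I/2, applied after every $R_Y(\theta)$ gate, and represents the state by its Bloch vector in the x-z plane. The function name is hypothetical.

```python
import math


def noisy_expectation_z(theta, num_gates, p):
    """<Z> after `num_gates` R_Y(theta) gates on |0>, each followed by depolarizing noise."""
    x, z = 0.0, 1.0                      # Bloch vector of |0> in the x-z plane
    for _ in range(num_gates):
        # R_Y(theta) rotates the Bloch vector by theta in the x-z plane.
        x, z = (x * math.cos(theta) + z * math.sin(theta),
                z * math.cos(theta) - x * math.sin(theta))
        # The depolarizing channel shrinks the Bloch vector by a factor (1 - p).
        x, z = (1 - p) * x, (1 - p) * z
    return z


for k in (1, 5, 10, 20):
    print(k, noisy_expectation_z(0.3, k, 0.05))
```

Under these assumptions the expectation value equals $(1 - p)^k \cos(k\theta)$ after $k$ gates, so the measurable signal, and with it the gradient, decays exponentially with the gate count.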
We defer the omitted simulation results and the exploration of fierce competition to Supplementary B. In particular, we assess the learning performance of the quantum classifier with the hardware-efficient ansatz and the ansatz searched by QAS under the noise model extracted from the real quantum device, i.e., 'Ibmq_lima'. The achieved simulation result indicates that the ansatz obtained by QAS outperforms the conventional quantum classifier.
We next apply QAS to find the ground state energy of the hydrogen molecule [13,50] under both the noiseless and noisy scenarios. The molecular hydrogen Hamiltonian is formulated as [...] To tackle this task, the conventional VQE [6] and its variants [7][8][9] optimize the trainable parameters in $U(\theta)$ to prepare the ground state $|\psi$ [...] Figure 3 (a). Under the noiseless setting, the estimated energy of VQE quickly converges to the target result $E_m$ within 40 iterations, as shown in Figure 3 (c). The hyper-parameters of QAS to compute the lowest energy eigenvalue of $H_h$ are as follows. The number of supernets has two settings, i.e., $W = 1$ and $W = 5$. The layer number for all ansatze is $L = 3$. The number of iterations and the number of sampled ansatze for ranking are $T = 500$ and $K = 500$, respectively. The search space of QAS for the single-qubit gates is fixed to be the rotational quantum gates along the $Y$- and $Z$-axes. For the two-qubit gates, denoting the indices of the four qubits as $(0, 1, 2, 3)$, QAS explores whether or not to apply CNOT gates to the qubit pairs $(0, 1)$, $(1, 2)$, $(2, 3)$. Therefore, the total number of ansatze equals $|S| = 128^3$. The performance of QAS with $W = 5$ is shown in Figure 3 (d). After retraining the obtained ansatz of QAS for 50 iterations, the estimated energy converges to $E_m$, which is the same as for the conventional VQE.
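The two search-space sizes quoted for the classification and VQE tasks can be checked by direct enumeration. The helper names below are hypothetical, and the sketch assumes that each layer independently picks one single-qubit gate per qubit and an on/off flag per candidate CNOT pair, as described in the text.

```python
from itertools import product


def layer_choices(num_qubits, single_qubit_gates, cnot_pairs):
    """Enumerate every layout of one layer: a gate per qubit plus an on/off flag per CNOT pair."""
    singles = product(single_qubit_gates, repeat=num_qubits)
    cnots = product((False, True), repeat=len(cnot_pairs))
    return list(product(singles, cnots))


def pool_size(num_qubits, single_qubit_gates, cnot_pairs, num_layers):
    """Number of ansatze when every layer is chosen independently."""
    per_layer = len(layer_choices(num_qubits, single_qubit_gates, cnot_pairs))
    return per_layer ** num_layers


# Classification: 3 qubits, R_Y fixed, CNOT on/off for three pairs, L = 3.
classification = pool_size(3, ("RY",), [(0, 1), (0, 2), (1, 2)], num_layers=3)
# VQE: 4 qubits, R_Y or R_Z per qubit, CNOT on/off for three pairs, L = 3.
vqe = pool_size(4, ("RY", "RZ"), [(0, 1), (1, 2), (2, 3)], num_layers=3)
print(classification, vqe)
```

The classification pool has 8 layouts per layer and the VQE pool has 16 × 8 = 128, which gives 8^3 = 512 and 128^3 = 2097152 ansatze respectively. With K = 500, ranking therefore covers a number of samples close to the whole classification pool but only a tiny fraction of the VQE pool.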
The performance of the conventional VQE and that of QAS differ markedly when the noise model described in the classification task is deployed. Due to the large amount of gate noise, the estimated ground energy of the conventional VQE converges to −0.4 Ha, as shown in Figure 3 (c). In contrast, the estimated ground energy of QAS with W = 1 and W = 5 reaches −0.93 Ha and −1.05 Ha, respectively. Both are closer to the target result E_m than the conventional VQE. Moreover, as shown in Figure 3 (e), a larger W implies better performance of QAS, since the estimated energy of most ansatze is below −0.6 Ha when W = 5, while the estimated energy of 350 ansatze is above 0 Ha when W = 1. We illustrate the generated ansatz of QAS with W = 5 in Figure 3 (b). In particular, to mitigate the effect of gate noise, this generated ansatz does not contain any CNOT gate, since CNOT gates are subject to a very large noise level. Recall that a central challenge in quantum computational chemistry is whether NISQ devices can outperform the classical methods already available [51]. The results achieved by QAS can provide good guidance on this issue. Concretely, the searched ansatz in Figure 3, which only produces separable states that can be efficiently simulated by classical devices, suggests that the VQE method may not outperform classical methods when NISQ devices contain large gate noise.
Note that more simulation results are deferred to the Supplementary. Specifically, in Supplementary C, we exhibit more results of the above task. Furthermore, we implement VQE with the hardware-efficient ansatz and with the ansatz searched by QAS on real superconducting quantum hardware, i.e., 'Ibmq_ourense', to estimate the ground state energy of H_h. Due to the runtime issue, we complete the optimization and ranking using the classical backend and perform the final runs on the IBMQ cloud. The experimental result indicates that the ansatz obtained by QAS outperforms the conventional VQE, where the estimated energy of the former is −0.96 Ha while that of the latter is −0.61 Ha. Then, in Supplementary D, we show that utilizing evolutionary algorithms to establish K can dramatically improve the performance of QAS. Subsequently, in Supplementary E, we provide numerical evidence that QAS can alleviate the influence of barren plateaus. Last, we present a variant of QAS to tackle large-scale problems with enhanced performance in Supplementary F.

DISCUSSION
In this study, we devise QAS to dynamically and automatically design ansatze for VQAs. Both simulation and experimental results validate the effectiveness of QAS. Besides good performance, QAS only requests computational resources similar to those of conventional VQAs with fixed ansatze and is compatible with all quantum systems. By incorporating QAS with other advanced error mitigation and trainability enhancement techniques, it is possible to seek more applications that can be realized on NISQ machines with potential advantages.
There are many critical questions remaining in the study of QAS. Our future work includes the following directions. First, we will explore better strategies to sample an ansatz at each iteration. For example, reinforcement learning techniques, which are used to construct optimal sequences of unitaries to accomplish quantum simulation tasks [52], may contribute to this goal. Next, we will design a more advanced strategy to shrink the parameter space without degrading the learning performance. Subsequently, to further boost the performance of QAS, we will leverage prior information about the learning problem, such as symmetry properties, and post-processing strategies that remove redundant gates from the searched ansatz. In addition, we will delve into theoretically understanding the fierce competition. Finally, it is intriguing to explore applications of QAS beyond VQAs, such as optimal quantum control and the approximation of a target unitary using limited quantum gates.

M.1 The classical analog of QAS
The classical analog of the learning problem in Eqn. (3) is neural network architecture search [38]. Recall that the success of deep learning is largely attributed to novel neural architectures for specific learning tasks, e.g., convolutional neural networks for image processing tasks [53]. However, designing deep neural networks by hand is generally time-consuming and error-prone, even for human experts [38]. To tackle this issue, the neural architecture search approach, i.e., the process of automating architecture engineering, has been widely explored and has achieved state-of-the-art performance in many learning tasks [54][55][56][57][58]. Despite having a similar aim, naively generalizing classical results to the quantum scenario to accomplish Eqn. (3) is infeasible due to the distinct basic components: neurons versus quantum gates, classical correlation versus entanglement, the barren plateau phenomenon, the effect of quantum noise, and physical hardware restrictions. These differences and extra limitations further intensify the difficulty of searching for the optimal quantum circuit architecture a*, compared with the classical setting. In the following, we explain the omitted implementation details of QAS.

M.2 Weight sharing strategy
The role of the weight sharing strategy is to reduce the parameter space so as to enhance the learning performance of QAS within a reasonable runtime and memory usage. Intuitively, this strategy correlates parameters among different ansatze in $S$ based on a specified rule. In this way, we can jointly optimize $(\theta, a)$ to estimate $(\theta^*, a^*)$, where the updated parameters for one ansatz can also enhance the learning performance of other ansatze when the correlation criterion is satisfied. As explained in Figure 4, the weight sharing strategy adopted in QAS squeezes the parameter space from $O(dQ^{NL})$ to $O(dLQ^{N})$. Meanwhile, our simulation results indicate that the reduction of the parameter space enables QAS to achieve good performance within a reasonable runtime complexity.
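The reduction in the number of stored parameter blocks can be checked by brute force on a tiny pool. The sketch below is illustrative only: it considers single-qubit layouts alone, treats the parameters of one layer as one block, and keys each shared block by the hypothetical pair (layer index, layout), so that two ansatze reuse a block whenever the layout of that layer coincides.

```python
from itertools import product


def count_parameter_blocks(Q, N, L):
    """Return (#ansatze, #shared blocks) for Q gate choices, N qubits, L layers."""
    layouts = list(product(range(Q), repeat=N))    # Q**N single-qubit layouts per layer
    ansatze = list(product(layouts, repeat=L))     # Q**(N*L) ansatze in the pool
    shared_keys = {(layer, layout)
                   for ansatz in ansatze
                   for layer, layout in enumerate(ansatz)}
    return len(ansatze), len(shared_keys)


print(count_parameter_blocks(Q=2, N=2, L=2))
print(count_parameter_blocks(Q=2, N=2, L=3))
```

Training every ansatz independently needs one set of parameters per ansatz, i.e., a number of sets that grows as $Q^{NL}$, whereas the shared table never holds more than $L Q^{N}$ blocks, in line with the scaling stated above.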
We remark that by adjusting the correlation criterion applied to the weight sharing strategy, the parameter space can be further reduced. For instance, when all parameters in an ansatz are correlated, the size of the parameter space reduces to O(1). In this regard, another feasible correlation rule for QAS is unifying the single-qubit gates for all ansatze as [...] In other words, QAS only adjusts the arrangement of two-qubit gates to enhance the learning performance. From a practical perspective, this setting is reasonable since the gate error introduced by single-qubit gates is much smaller than that of two-qubit gates.

M.3 Supernet
We next elucidate supernet used in QAS. As explained in the main text, supernet has two important roles, which are constructing the ansatze pool S and parameterizing each ansatz in S via the specified weight sharing strategy. In other words, supernet defines the search space, which subsumes all candidate ansatze, and the candidate ansatze in S are evaluated through inheriting weights from the supernet. Rather than training numerous separate ansatze from scratch, QAS trains supernet just once (Step 2 in Figure 1), which significantly cuts down the search cost.
We next explain how QAS leverages the indexing technique to construct S when the available quantum gates include both single-qubit and two-qubit gates. [...] We first analyze the runtime complexity of QAS. In particular, at the first step, the setup of the supernet, i.e., configuring the ansatze pool and the correlation rule, takes O(1) runtime. In the second step, QAS proceeds for T iterations to optimize the trainable parameters. The runtime cost of QAS at each iteration scales with O(d), where d refers to the number of trainable parameters in Eqn. (1). This cost originates from the calculation of gradients via the parameter-shift rule, which is similar to the optimization of VQAs with a fixed ansatz. To this end, the total runtime cost of the second step is O(dT). In the ranking step, QAS samples K ansatze and compares their objective values using the optimized parameters. This step takes at most O(K) runtime. In the last step, QAS fine tunes the parameters of the searched ansatz with few iterations (i.e., a very small constant). The required runtime is identical to that of conventional VQAs, which is O(d). The total runtime complexity of QAS is hence O(dT + K).
We next analyze the memory cost of QAS. The first step requires $O(QNL)$ memory to specify the ansatze pool via the indexing technique. The memory cost in this step is dominated by configuring the index space: in the worst case, the $Q$ allowed choices of quantum gates differ for every qubit at every layer, so storing the gate choices for all qubits and all layers costs $O(QNL)$ memory. In the second step, QAS outputs in total $T$ index lists corresponding to the architectures of $T$ ansatze, which requires at most $O(TNL)$ memory. Moreover, QAS explicitly updates at most $Td$ parameters (we omit the parameters that are implicitly updated via the weight sharing strategy, since they do not consume memory). Hence, the memory cost of the second step is $O(TNL + Td)$. In the third step, QAS samples $K$ index lists that describe the circuit architectures of $K$ ansatze, which requires at most $O(KNL)$ memory. Moreover, according to the weight sharing strategy, the memory cost of storing the corresponding parameters is $O(Kd)$. The memory cost of the last step is identical to that of conventional VQAs with a fixed ansatz, which is $O(d)$. The total memory cost of QAS is hence $O(QNL + TNL + Td + KNL + Kd)$.
To better understand how the computational complexity scales with $N$, $L$, and $Q$, in the following we set the total number of iterations in Step 2 and the number of sampled ansatze in Step 3 as $T = O(QNL)$ and $K = O(QNL)$, respectively. Note that since the size of S becomes indefinite, it is reasonable to set $K$ as $O(QNL)$ instead of the constant used in the numerical simulations. Under these settings, we conclude that the runtime complexity and the memory cost of QAS are $O(dQNL)$ and $O(dQNL + QN^2L^2)$, respectively.
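The step-by-step accounting above can be tallied numerically. The following is a minimal sketch that simply adds up the leading-order terms named in the text with all constants dropped; the function name `qas_costs` and the example values of Q, N, L, and d are illustrative assumptions, not part of the QAS source code.

```python
def qas_costs(d, T, K, Q, N, L):
    """Tally the leading-order runtime and memory terms of the four QAS steps."""
    runtime = {
        "setup": 1,            # Step 1: O(1)
        "training": d * T,     # Step 2: O(dT) via the parameter shift rule
        "ranking": K,          # Step 3: O(K)
        "fine_tuning": d,      # Step 4: O(d)
    }
    memory = {
        "index_space": Q * N * L,        # Step 1: O(QNL)
        "training": T * N * L + T * d,   # Step 2: O(TNL + Td)
        "ranking": K * N * L + K * d,    # Step 3: O(KNL + Kd)
        "fine_tuning": d,                # Step 4: O(d)
    }
    return runtime, memory


# Illustrative (assumed) sizes, with T = K = QNL as in the scaling argument.
Q, N, L, d = 3, 4, 3, 12
T = K = Q * N * L
runtime, memory = qas_costs(d, T, K, Q, N, L)
total_runtime = sum(runtime.values())
total_memory = sum(memory.values())
print(total_runtime, total_memory)
```

With T = K = QNL, the training term dQNL dominates the runtime, while the memory splits into the dQNL and QN^2L^2 contributions quoted in the text.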
We remark that when W supernets are involved, the required memory cost and runtime complexity of QAS linearly scales with respect to W . Moreover, employing adversarial bandit learning techniques [59] can exactly remove this overhead (See Supplementary A for details).

DATA AVAILABILITY
The datasets generated and/or analyzed during the current study are available from Y.D. on reasonable request.

CODE AVAILABILITY
The source code of QAS to reproduce all numerical experiments is available on the GitHub repository https://github.com/yuxuan-du/Quantum_architecture_search/.

COMPETING INTERESTS
The authors declare no competing interests.

We organize the Supplementary as follows. In Supplementary A, we establish the connection between the bandit learning and the ansatz assignment task and discuss how to exploit bandit learning algorithms to further advance the ansatz assignment task. We then provide explanations and simulation results related to the fierce competition phenomenon and the classification task in Supplementary B. Afterwards, we present simulation and experiment details about the quantum chemistry tasks in Supplementary C. Subsequently, we exhibit how to introduce evolutionary algorithms into the ranking stage to boost the performance of QAS in Supplementary D. Next, we empirically explore the trainability of QAS through the lens of barren plateaus in Supplementary E. Last, we demonstrate a variant of QAS to effectively accomplish large-scale problems in Supplementary F.

A. The ansatz assignment task
In this section, we first connect the ansatz assignment task with the adversarial bandit learning problem. We then compare the method used in QAS with all bandit algorithms in terms of the regret measure. Finally, we explain how to employ advanced bandit learning algorithms to reduce the runtime complexity of the ansatz assignment task.

The connection between the adversarial bandit learning and the ansatz assignment
Let us first introduce the adversarial bandit learning. In the adversarial bandit learning [59], a player has $W$ possible arms to choose from. Denote the total number of iterations as $T$. At the $t$-th iteration,
• The player chooses an arm $w^{(t)} \in [W]$ with a deterministic strategy or by sampling from a certain distribution $P_w$;
• The adversary chooses a cost $c^{(t)}(w^{(t)})$ for the chosen arm $w^{(t)}$;
• The cost of the selected arm $w^{(t)}$, i.e., $c^{(t)}(w^{(t)})$ with $w^{(t)} \in [W]$, is revealed to the player.
The goal of the adversarial bandit learning is to minimize the total cost over $T$ iterations, where its performance is quantified by the regret $r_T$, i.e.,
$$r_T = \sum_{t=1}^{T} c^{(t)}(w^{(t)}) - \min_{w \in [W]} \sum_{t=1}^{T} c^{(t)}(w).$$
Intuitively, the regret $r_T$ compares the cumulative cost of the selected arms $\{w^{(t)}\}_{t=1}^{T}$ with that of the best arm in hindsight. If $r_T = o(T)$, i.e., the regret is either negative or scales at most sublinearly with $T$, we say that the player is learning; otherwise, when $r_T = \Theta(T)$ such that the regret scales linearly with $T$, we say that the player is not learning, since the average cost per iteration does not decrease with time.
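The regret can be illustrated with a small numerical example. The sketch below assumes the standard definition described in the text, namely the cumulative cost of the chosen arms minus the cumulative cost of the best single arm in hindsight; the cost table and the function name `regret` are made up for illustration.

```python
def regret(costs, chosen):
    """costs[t][w] is the cost of arm w at round t; chosen[t] is the arm picked at round t."""
    incurred = sum(costs[t][w] for t, w in enumerate(chosen))
    num_arms = len(costs[0])
    # Cumulative cost of the best fixed arm in hindsight.
    best_fixed = min(sum(row[w] for row in costs) for w in range(num_arms))
    return incurred - best_fixed


# Toy cost table with T = 4 rounds and W = 2 arms (illustrative values only).
costs = [[0.2, 0.9], [0.8, 0.1], [0.3, 0.7], [0.4, 0.6]]
always_arm_1 = regret(costs, [1, 1, 1, 1])
best_fixed_arm = regret(costs, [0, 0, 0, 0])
print(always_arm_1, best_fixed_arm)
```

Playing the best fixed arm yields zero regret, while a player who adapts the arm per round can even achieve a negative value.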
We now utilize the language of the adversarial bandit learning to restate the ansatz assignment problem. In QAS, each arm refers to a supernet and the number of arms equals the number of supernets. The cost $c^{(t)}(w^{(t)})$ is equivalent to the objective function $L(\theta^{(t,w)}, a^{(t)})$ in Eqn. (3), where $a^{(t)} \in S$ refers to the sampled ansatz and $\theta^{(t,w)}$ represents the trainable parameters of the $w$-th supernet $A^{(w)}$. The aim of the ansatz assignment is to allocate $\{a^{(t)}\}_{t=1}^{T}$ to the best sequence of arms (supernets) to minimize the cumulative cost. Denote the selected sequence of arms (indices of supernets) of QAS as {I

The comparison between the strategy used in QAS and all bandit algorithms
The following theorem shows that the strategy used in QAS outperforms all bandit algorithms in terms of the regret measure.
Theorem 1. Let W and T be the number of supernets and iterations, respectively. Suppose that the ansatz a (t) is assigned to the I where the randomness is over the selection of I w } promises the regret R T ≤ 0, while the regret for the best bandit algorithms is lower bounded by R T = Ω(T ).
The proof of Theorem 1 exploits the following lemma.
We are now ready to prove Theorem 1.
Proof of Theorem 1. Here we first prove the regret $R_T$ in Eqn. (A3) for the assignment strategy employed in QAS. We then quantify the lower bound of $R_T$ for all adversarial bandit algorithms.
Recall the assignment strategy used in QAS. Given the sampled ansatz $a^{(t)} \in S$, QAS feeds this ansatz into $W$ supernets and compares the $W$ values of the objective function, i.e., $\{L(\theta^{(t,w)}, a^{(t)})\}_{w=1}^{W}$. Then, the ansatz $a^{(t)}$ is assigned to the supernet that attains the smallest of these values. The regret of this strategy is therefore non-positive, i.e., $R_T \le 0$, which follows from the fact that the sum of the minimum values of functions is no larger than the minimum value of their sum (i.e., $\sum_t \min_x f_t(x) \le \min_x \sum_t f_t(x)$, with equality when all functions $\{f_t(x)\}$ share the same minimizer). Denote the regret $R_T$ in Eqn. (A3) obtained by a given bandit algorithm as $R_T^B$. Due to Lemma 1, for the ansatz assignment task, the regret for all adversarial bandit algorithms is lower bounded by $R_T^B \ge \Omega(WT\log(1/\delta))$ with probability δ. Based on Eqn. (A6) and Eqn. (A7), we conclude that with high probability, no bandit learning algorithm can achieve a lower regret than that of the strategy adopted in QAS.
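The inequality used in the proof, that a sum of minima never exceeds the minimum of sums, can be checked numerically. The sketch below uses randomly generated loss values as a stand-in for the objective values of the sampled ansatze on each supernet; the numbers and the function name are illustrative assumptions only.

```python
import random


def greedy_assignment_regret(losses):
    """losses[t][w] is the objective value of the t-th sampled ansatz on supernet w."""
    # Assign each ansatz to its best supernet: sum over t of min over w.
    greedy_total = sum(min(row) for row in losses)
    num_supernets = len(losses[0])
    # Best single supernet in hindsight: min over w of sum over t.
    best_single = min(sum(row[w] for row in losses) for w in range(num_supernets))
    return greedy_total - best_single


rng = random.Random(0)
losses = [[rng.random() for _ in range(5)] for _ in range(200)]  # T = 200, W = 5
r = greedy_assignment_regret(losses)

# Equality case: every round has the same minimizer (supernet 0).
same_minimizer = [[0.1, 0.5], [0.1, 0.5], [0.1, 0.5]]
r_equal = greedy_assignment_regret(same_minimizer)
print(r, r_equal)
```

The price of this non-positive regret is that every sampled ansatz has to be evaluated on all W supernets.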

Applying bandit learning algorithms to the ansatz assignment task
Here we discuss how to apply bandit learning algorithms to improve the ansatz assignment task in terms of the runtime cost. Recall the ansatz assignment strategy used in QAS. At each iteration, the sampled ansatz is fed into $W$ supernets separately and the returned $W$ objective values are compared. In this way, the runtime becomes expensive for a large $W$, as discussed in Method. Adversarial bandit learning algorithms are a promising solution to this runtime issue. As explained in Supplementary A 1, when adversarial bandit learning algorithms are employed, the ansatz only needs to be fed into one supernet at each iteration, at the price of a relatively large regret bound.
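As one concrete possibility, the sketch below uses the classical EXP3 algorithm for adversarial bandits to pick a single supernet per iteration. The text does not say which bandit algorithm would be used, so EXP3, the exploration rate `gamma`, and the toy cost function are all assumptions made for illustration; costs are assumed to lie in [0, 1].

```python
import math
import random


def exp3_assign(cost_fn, num_supernets, num_rounds, gamma=0.1, seed=0):
    """Choose one supernet per round with EXP3; cost_fn(t, arm) must return a cost in [0, 1]."""
    rng = random.Random(seed)
    weights = [1.0] * num_supernets
    chosen = []
    for t in range(num_rounds):
        total = sum(weights)
        # Mix the weight-based distribution with uniform exploration.
        probs = [(1 - gamma) * w / total + gamma / num_supernets for w in weights]
        arm = rng.choices(range(num_supernets), weights=probs)[0]
        cost = cost_fn(t, arm)          # only ONE supernet is evaluated per round
        reward = 1.0 - cost
        estimate = reward / probs[arm]  # importance-weighted reward estimate
        weights[arm] *= math.exp(gamma * estimate / num_supernets)
        chosen.append(arm)
    return chosen, weights


def toy_cost(t, arm):
    """Hypothetical stand-in for the objective value: supernet 0 is consistently better."""
    return 0.1 if arm == 0 else 0.9


chosen, weights = exp3_assign(toy_cost, num_supernets=3, num_rounds=500)
print(chosen.count(0), weights)
```

Each round costs a single objective evaluation instead of W, which is the runtime saving discussed above; the trade-off is that the player only learns about the arm it actually pulled.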

B. The synthetic dataset classification task
The outline of this section is as follows. In Supplementary B 1, we first introduce some basic terminologies in machine learning to make our description self-contained. In Supplementary B 2, we explain how to construct the synthetic dataset D. In Supplementary B 3, we provide the simulation results omitted in the main text and elaborate on the fierce competition phenomenon. Last, in Supplementary B 4, we compare the learning performance of the quantum classifier with the hardware-efficient ansatz and the ansatz searched by QAS under the noise model extracted from a real quantum device, i.e., IBM's 5-qubit quantum machine named 'Ibmq_lima'.

Basic terminologies in machine learning
When we apply QAS to accomplish the classification task, the terminology 'epoch', which is broadly used in the field of machine learning [53], is employed to replace 'iteration'. Intuitively, an epoch means that an entire dataset is passed forward through the quantum learning model. For the quantum kernel classifier used in the main text, each training example in D tr is fed into the quantum circuit in sequence to acquire the predicted label. Since D tr includes in total 100 examples, it will take 100 iterations to complete one epoch.
In the synthetic classification task, we split the dataset into three parts, i.e., the training, validation, and test datasets, following the convention of machine learning [53]. The training dataset $D_{tr}$ is used to optimize the trainable parameters during the learning process. The validation dataset $D_{va}$ is used to estimate how well the classifier has been trained. Over $T$ epochs, the trainable parameters that achieve the highest validation accuracy are set as the output parameters. Mathematically, the output parameters satisfy
$$\hat{\theta} = \arg\max_{\theta \in \{\theta^{(t)}\}_{t=1}^{T}} \sum_{i} \mathbb{1}_{\tilde{y}^{(i)} = y^{(i)}},$$
where $\{x^{(i)}, y^{(i)}\} \in D_{va}$, $\tilde{y}^{(i)}$ is the prediction of the classifier given $\theta^{(t)}$ and $x^{(i)}$, and $\mathbb{1}_z$ is the indicator function that takes the value 1 if the condition $z$ is satisfied and zero otherwise. Finally, the output parameters $\hat{\theta}$ are applied to the test dataset to benchmark the performance of the trained classifier.
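The selection rule for the output parameters can be sketched as follows. The predictions below are made-up placeholders for the per-epoch outputs of a classifier on the validation set, and the helper names are hypothetical; ties are resolved in favour of the earliest epoch, which is an assumption since the text does not specify a tie-breaking rule.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    return sum(int(p == y) for p, y in zip(predictions, labels)) / len(labels)


def select_output_epoch(val_predictions_per_epoch, val_labels):
    """Return (epoch index, accuracy) of the epoch with the highest validation accuracy."""
    accs = [accuracy(p, val_labels) for p in val_predictions_per_epoch]
    best = max(range(len(accs)), key=lambda t: accs[t])  # earliest epoch wins ties
    return best, accs[best]


# Placeholder validation labels and per-epoch predictions (illustrative only).
val_labels = [0, 1, 1, 0]
preds = [
    [0, 0, 0, 0],  # epoch 0: accuracy 0.50
    [0, 1, 1, 1],  # epoch 1: accuracy 0.75
    [0, 1, 1, 0],  # epoch 2: accuracy 1.00
    [0, 1, 1, 0],  # epoch 3: accuracy 1.00
]
best_epoch, best_acc = select_output_epoch(preds, val_labels)
print(best_epoch, best_acc)
```

The parameters stored at `best_epoch` would then be evaluated once on the held-out test set.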

Implementation of the synthetic dataset
Here we recap the method to construct the synthetic dataset proposed in [11]. Denote the encoding layer as $U_x$. To establish the synthetic dataset D used in the main text, we first generate a set of data points $\{x^{(i)}\}$ with $x^{(i)} \in \mathbb{R}^3$. We then define the optimal circuit as $U^*(\theta^*) = \prod_{l=1}^{3} U^*_l(\theta^*_l)$, where $U^*_l(\theta^*_l) = \otimes_{j=1}^{3} R_Y(\theta^*_{l,j})(\mathrm{CNOT} \otimes I_2)(I_2 \otimes \mathrm{CNOT})$ and the parameter $\theta^*_{l,j}$ is uniformly sampled from $[0, 2\pi)$ for all $j \in [3]$ and $l \in [3]$. The strategy to label $x^{(i)}$ is as follows. Let $\Pi = I_4 \otimes |0\rangle\langle 0|$ be the measurement operator. The data point $x^{(i)}$ is labeled according to the result of measuring $\Pi$ on the state prepared by the encoding layer followed by the optimal circuit, where the two labels correspond to measured results of at most 0.25 and at least 0.75. Note that if the measured result is in the range (0.25, 0.75), we drop this data point and sample a new one. By repeating the above procedure, we can build the synthetic dataset D.
Here we first introduce how to use the quantum kernel classifier to conduct the prediction. Given the data point $x^{(i)} \in D$ at the $t$-th epoch, the quantum kernel classifier is composed of two unitaries, i.e., $U_{x^{(i)}}$ and $U(\theta^{(t)})$, where the sequence of quantum gates in $U(\theta^{(t)})$ is fixed as shown in Figure 2 (b). The output of the quantum kernel classifier is denoted by $\tilde{y}(x^{(i)}, \theta^{(t)})$, and the predicted label of $x^{(i)}$ is $g(\tilde{y}(x^{(i)}, \theta^{(t)}))$. When QAS is employed to enhance the trainability and to mitigate the errors of the quantum kernel classifier, the arrangement of quantum gates in $U(\theta)$ is no longer fixed and depends on the sampled ansatz. In other words, at the $t$-th epoch, given the data point $x^{(i)} \in D$, the measured result $\tilde{y}(A, x^{(i)}, \theta^{(t)})$ in Eqn. (5) is obtained with the trainable unitary $U(\theta^{(t)}, a)$, which denotes that the trainable unitary amounts to the ansatz $a$ and the corresponding trainable parameters $\theta^{(t)}$ are controlled by the supernet A.
We then provide the simulation results of the conventional quantum kernel classifier and QAS on the synthetic dataset D under the noiseless setting. As exhibited in Figure 5 (a), both the training and validation accuracies of the conventional quantum kernel classifier quickly converge to 100% after 80 epochs. The test accuracy also reaches 100%, highlighted by the green marker. Meanwhile, the loss L decreases to 0.24. These results indicate that the conventional quantum kernel classifier with the protocol depicted in Figure 2 (b) can learn the synthetic dataset D well.
The hyper-parameters of QAS under the noiseless setting are identical to those of the noisy setting introduced in the main text. Specifically, we set T = 400 and W = 1 in the training stage (Step 2), K = 500 in the ranking stage (Step 3), and T = 10 in the retraining stage (Step 4). Figure 5 (b) shows the output ansatz in Step 3. Compared to the conventional quantum kernel classifier, the output ansatz includes fewer CNOT gates, which makes it more amenable to physical implementation. Figure 5 (c) illustrates the learning performance of the output ansatz in the retraining stage. Concretely, both the training and test accuracies converge to 100% after one epoch. These results indicate that QAS can learn the synthetic dataset D well under the noiseless setting. Note that for all simulation results related to classification tasks, the Adam optimizer [53] is used to update the trainable parameters of the quantum kernel classifier and QAS, with the learning rate set as 0.05.
We end this subsection by explaining the fierce competition phenomenon encountered in the optimization of QAS. Namely, when the number of supernets is 1, some ansatze that achieve high classification accuracies when trained independently perform poorly in QAS. To show that QAS indeed finds a set of ansatze (quantum circuit architectures) with high classification accuracies, we examine the correlation between the performance of an ansatz trained independently and its performance when trained by QAS. In particular, we randomly sample 500 ansatze from all possible architectures and evaluate the widely used Spearman and Kendall tau rank correlation coefficients [61,62], which lie in the range [−1, 1]. Larger correlation coefficients (or equivalently, stronger correlations) indicate that the ranking distribution achieved by QAS is consistent with the performance of different circuit architectures trained independently. Moreover, larger correlation coefficients also imply that the output ansatz of QAS can well estimate the target ansatz a* in Eqn. (3).
The Spearman rank correlation coefficient $\rho_S$ quantifies the monotonic relationship between random variables r and s. Specifically, the Spearman rank correlation coefficient between r and s is defined as
$$\rho_S = \frac{\mathrm{cov}(r, s)}{\sigma_r \sigma_s},$$
where cov(·, ·) is the covariance of two variables, and $\sigma_r$ ($\sigma_s$) refers to the standard deviation of r (s). Suppose that $r \in \mathbb{R}^n$ and $s \in \mathbb{R}^n$ are two observation vectors of r and s, respectively. With the entries converted to ranks, the explicit form of $\rho_S$ is
$$\rho_S = 1 - \frac{6\sum_{i=1}^{n}(r_i - s_i)^2}{n(n^2 - 1)}.$$
When the Spearman rank correlation is employed in QAS, the observation vector r corresponds to the validation accuracy achieved by the sampled 500 ansatze in the ranking stage, while the observation vector s corresponds to the validation accuracy achieved by the same 500 ansatze when trained independently.
The Kendall tau rank correlation coefficient concerns the relative difference between concordant pairs and discordant pairs. Specifically, in QAS, denote r (s) as the observation vector that refers to the validation accuracy achieved by the sampled 500 ansatze in the ranking stage (when trained independently). Given any pair $(r_i, r_j)$ and $(s_i, s_j)$, it is said to be concordant if $(r_i > r_j) \wedge (s_i > s_j)$ or $(r_i < r_j) \wedge (s_i < s_j)$; otherwise, it is discordant. According to the above definition, the explicit form of the Kendall tau rank correlation coefficient is
$$\rho_K = \frac{2}{n(n-1)} \sum_{i<j} \mathrm{sign}(r_i - r_j)\,\mathrm{sign}(s_i - s_j),$$
where sign(·) represents the sign function.
Table I summarizes the correlation coefficients with n = 500. Specifically, when the number of supernets is 1, we have $\rho_K = 0.113$, which implies that the correlation between r and s is very low. By contrast, when the number of supernets is increased to 5 and 10, the correlation coefficients $\rho_S$ and $\rho_K$ are dramatically enhanced, reaching 0.723 and 0.536, respectively. Moreover, when the number of supernets is W = 10 and the number of iterations is increased to T = 1000, the correlation coefficients $\rho_S$ and $\rho_K$ can be further improved to 0.774 and 0.591, respectively.
These results indicate that the competition phenomenon in QAS can be alleviated by introducing more supernets and increasing the number of training iterations. In doing so, the performance of ansatze evaluated by QAS accords well with their real performance when trained independently.
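Both rank correlation coefficients are easy to compute directly. The sketch below is a plain-Python illustration, assuming the Spearman coefficient is taken as the Pearson correlation of average ranks and the Kendall coefficient as the sign-based tau without a tie correction; the two accuracy vectors are made-up numbers, not data from the experiments.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def spearman(r, s):
    """Pearson correlation of the rank vectors of r and s."""
    rr, rs = ranks(r), ranks(s)
    n = len(r)
    mr, ms = sum(rr) / n, sum(rs) / n
    cov = sum((a - mr) * (b - ms) for a, b in zip(rr, rs))
    var_r = sum((a - mr) ** 2 for a in rr)
    var_s = sum((b - ms) ** 2 for b in rs)
    return cov / (var_r * var_s) ** 0.5


def sign(x):
    return (x > 0) - (x < 0)


def kendall_tau(r, s):
    """(concordant pairs - discordant pairs) divided by the number of pairs."""
    n = len(r)
    total = sum(
        sign(r[i] - r[j]) * sign(s[i] - s[j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    return 2 * total / (n * (n - 1))


# Made-up validation accuracies: ranking stage (r) vs. independent training (s).
r = [0.90, 0.75, 0.60, 0.85, 0.50]
s = [0.95, 0.65, 0.70, 0.80, 0.55]
rho_s, rho_k = spearman(r, s), kendall_tau(r, s)
print(rho_s, rho_k)
```

In this toy example a single swapped pair lowers both coefficients below 1, and fully reversing one ranking would drive them to −1.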

The performance of QAS towards the noise model extracted from the real quantum devices
We evaluate the classification accuracy of the quantum classifier equipped with the hardware-efficient ansatz and the ansatz searched by QAS under the noise model extracted from a real quantum device, i.e., IBM's 5-qubit quantum machine named 'Ibmq_lima'. The qubit connectivity of the deployed quantum machine is illustrated in Figure 6 and its system parameters are summarized in Figure 7.
The implementation details are as follows. The construction of the synthetic dataset is identical to that introduced in Supplementary B 2, except for setting the feature dimension as 5 instead of 3. For the quantum classifier with the hardware-efficient ansatz, the number of layers and the number of epochs are set as L = 3 and T = 400, respectively. The hardware-efficient ansatz used in the baseline experiment takes the form $U(\theta) = \prod_{l=1}^{L} U_l(\theta)$. We consider two settings to examine the learning performance of QAS. In the first setting, we set the number of supernets as W = 1 and the number of epochs in the optimization stage as T = 10. In the second setting, we set the number of supernets as W = 5 and the number of epochs as T = 400. For both settings, the number of layers is set as L = 3 and the number of sampled ansatze at the ranking stage is K = 500.
The simulation results are exhibited in Figure 8. For the quantum classifier with the hardware-efficient ansatz, the achieved test accuracy is 68%. We utilize this test accuracy as the baseline to quantify the learning performance of QAS. As shown in the left panel of Figure 8, for the first setting (i.e., T = 5 and W = 1), in total 19 out of the K = 500 ansatze achieve an accuracy above the baseline. When we increase the number of epochs and the number of supernets to T = 400 and W = 5 (i.e., the second setting), in total 58 out of the K = 500 ansatze surpass the baseline. Meanwhile, the average performance over the sampled K = 500 ansatze is better than in the first setting. As shown in the middle panel of Figure 8, when we retrain the searched ansatz in the second setting (depicted in the right panel of Figure 8) for 8 epochs, the test accuracy improves to 81%. These observations validate the effectiveness of QAS in enhancing the learning performance of VQAs on classification tasks. Moreover, increasing the number of supernets W and the number of epochs T helps improve the capability of QAS.

C. Experimental Details of the ground state energy estimation
In this section, we first briefly recap the ground state energy estimation task in Supplementary C 1. In Supplementary C 2, we compare the performance of QAS and conventional VQE towards the ground state energy estimation task when they are implemented on real quantum hardware.

The ground state energy estimation
A central application of VQAs is solving the electronic structure problem, i.e., finding the ground state energies of chemical systems described by Hamiltonians. Note that chemical Hamiltonians in the second quantized basis set approach can always be mapped to a linear combination of products of local Pauli operators [51]. In particular, the molecular hydrogen Hamiltonian $H_h$ in Eqn. (6) can be written explicitly in this form, referred to as Eqn. (C1). The goal of the variational quantum eigensolver (VQE) is generating a parameterized wave-function $|\Psi(\theta)\rangle$ to achieve $\min_{\theta} \langle\Psi(\theta)|H_h|\Psi(\theta)\rangle$. The linear property of $H_h$ in Eqn. (C1) implies that the value $\langle\Psi(\theta)|H_h|\Psi(\theta)\rangle$ can be obtained by iteratively measuring $|\Psi(\theta)\rangle$ using the Pauli operators in $H_h$, e.g., $\langle\Psi(\theta)|I_8 \otimes Z_0|\Psi(\theta)\rangle$ and $\langle\Psi(\theta)|X_0 Y_1 Y_2 X_3|\Psi(\theta)\rangle$. The lowest energy of $H_h$ equals $E_m = −1.136$ Ha, where 'Ha' is the abbreviation of Hartree, i.e., a unit of energy used in molecular orbital calculations with 1 Ha = 627.5 kcal/mol. The exact value of $E_m$ is acquired from a full configuration-interaction calculation [51]. We note that the quantum natural gradient optimizer [8], which can accelerate the convergence rate, is employed to optimize the trainable parameters for both VQE and QAS, where the learning rate is set as 0.2.

The performance of QAS on real quantum devices
Here we carry out QAS and the conventional VQE on IBM's 5-qubit quantum machine, i.e., 'Ibmq_ourense', to accomplish the ground state energy estimation of $H_h$. The qubit connectivity of 'Ibmq_ourense' is illustrated in Figure 9, and the system parameters of these five qubits are summarized in Figure 10.
The implementation details are as follows. The hyper-parameters of QAS are L = 3, W = 10, K = 500, and T = 500. To examine the compatibility of QAS, we restrict its search space to be consistent with the qubit connectivity of 'Ibmq_ourense', i.e., the single-qubit gates are sampled from $R_Y$ and $R_Z$, and CNOT gates can only be applied to the qubit pairs (0, 1), (1, 0), (1, 2), (2, 1), (1, 3), and (3, 1), based on Figure 10. We call this setting QAS with the real connectivity (QAS-RC). Under such a setting, the number of all possible circuit architectures for QAS-RC is $1024^3$. The hyper-parameter settings for VQE are L = 3 and T = 500. The heuristic circuit architecture used in VQE is identical to the case introduced in the main text (Figure 3 (a)). In the training process, we optimize VQE and QAS on classical computers under a noisy environment provided by the Qiskit package, which can approximately simulate the quantum gate errors and readout errors of 'Ibmq_ourense'. We move the training stage to classical numerical simulators because training VQE and QAS on 'Ibmq_ourense' would take an unaffordable runtime, due to the fair share run mode [49].
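One way to arrive at the quoted count of architectures is sketched below. The decomposition is an inference rather than something the text spells out: it assumes that the hydrogen Hamiltonian acts on four qubits, that each qubit independently receives either R_Y or R_Z in every layer, and that each of the six allowed CNOT pairs is independently switched on or off in every layer.

```python
# Assumed decomposition of the QAS-RC search space (an inference, see lead-in).
num_qubits = 4                 # qubits 0..3, matching the Pauli term X0 Y1 Y2 X3
single_qubit_choices = 2       # R_Y or R_Z on every qubit
cnot_pairs = [(0, 1), (1, 0), (1, 2), (2, 1), (1, 3), (3, 1)]
num_layers = 3                 # L = 3

# Each layer: a gate choice per qubit times an on/off choice per allowed CNOT pair.
per_layer = single_qubit_choices ** num_qubits * 2 ** len(cnot_pairs)
total_architectures = per_layer ** num_layers
print(per_layer, total_architectures)
```

Under these assumptions each layer has 1024 possible arrangements, which reproduces the stated total for three layers.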
The training performance of VQE and QAS-RC is demonstrated in Figure 11. In particular, as shown in the left panel, the ground energy estimated by VQE is around −1.02 Ha after 30 iterations, highlighted by the dark blue line. The performance of QAS-RC is shown in the right panel. Concretely, when we retrain the output ansatz for 15 iterations, its estimated energy slightly oscillates around −1.04 Ha, highlighted by the green solid line. When we implement the optimized VQE and the optimized output ansatz of QAS-RC on the real quantum device, i.e., 'Ibmq_ourense', their performances differ. Specifically, as demonstrated in the left panel of Figure 11, the ground energy estimated by VQE is −0.61 Ha (highlighted by the blue marker), while the ground energy estimated by QAS-RC is −0.963 Ha (highlighted by the green marker). Compared with VQE, the estimated result of QAS-RC is much closer to the exact result.
We utilize the following formula to quantify the relative deviation between the simulation and experiment results. Denote the estimated energy obtained by the numerical simulation as $E_s$ and the test energy achieved by 'Ibmq_ourense' as $E_t$. The relative deviation follows
$$\frac{|E_s - E_t|}{|E_m|},$$
where $E_m = −1.136$ Ha is the exact result. Following this formula, the relative deviation for VQE and QAS-RC is 36.1% and 6.8%, respectively. Compared with the heuristic circuit architecture used in VQE, QAS that accounts for the real qubit connectivity can dramatically reduce the relative deviation. The above results not only indicate the compatibility of QAS, but also demonstrate that QAS can adapt well to the weighted gate noise and achieve a high performance on quantum chemistry tasks.
Finally, we compare the output ansatz of QAS-RC with the heuristic circuit architecture used in VQE. The simulation results of QAS-RC in the ranking stage are summarized in Figure 12. In particular, the left panel (a) exhibits the ranking distribution of QAS-RC, where the estimated ground energy of most ansatze concentrates on [−0.6 Ha, −0.4 Ha]. Figure 12 (b) shows the output ansatz of QAS-RC, where the corresponding circuit implementation on 'Ibmq_ourense' is exhibited in Figure 12 (c). Compared with the heuristic circuit architecture used in VQE (Figure 3 (a)), the output ansatz of QAS-RC contains fewer CNOT gates. This implies that QAS-RC has the ability to appropriately reduce the number of two-qubit gates to avoid introducing too much error, while the expressive power of the trainable circuit $U(\theta)$ can be well preserved. In other words, QAS can adapt to the weighted gate noise to seek the best circuit architecture.
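The reported relative deviations can be reproduced from the quoted energies. The sketch below assumes the deviation is the absolute gap between the simulated and the hardware energies divided by the magnitude of the exact energy; this form is inferred because it matches the two reported percentages, and the function name is hypothetical.

```python
def relative_deviation(e_sim, e_test, e_exact=-1.136):
    """|E_s - E_t| / |E_m|, with all energies in Hartree."""
    return abs(e_sim - e_test) / abs(e_exact)


# Energies quoted in the text (simulation vs. 'Ibmq_ourense').
vqe = relative_deviation(e_sim=-1.02, e_test=-0.61)
qas_rc = relative_deviation(e_sim=-1.04, e_test=-0.963)
print(f"VQE: {100 * vqe:.1f}%, QAS-RC: {100 * qas_rc:.1f}%")
```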

D. Improving the ranking stage of QAS
Recall that the ranking stage of QAS, i.e., Step 3 of Figure 1, uniformly samples K ansatze from the supernet A. The aim of this step is selecting the one with the best performance among the sampled ansatze. However, the uniform sampling method implies that the sampled ansatze may come from $S_{bad}$ with a high probability when $|S_{bad}| > |S_{good}|$. It is therefore highly desirable to devise more effective sampling methods.
Here we utilize an evolutionary algorithm, i.e., the nondominated sorting genetic algorithm II (NSGA-II) [63], to facilitate the ansatze ranking problem. The intuition behind employing NSGA-II is actively searching for potential ansatze with good performance instead of uniformly sampling ansatze from all possible circuit architectures. Note that several recent studies, e.g., Refs. [31,64], have directly utilized evolutionary and multi-objective genetic algorithms to complete ansatz design.
We apply QAS with the evolutionary algorithm to tackle the ground state energy estimation problem described in the main text. Note that all hyper-parameter settings are identical to the uniform sampling case, except for the settings related to the evolutionary algorithm. Particularly, we set the population size as $N_{pop} = 50$ and the number of generations as $G_T = 20$. The simulation results under the noiseless setting are shown in Figure 13. In particular, QAS assisted by NSGA-II searches in total 943 ansatze, and the estimated energy of 143 ansatze (15.2%) lies in the range from −1 Ha to −1.2 Ha. By contrast, QAS with the uniform sampling strategy only finds 3 ansatze out of 500 (0.6%) in the same range. This result empirically confirms that evolutionary algorithms can advance the performance of QAS. We remark that other advanced machine learning techniques such as reinforcement learning [65] can also be exploited to benefit the performance of QAS.

E. An empirical exploration for the trainability of QAS
Here we empirically investigate the trainability of QAS through the lens of barren plateaus [22,39,40,66]. Recall that the main conclusion of barren plateaus is that the gradient vanishes exponentially in the qubit count $N$. Mathematically, the expectation of the gradient norm of the objective function in Eqn. (1) tends to zero and the corresponding variance quickly converges to zero with respect to $N$, i.e., $\mathrm{Var}_{\theta}(\|\nabla_{\theta} L(\theta)\|) \sim O(e^{-LN})$. In this regard, barren plateaus can be utilized as a measure to quantify the trainability of quantum algorithms. That is, when an algorithm is less affected by barren plateaus, it could possess a better trainability.
Following the above explanations, we conduct the following numerical simulations to demonstrate the alleviation of barren plateaus in QAS. In particular, we compare the variance of the gradient norm, i.e., $\mathrm{Var}_{\theta}(\|\nabla_{\theta} L(\theta)\|)$, with respect to the hardware-efficient ansatz and the ansatze pool implied by QAS. The mathematical expression of the objective function is
$$L(\theta) = \mathrm{Tr}\left(H U(\theta) \rho U(\theta)^{\dagger}\right),$$
where the observable $H$ equals $I_{2^{N-1}} \otimes |0\rangle\langle 0|$, the input state is $\rho = (|0\rangle\langle 0|)^{\otimes N}$, and $U(\theta)$ corresponds to the hardware-efficient ansatz or the ansatz explored in QAS. For the hardware-efficient ansatz, we set the layer number as L = 3, i.e., $U(\theta) = \prod_{l=1}^{L} U_l(\theta)$, and the implementation of $U_l(\theta)$ is shown in Figure 14 (a). The calculation of $\mathrm{Var}_{\theta}(\|\nabla_{\theta} L(\theta)\|)$ is completed by randomly sampling θ from a uniform distribution 2000 times. For QAS, the ansatze pool S is constructed by tailoring the hardware-efficient ansatz $U(\theta)$ introduced above. As shown in Figure 14 (b), for each $U_l$ with $l \in [L]$, there are two choices of the single-qubit gates (i.e., $R_Y$ and $R_Z$) and two choices of the two-qubit gates (i.e., CNOT and an identity operation). The calculation of $\mathrm{Var}_{\theta}(\|\nabla_{\theta} L(\theta)\|)$ is completed by sampling 2000 different ansatze and sampling one random θ from a uniform distribution for each ansatz. The number of qubits $N$ ranges from 2 to 10.
The simulation results under the noiseless setting are shown in Figure 14 (c). For the hardware-efficient ansatz, the variance of the gradient norm decreases continuously as $N$ increases. This result can be treated as evidence of barren plateaus. By contrast, for the ansatze pool explored by QAS, the variance of the gradient norm is almost the same for N = 4, 6, 8, 10. Meanwhile, for the same $N$, the variance of the gradient norm corresponding to the ansatze pool explored by QAS is always higher than that of the hardware-efficient ansatz. Recall that Ref. [22] states that the variance of gradients decreases continuously as $N$ and $L$ increase, which induces the barren plateau phenomenon for sufficiently large $N$ and $L$. Nevertheless, according to the simulation results in Figure 14 (c), QAS does not obey such a tendency. These observations imply the potential of QAS to alleviate the influence of barren plateaus.
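A heavily simplified version of this experiment can be run with a few lines of plain Python. The sketch below is not the simulation behind Figure 14: it assumes layers of R_Y rotations followed by a chain of CNOT gates, models the pool only through a random CNOT-or-identity choice per layer (the R_Z option is omitted so that the state stays real), uses 100 samples instead of 2000, and only covers N = 2 and N = 4. All function names are hypothetical.

```python
import math
import random


def apply_ry(state, qubit, theta):
    """Apply R_Y(theta) to one qubit of a real state vector, in place."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    bit = 1 << qubit
    for i in range(len(state)):
        if not i & bit:
            a, b = state[i], state[i | bit]
            state[i], state[i | bit] = c * a - s * b, s * a + c * b


def apply_cnot(state, control, target):
    """Apply CNOT in place by swapping amplitudes whose control bit is 1."""
    cbit, tbit = 1 << control, 1 << target
    for i in range(len(state)):
        if i & cbit and not i & tbit:
            state[i], state[i | tbit] = state[i | tbit], state[i]


def cost(thetas, n_qubits, entangle):
    """L(theta): probability that the last qubit is measured in |0>."""
    state = [0.0] * (1 << n_qubits)
    state[0] = 1.0
    k = 0
    for use_cnot in entangle:  # one entry per layer
        for q in range(n_qubits):
            apply_ry(state, q, thetas[k])
            k += 1
        if use_cnot:
            for q in range(n_qubits - 1):
                apply_cnot(state, q, q + 1)
    top = 1 << (n_qubits - 1)
    return sum(a * a for i, a in enumerate(state) if not i & top)


def grad_norm(thetas, n_qubits, entangle):
    """Norm of the gradient obtained with the parameter shift rule (shift pi/2)."""
    grads = []
    for k in range(len(thetas)):
        plus, minus = list(thetas), list(thetas)
        plus[k] += math.pi / 2
        minus[k] -= math.pi / 2
        grads.append(0.5 * (cost(plus, n_qubits, entangle) - cost(minus, n_qubits, entangle)))
    return math.sqrt(sum(g * g for g in grads))


def grad_norm_variance(n_qubits, n_layers, n_samples, sample_pool, seed=0):
    """Variance of the gradient norm over random parameters (and random layers if sample_pool)."""
    rng = random.Random(seed)
    norms = []
    for _ in range(n_samples):
        thetas = [rng.uniform(0, 2 * math.pi) for _ in range(n_qubits * n_layers)]
        if sample_pool:   # random CNOT-or-identity choice per layer
            entangle = [rng.random() < 0.5 for _ in range(n_layers)]
        else:             # fixed ansatz with CNOTs in every layer
            entangle = [True] * n_layers
        norms.append(grad_norm(thetas, n_qubits, entangle))
    mean = sum(norms) / len(norms)
    return sum((x - mean) ** 2 for x in norms) / len(norms)


results = {
    n: (grad_norm_variance(n, 3, 100, False), grad_norm_variance(n, 3, 100, True))
    for n in (2, 4)
}
print(results)  # {N: (fixed ansatz, sampled pool)}
```

With so few samples and qubits the printed numbers are only indicative; the point is the procedure, i.e., sampling parameters (and architectures), evaluating the gradient norm with the parameter shift rule, and taking the variance.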

F. Progressive QAS for solving large-scale problems
The proposed QAS introduced in the main text is only a prototype towards automatically seeking a good ansatz instead of relying on handcrafted design. Namely, even though QAS utilizes the weight sharing strategy to reduce the parameter space to $O(dLQ^N)$, there still exists an exponential dependence on $N$. This exponential relation implies that in certain cases, the ansatz searched by QAS within a reasonable runtime may not estimate the optimal ansatz a* well when $N$ becomes large. In this section, we devise a variant of QAS, termed progressive QAS (Pro-QAS), to dramatically improve the learning performance of QAS for large-scale problems.

Algorithmic implementation of Pro-QAS
The key concept behind Pro-QAS is narrowing the size of the ansatze pool to ensure its performance. Different from QAS, which directly samples an ansatz from S to conduct optimization, Pro-QAS seeks the targeted ansatz in a progressive way. Namely, given a hardware-efficient ansatz $U(\theta)$ in Eqn. (2), Pro-QAS first freezes the gate arrangement of $U_l(\theta)$ with $l \neq 1$ and searches for the best gate arrangement of $U_{l=1}(\theta)$. Such a searching process is the same as Step 2 (optimization) and Step 3 (ranking K ansatze) in the original QAS. Once the search is completed, Pro-QAS begins to optimize the gate arrangement at the second layer $U_{l=2}(\theta)$ and freezes the remaining $L − 1$ layers. After progressively optimizing the gate arrangement of all $L$ layers, the established ansatz $a^{(T)}$ and its corresponding parameters $\theta^{(T)}$ are used to approximate the optimal result $(a^*, \theta^*)$ in Eqn. (1). Notably, similar ideas of progressively constructing and optimizing ansatze have been exploited in Refs. [35, 67-69]. Note that Ref. [70] observed an abrupt transition phenomenon for the progressive strategy. That is, when the cost function has the identity extrema and the number of layers is less than a critical value, the layer-wise training strategy could lead to an unfavorable performance. These results can be employed as guidance to improve the learning performance of Pro-QAS. For instance, the cost function adopted in Pro-QAS should be carefully designed to avoid the identity extrema.
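The layer-by-layer control flow can be sketched on a toy problem. In the example below, the per-layer options, the hidden target architecture, and the score function are all hypothetical stand-ins: in Pro-QAS the score of a candidate would come from the optimization and ranking steps of QAS, not from a closed-form function.

```python
from itertools import product


def progressive_search(num_layers, layer_options, score):
    """Optimize one layer at a time while the other layers stay frozen (lower score is better)."""
    arch = [layer_options[0]] * num_layers  # arbitrary initial arrangement
    for layer in range(num_layers):
        candidates = [arch[:layer] + [opt] + arch[layer + 1:] for opt in layer_options]
        arch = min(candidates, key=score)
    return arch


# Per-layer options for 2 qubits: a gate per qubit plus a CNOT on/off flag (8 options).
layer_options = [
    gates + (cnot,)
    for gates in product(("RY", "RZ"), repeat=2)
    for cnot in (True, False)
]

# Hypothetical target architecture and toy score: number of mismatching layers.
target = [("RZ", "RY", True), ("RY", "RY", False), ("RZ", "RZ", True)]
calls = {"n": 0}


def toy_score(arch):
    calls["n"] += 1
    return sum(a != b for a, b in zip(arch, target))


found = progressive_search(3, layer_options, toy_score)
print(found, calls["n"], len(layer_options) ** 3)
```

The toy score is separable across layers, so the progressive search recovers the target after evaluating 24 candidates instead of all 512; for a non-separable objective the greedy layer-wise choice is only a heuristic, which is in line with the caveat about layer-wise training noted above.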
We then analyze the runtime complexity and the reduced search space of Pro-QAS. Compared with the original QAS, the only difference is that Pro-QAS involves an extra outer loop to progressively optimize the $L$ layers. Hence, in conjunction with the runtime complexity of QAS derived in Method, we conclude that the execution of Pro-QAS takes at most $O(dQNL)$ memory and $O(QNL^2)$ runtime. As for the size of the search space, the progressive search strategy decreases the size of the ansatze pool to $O(LQ^N)$, which is exponentially smaller than that of QAS in terms of $L$. Remarkably, this space can be further reduced by progressively searching the gate arrangement of each layer along the qubit index. In this way, the search space of possible ansatze scales as $O(QNL)$, at the price of a runtime cost that increases linearly with $N$.
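To make the scalings concrete, the following sketch evaluates the three pool sizes for a small instance. It assumes that the unrestricted QAS pool contains $Q^{NL}$ members (each of the $N \cdot L$ gate slots takes one of $Q$ types) and drops all constant factors hidden by the big-O notation. The function name and dictionary keys are hypothetical.

```python
def search_space_sizes(num_gate_types, num_qubits, num_layers):
    """Leading-order pool sizes; constants hidden by big-O are ignored."""
    q, n, l = num_gate_types, num_qubits, num_layers
    return {
        "qas": q ** (n * l),     # assumed unrestricted pool: Q^(N*L)
        "layerwise": l * q ** n, # Pro-QAS, one layer at a time: L*Q^N
        "qubitwise": q * n * l,  # layer- and qubit-wise search: Q*N*L
    }


sizes = search_space_sizes(num_gate_types=3, num_qubits=3, num_layers=2)
print(sizes)  # {'qas': 729, 'layerwise': 54, 'qubitwise': 18}
```

Even for this small instance with $Q = 3$, $N = 3$, and $L = 2$, the layer-wise pool is more than ten times smaller than the assumed unrestricted pool, and the gap widens exponentially as $L$ grows.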

Numerical simulation results of Pro-QAS
We conduct numerical simulations to demonstrate the capability of the proposed Pro-QAS on large-scale problems. In particular, we apply Pro-QAS to accomplish a binary classification task. The construction rule of the dataset $D = \{x^{(i)}, y^{(i)}\}_{i=1}^{300}$ mainly follows Supplementary B 2, where the only difference is that the feature dimension of the input example is increased from 3 to 7 and 10, respectively. In other words, the number of qubits used to load the input example $x^{(i)}$ is $N = 7$ (or $N = 10$), which is considerably larger than in the classification task discussed in the main text, where $N = 3$.
The hyper-parameter settings are as follows. For the case of $N = 7$, the number of supernets is set to $W = 1$ and the layer number is $L = 5$. The number of epochs in Step 2 is set to $T = 200$. The allowed types of quantum gates are $\{R_Y, R_Z, \mathrm{CNOT}\}$ with $Q = 3$, and the qubit connectivity follows the chain structure. This setting implies that the total number of ansatze without any operation is $|S| = 2^{40}$. The depolarization channel is employed to simulate the quantum system noise. The depolarization rates for the single-qubit and two-qubit gates are set to $p = 0.05$ and $p = 0.1$, respectively. The number of sampled ansatze at the ranking stage is $K = 128$. For the case of $N = 10$, all settings are the same as above, except that $L = 3$. In this case, the total number of ansatze is $|S| = 2^{33}$. The simulation results for the case of $N = 7$ are exhibited in Figure 15. As a reference, we employ the hardware-efficient ansatz with the identical layer number $L = 5$, as shown in the left subplot, to learn the same dataset $D$ [11].
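The text does not spell out how the pool sizes $2^{40}$ and $2^{33}$ are counted. One counting rule that reproduces both numbers, offered here purely as an assumption, is that every qubit in every layer carries either $R_Y$ or $R_Z$ (two choices per slot), and every layer either includes or omits the chain of CNOT gates (two choices per layer). The function name below is hypothetical.

```python
def pool_size_exponent(num_qubits, num_layers):
    """Exponent e such that |S| = 2**e under the assumed counting rule.

    Assumption (not stated in the text): N*L binary single-qubit choices
    (RY or RZ) plus L binary choices (CNOT chain on or off per layer).
    """
    return num_qubits * num_layers + num_layers


print(pool_size_exponent(7, 5))   # 40, i.e. |S| = 2**40 for N = 7, L = 5
print(pool_size_exponent(10, 3))  # 33, i.e. |S| = 2**33 for N = 10, L = 3
```

Under this reading, both reported pool sizes follow from $|S| = 2^{NL + L}$.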