Global optimization of quantum dynamics with AlphaZero deep exploration

While a large number of algorithms for optimizing quantum dynamics for different objectives have been developed, a common limitation is the reliance on good initial guesses, being either random or based on heuristics and intuitions. Here we implement a tabula rasa deep quantum exploration version of the DeepMind AlphaZero algorithm for systematically averting this limitation. AlphaZero employs a deep neural network in conjunction with deep lookahead in a guided tree search, which allows for predictive hidden-variable approximation of the quantum parameter landscape. To emphasize transferability, we apply and benchmark the algorithm on three classes of control problems using only a single common set of algorithmic hyperparameters. AlphaZero achieves substantial improvements in both the quality and quantity of good solution clusters compared to earlier methods. It is able to spontaneously learn unexpected hidden structure and global symmetry in the solutions, going beyond even human heuristics.

Recent progress on technologies with quantum speedup focuses largely on optimizing dynamical quantum cost functionals via a set of external classical parameters. Such research includes quantum variational eigensolvers [1], annealers [2], simulators [3,4], circuit optimization [5,6], optimal control theory [7-9], and Boltzmann machines [10]. The minimized functional could be, for example, the energy of a simulated system or the distance to a quantum computational gate.
A shared algorithmic feature is domain knowledge about where to search, such as near the Hartree-Fock Ansatz for variational eigensolvers, or in the analytical gradient direction. An open question in optimization research is how much this specialized approach can be supplanted by a problem-agnostic methodology: one which does not require expert knowledge, avoiding both the overhead in human labour [11] and the potential for local, suboptimal trapping [12-14]. In other words, an autonomous machine learning approach has the potential to plan its solutions both strategically and tactically.
It has been argued that, due to the inherent smoothness of unitary quantum physics [15], local exploitation of quantum dynamics can be sufficient for efficiently finding good solutions [16]. Local search has been especially successful in the well-established field of Quantum Optimal Control Theory (QOCT), enjoying a half century of continued progress in NMR [17], quantum chemistry [7,18], and spectroscopy [19]. This has culminated in Hessian extraction approaches [20] that generally outperform other local methods [21,22].
Yet, similar to classical NP-complete problems [23], quantum functionals can suffer a phase transition [24] from easier to "needle in a haystack" instances that require global exploration of parameters. Mounting evidence has shown that critically constrained dynamics lead to such complexity [11,24-26], especially as QOCT has veered into high-precision quantum computation [27], circuit compilation [28], and architecture design [29]. It is therefore crucial to balance resources for exploitation of smooth, local quantum landscapes with state-of-the-art classical methods for domain-agnostic exploration.
In the literature, dynamics optimization is characterized by a lookahead depth, i.e. how far into the future one plans current actions. A shallow depth may broaden exploration, a strategy typically found in Reinforcement Learning (RL) [30]. This has been powerfully combined with Deep Neural Networks (DNN) [31-35] and applied recently to quantum systems [36-42]. Unfortunately, single-step lookaheads are inherently local and thus require a slower learning rate, with no performance gain found over full-depth, domain-specialized (Hessian approximation) methods in QOCT. Other full-depth methods have also had mixed success, e.g. Genetic Algorithms [43,44] and Differential Evolution [25], but they typically require careful fine-tuning since they are based on ad hoc heuristics rather than being mathematically rooted.
A recent stunning breakthrough has been due to the AlphaZero class of algorithms [45-47]. AlphaZero has already effectively outclassed all adversaries in the games of Go, Chess, Shogi, and StarCraft. The key to the success of AlphaZero was the combination of a Monte Carlo tree search with a one-step lookahead DNN. As a result, the lookahead information from far down the tree dramatically increases the trained DNN precision, and together they compound to produce much more focused and heuristic-free exploration.
Here, we implement and benchmark a QOCT version of AlphaZero for optimizing quantum dynamics. We characterize improvements in learning and exploration compared to traditional methods. We find a crossover between difficult problems where AlphaZero learning alone is ideal and those where a combination of deep exploration and quantum-specialized smooth exploitation is optimal. We show this leads to a dramatic increase in both the quality and quantity of good solution clusters. Our AlphaZero implementation retains the tabula rasa character of Ref. [46] in two important respects. Firstly, it efficiently learns to solve three different optimization problem classes using the same algorithmic hyperparameters. Secondly, we demonstrate that AlphaZero is able to identify quantum-specific heuristics in the form of hidden symmetries without the need for expert knowledge.

A. Unified quantum exploration algorithm
In this work, we seek to obtain pulse sequences that can unitarily steer a quantum system towards given desired dynamics. For our purposes, we quantify this task through the state-averaged overlap fidelity F(U(t)) with respect to a target unitary $\hat{U}_\mathrm{target}$, Eq. (1). Here, U(t) denotes the time evolution operator of the system, which solves the Schrödinger equation. We fix for concreteness our physical architecture as superconducting circuit QED [48], being both a highly tunable and potentially scalable architecture, with potential near-term applications [49]. The system is chosen to be a resonator-coupled two-transmon system, as depicted in Fig. 1a.
Here the transmon qubits are mounted on either side of a linear resonator and we drive the first qubit with an external control Ω, which could be a piecewise constant pulse as depicted in the bottom of the figure. The system dynamics are governed by the Hamiltonian [50], where $b_j$ is the qubit-lowering operator for transmon j, and the external control Ω(t) is shaped by the optimization algorithm to maximize the fidelity (1), with $\hat{U}_\mathrm{target} = \sqrt{ZX}$ being a standard entangling gate. $\hat{U}_\mathrm{target}$ together with single-qubit gates forms a universal gate set, e.g. for quantum computation on a surface code circuit-QED layout. We fix the parameters to be within typical experimental values (see e.g. Refs. [51,52]) for the qubit-qubit coupling J/2π = 5 MHz and the detuning ∆/2π = 0.35 GHz.
We consider three optimization classes to test a unified AlphaZero algorithm and benchmark it against both domain-specialized and domain-agnostic algorithms. These three correspond to control parameters Ω(t) that are digital, i.e. taken from a discrete set of possibilities; that vary continuously as a function of continuous but highly filtered controls; and, lastly, piecewise constant controls, which is the standard approximation in QOCT.
Within the RL framework, an autonomous agent must interact with an environment that at each time step t inhabits a state $s_t$. Here we choose the unitary $\hat{U}(t)$ to represent this state. The agent then alters the unitary at each time step t by applying an action $a_t$ (here Ω(t)) that transforms the unitary $\hat{U}(t) \to \hat{U}(t + \Delta t)$. The purpose of the agent is to maximize an expected score z at final time T, which we choose to be the fidelity z = F(U(T)). This is done by implementing a probabilistic policy $\pi(s) = (\pi_{a_1}, \pi_{a_2}, \ldots)$, which maps states s to probabilities of applying actions, i.e. $\pi_a = \Pr(a|s)$. The agent attempts to improve the policy by gradually updating it with respect to the experience it gains.
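The agent-environment loop described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in this work: the function names, the toy single-qubit system in the test, and in particular the fidelity formula $|\mathrm{Tr}(\hat{U}_\mathrm{target}^\dagger U)/d|^2$ are assumptions made for the example.

```python
import numpy as np

def step_unitary(hamiltonian, dt):
    """exp(-i H dt) for a Hermitian H, via eigendecomposition."""
    evals, evecs = np.linalg.eigh(hamiltonian)
    return (evecs * np.exp(-1j * evals * dt)) @ evecs.conj().T

def fidelity(u, u_target):
    """Assumed overlap fidelity |Tr(U_target^dagger U) / d|^2."""
    d = u.shape[0]
    return abs(np.trace(u_target.conj().T @ u) / d) ** 2

def run_episode(policy, h_drift, h_ctrl, actions, n_steps, dt, u_target, rng=None):
    """Roll out one episode; the only non-zero reward is the final fidelity."""
    rng = np.random.default_rng(0) if rng is None else rng
    u = np.eye(h_drift.shape[0], dtype=complex)        # state s_0
    for t in range(n_steps):
        probs = policy(u, t)                            # pi(s): one probability per action
        a = rng.choice(len(actions), p=probs)           # draw the action a_t
        u = step_unitary(h_drift + actions[a] * h_ctrl, dt) @ u
    return fidelity(u, u_target)                        # score z = F(U(T))
```

Here `policy` is any callable returning a probability vector over the discrete action set; in AlphaZero it would be supplied by the tree search rather than by a fixed function.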
Fig. 1b, c illustrate the tree search and the neural network for AlphaZero, respectively. The upper output of the neural network approximates the present policy for a given input state, i.e. $p_a \sim \pi_a$. Meanwhile, the lower output provides a value function which estimates the expected final reward, that is $v(s_t) \sim F(T)$. Both functions use only information about the current state and suffer from being lower-dimensional approximations of extremely high-dimensional state and action spaces. The insight of the AlphaZero algorithm is to supplement the predictive power of the value function $v(s_t)$ with retrodictive information coming from future action decisions in a Monte Carlo search tree. The tree depicted in Fig. 1b consists of nodes, which represent states (here depicted as pulses), and edges, which are state-action pairs (depicted as lines). At each branch in the tree, the algorithm chooses actions based on a combination of those with the highest expected reward and the highest uncertainty, a measure of which edges remain unexplored. Whenever new states (called leaf nodes) are explored, the neural network is used to estimate the value of that node, and the information is propagated backward in the tree to the root node. The forward and backward traversals of the tree are described in greater detail in Methods.

[Figure caption, panel c: A comparison between the infidelities obtained by AlphaZero and the GA at 60 ns. For AlphaZero, each dot represents the infidelity obtained at the end of a unique episode, while for the GA each dot represents the highest-scoring member in the population after each iteration.]
In the manner described above, the predictive nature of the network is able to inform choices in the tree while the retrodictive information coming back in time is able to give better estimates of the state values already explored, which are then used to train the network. This reinforcing mechanism is thus able to globally learn about the parameter landscape by choosing the most promising branches while effectively culling the vast majority of the rest. The result is neither an exhaustive sampling at full depth, which would yield the true landscape albeit at a computationally untenable cost, nor an exhaustive sampling at shallow depth, which would require a prohibitively slow learning rate for information from the full depth of the tree to propagate back. Instead, AlphaZero intelligently balances the depth and the breadth of the search below each node. While the hidden-variable approximation given by the neural network and MC tree is certainly not exhaustive and cannot find solutions with exponentially small footprint, it is nonetheless able to discover patterns and learn an effective global policy strategy that produces robust, heterogeneous classes of promising solutions. In our implementation we restrict AlphaZero such that it can only find new unique solutions, which is done by cutting off branches in the tree that have previously been fully explored.
In what follows we apply the algorithm with unified hyperparameters to three optimization classes: discrete, continuous, and continuous with strong constraints. The three problem types accentuate different optimization strategies. In the discrete optimization case, we show how AlphaZero stands up against other domain-agnostic methods (where the gradient is not defined) and compare their abilities to learn structures in the parameters. For the constrained continuous pulses, we validate the hypothesis that the analytical gradient, while computable, is highly inefficient and indeed unable to find near-global solutions that are at least as good as those found by AlphaZero. Finally, in the continuous-valued piecewise-constant case, we show the balance between state-of-the-art physics-specialized and agnostic AlphaZero approaches. We show that the combination of exploration and exploitation is able to produce new clusters of high-quality solutions that are otherwise highly unlikely to be found, while learning hidden problem symmetry.

B. Digital gate sequences
As a first application with AlphaZero, we demonstrate optimal control using Single Flux Quantum (SFQ) pulses [44,53,54]. The aim is to control the quantum system by using a pulse train that consists of individual, very short pulses typically on the picosecond scale. This technology originated as a way of utilizing superconductors for large-scale, ultrafast, digital, classical computing [55]. At each time slice there either is a pulse or not, which implies that the unitary evolution is governed by two unitaries $\hat{U}_1$ and $\hat{U}_0$. Hence, the pulse train can be stored as a digital bit string with 0 and 1 denoting no pulse and a single pulse, respectively. SFQ devices are interesting candidates for quantum computation since they potentially allow for ultrafast gate operations as well as scalable quantum hardware [54]. We model the pulses as ∆t = 2.0 ps Gaussian functions, where τ = 0.25 ps and a = 2π/1000. The pulse is depicted to the right in Fig. 2a.
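Because each time slice applies one of only two fixed unitaries, the full evolution for a bit string is an ordered product of precomputed matrices. The sketch below illustrates this under toy assumptions: the function names are hypothetical, the no-pulse evolution is replaced by the identity, and each pulse by a small x-rotation, none of which is the transmon model used in this work.

```python
import numpy as np

def sequence_unitary(bits, u0, u1):
    """Compose the evolution for a bit string; later slices act from the left."""
    u = np.eye(u0.shape[0], dtype=complex)
    for b in bits:
        u = (u1 if int(b) else u0) @ u   # 1 = pulse, 0 = no pulse
    return u

def rx(theta):
    """Toy single-qubit rotation about x, standing in for one SFQ pulse."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

# Four pulses of pi/8 each accumulate to a pi/2 rotation in this toy model.
bits = "101101"
u = sequence_unitary(bits, np.eye(2, dtype=complex), rx(np.pi / 8))
```

Since $\hat{U}_0$ and $\hat{U}_1$ are computed once, evaluating a candidate string costs only one matrix multiplication per time slice.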
The optimization task is to find the input string that maximizes the fidelity functional (1). The current approach for this type of optimization is to apply a genetic algorithm (GA) [44,56,57]. Besides GA and AlphaZero, we also compare two conventional algorithms, Q-learning and stochastic descent (SD), as in Ref. [24]. Q-learning was one of the first RL algorithms developed, and was applied recently to quantum control [24,37]. It is a tabular-based algorithm that applies one-step updates in order to solve the optimal Bellman equation [58] (see Methods). SD is a time-local, greedy optimizer that changes the pulse at a randomly chosen time if this results in an increasing fidelity.
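The stochastic descent rule described above (flip the pulse at a random time, keep the flip only if the fidelity increases) can be sketched as follows. The function name, the generic `score` callable, and the fixed iteration budget are assumptions for illustration; in the setting of this paper `score` would return the fidelity of the bit string.

```python
import random

def stochastic_descent(score, n_bits, n_iters, rng=None):
    """Greedy bit-flip search: accept a flip only if the score increases."""
    rng = random.Random(0) if rng is None else rng
    bits = [rng.randint(0, 1) for _ in range(n_bits)]   # random initial string
    best = score(bits)
    for _ in range(n_iters):
        i = rng.randrange(n_bits)     # randomly chosen time slice
        bits[i] ^= 1                  # toggle pulse / no pulse
        new = score(bits)
        if new > best:
            best = new                # keep the improving flip
        else:
            bits[i] ^= 1              # otherwise revert it
    return bits, best
```

Because only improving single-bit moves are accepted, the search stops making progress at the first string from which no single flip helps, which is the local trapping discussed in the text.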
Our unified AlphaZero algorithm has an action space of 60 for the neural network, and thus we group together binary SFQ action choices of multiple time steps. For this purpose, we take larger steps in time, and the 60 action choices are given using bit strings from a randomly chosen basis (see Methods). We benchmark the different algorithms by using equal wall-time simulations. For all simulations presented in this paper, we used a wall-time of 50 hours on an Intel Xeon X5650 CPU (2.7 GHz) processor. Similar to Ref. [44], we use a population size of 70 with a mutation probability of 0.001 for the GA (see Methods).
The results are plotted in Fig. 2b. Amongst conventional approaches, we see the SD algorithm performs slightly better than the GA. We attribute this to the fact that the SD algorithm is a greedy exploitation algorithm, while the GA is an exploration algorithm performing random permutations. As with many exploration algorithms, learning can be quite slow. Meanwhile, the Q-learning algorithm performs especially poorly. However, this algorithm is a tabular-based method. Such methods are known to break down for larger search spaces. This is one reason why modern RL algorithms use deep neural networks instead, motivating also our use of AlphaZero. We emphasize that AlphaZero also contains a deep lookahead tree search, which we found crucial to the success of our RL implementation (having also tested DQN [31] against simpler problems). We see in Fig. 2b that AlphaZero indeed performs dramatically better than the greedy approach, with over an order of magnitude improvement in the low error regime. We attribute this drop in error to the existence of a quantum speed limit (QSL) at or near 60 ns, a minimum time for high-fidelity computation. This regime is known to be the most computationally difficult to optimize, with a high probability of local trapping [22,24,26].
AlphaZero and GA are both learning algorithms in the sense that they utilize previously obtained solutions in order to form new ones. We compare the learning curves for the two algorithms in Fig. 2c, where we have plotted the infidelity as a function of wall time at 60 ns. For AlphaZero, we use the infidelity after each episode, where each data point is unique. For GA, we use the best performing solution in the population after each iteration. Since GA is a relatively greedy algorithm it performs very well initially, but fails to explore the larger solution space as the members in the population converge upon a single class of solution and the learning curve flattens out.
In contrast, AlphaZero keeps a high level of exploration that ultimately allows it to reach a very large number of different high-fidelity solutions.

C. Constrained analog pulses
A common challenge within quantum optimization is achieving realistic and efficient controls when experimental limitations constrain the underlying dynamics. Such constraints become very important when high precision is required, e.g. for very high fidelity operation of quantum technologies. Here, we consider standard constraints on duration, bandwidth, and maximum energy. Such constraints can be expected to greatly increase the computational cost of Hessian approximation-based solutions, which are otherwise known to converge quickly [16] and generally outperform other greedy methods [21,22]. The workhorse algorithm for this is GRAPE [8], with quasi-Newton [20] and exact derivative [59] enhancements being crucial to the state of the art and its super-linear convergence.
We model the bandwidth constraints via a convolution with a Gaussian filter function, Eq. (3), where Ω(t) denotes the filtered control function. Fig. 3a illustrates the effect of this filter. Here, a piecewise constant pulse (dark blue) with amplitudes $a_1$ to $a_4$ is convolved into a smooth pulse (light orange) via Eq. (3).
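The effect of such a filter on a piecewise constant pulse can be illustrated numerically. The sketch below is an assumption-laden stand-in, not the filter of Eq. (3): it uses a normalized Gaussian kernel of standard deviation `sigma` truncated at four standard deviations, and it extends the first and last amplitudes beyond the pulse edges, a boundary treatment chosen only for convenience.

```python
import numpy as np

def filtered_pulse(amplitudes, step, sigma, substeps):
    """Smooth a piecewise constant pulse with a normalized Gaussian kernel.

    amplitudes: one value per time step of length `step`.
    substeps:   number of samples per time step (the resolution).
    """
    dt = step / substeps
    raw = np.repeat(np.asarray(amplitudes, dtype=float), substeps)
    half = int(np.ceil(4 * sigma / dt))              # truncate the kernel at 4 sigma
    t = np.arange(-half, half + 1) * dt
    kernel = np.exp(-t**2 / (2 * sigma**2))
    kernel /= kernel.sum()                           # unit area: constants stay constant
    padded = np.pad(raw, half, mode="edge")          # assumed boundary handling
    return np.convolve(padded, kernel, mode="valid")

# Four amplitudes in units of the maximum amplitude, 4 ns steps, sigma = 0.7 ns.
smooth = filtered_pulse([0.2, 1.0, 0.0, 0.6], step=4.0, sigma=0.7, substeps=50)
```

The output has one sample per substep, which is the grid on which a discretized propagator would treat the filtered pulse as constant.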
Throughout the remainder of this paper, we constrain the pulse amplitude to lie between 0 and $\Omega_\mathrm{max}/2\pi$ = 1.0 GHz.
Most commonly, GRAPE is applied to piecewise constant pulses, but it can be modified to include filtering [59,60], as we also do here. Each time step is divided into a number of substeps (giving the resolution) and the filtered pulse is then approximated as being constant within each substep. This subdivision is depicted in Fig. 3a as light orange vertical lines. In order to obtain the gradient, GRAPE calculates the time-evolution unitary using matrix exponentiation at each substep.
Fig. 3b shows the error (infidelity) between the exact and discretized unitaries as a function of the resolution. If we seek errors below the desired gate error ($10^{-2}$), the resolution should be around a couple of hundred. This significantly impedes the performance of GRAPE for this type of problem, since it requires considerably more matrix multiplications. A different strategy is to limit the control to a set of discretized amplitudes whose corresponding unitaries can be calculated in advance and then apply a discretized optimization algorithm such as AlphaZero. In order to do so, we apply a two-action update strategy, where we propagate from halfway through the previous pulse to halfway into the next one. So, if the previous action was $a_2$ and the next one $a_3$, then the unitary $U_{2,3}$ would correspond to the shaded region in Fig. 3a. Here we ignore negligible contributions from adjacent pulses. For instance, calculating $U_{2,3}$ would be independent of $a_1$ and $a_4$. Here we limit the amplitude to 60 different values (out of a continuous set), hence this method requires calculating $60^2 = 3600$ unitaries, which we do at the beginning of the simulation.
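The two-action lookup strategy amounts to building a table indexed by (previous action, next action) once, then multiplying table entries during the search. The following sketch shows only that bookkeeping; `window_unitary` is a hypothetical user-supplied function that would integrate the filtered dynamics over the half-to-half window, and the helper names are assumptions.

```python
import numpy as np
from itertools import product

def build_table(amplitudes, window_unitary):
    """Precompute U[i, j] for every ordered pair of discretized amplitudes."""
    return {(i, j): window_unitary(a, b)
            for (i, a), (j, b) in product(enumerate(amplitudes), repeat=2)}

def propagate(actions, table, dim):
    """Multiply the precomputed window unitaries along a sequence of action indices."""
    u = np.eye(dim, dtype=complex)
    for prev, nxt in zip(actions[:-1], actions[1:]):
        u = table[(prev, nxt)] @ u    # window from mid-`prev` to mid-`nxt`
    return u
```

With 60 amplitudes the table holds $60^2 = 3600$ matrices, after which each step of an episode costs a single lookup and one matrix product regardless of the filter resolution.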
In our comparison between AlphaZero and GRAPE, we choose 4.0 ns convolved pulses using σ = 0.7 ns. For GRAPE, we choose a resolution of 200. Fig. 3c shows the results of an equal wall-time simulation. Here, AlphaZero obtains a systematic improvement over its domain-specialized counterpart. At 96 ns, AlphaZero outperforms GRAPE with an improvement that is significantly above one order of magnitude. Interestingly, both curves show significant fluctuations, which we attribute to the difficulty of the optimization task itself caused by the highly constrained dynamics. This is likely compounded by the random initialization of the neural network, which can affect the convergence properties of AlphaZero. Despite these fluctuations, AlphaZero performs significantly better in the regime of interest corresponding to infidelities below 10

[Figure residue: axis labels "Gate duration [ns]" and "Success fraction"; legend entries AZ, Hybrid, GRAPE.]

D. Piecewise-constant analog pulses
So far, we have considered problems where gradient searches have not been applicable (digital sequence) or where gradient searches become inefficient (constrained analog pulses). For specific tasks where highly specialized algorithms exist and are known to perform relatively well, domain-agnostic algorithms typically perform inadequately. Thus, to properly benchmark our algorithm we have also considered the domain of piecewise constant pulses, a scenario where GRAPE typically performs extremely well due to the presence of high-frequency components and the limited number of matrix multiplications. In the following we hence focus on piecewise constant pulses where we choose a single step duration of 2 ns.
In this scenario, we characterize the performance of the exploitation and exploration algorithms in terms of both the variety of solutions found and the quality of the solutions. At first, we compare the algorithms already discussed, namely Q-learning, Stochastic Descent, AlphaZero, and GRAPE. Fig. 4a shows GRAPE is able to outperform the other algorithms for piecewise constant pulses. However, AlphaZero still performs well despite its limitation of only having amplitude-discretized controls. To improve the AlphaZero algorithm further we conceive a hybrid algorithm where GRAPE optimizes the solutions found by AlphaZero. The hybrid algorithm, which is given the same wall-time as the others, is also plotted in Fig. 4a. Here the hybrid algorithm shows a significant improvement over GRAPE near 60 ns, which we again relate to the presence of a quantum speed limit where the optimization task becomes difficult due to induced traps in the fidelity landscape. It is also worth noting that the optimization curve flattens out and the two algorithms again perform equally well when the pulse duration goes beyond 62 ns. We attribute this to the existence of a secondary QSL, i.e. further improvement below $10^{-4}$ in infidelity requires gate durations beyond 200 ns (not plotted here). We also quantify the number of successful solutions found by either GRAPE or the hybrid AlphaZero algorithm, which we define as solutions having infidelities within four times the lowest infidelity obtained. The fraction of successful solutions is plotted in Fig. 4b. Here the improvement is even more substantial. At 60 ns, we find almost three orders of magnitude more successful solutions compared to GRAPE with random seeding. The fact that the GRAPE curve dips around 60 ns seems to confirm our previous statement about the QSL in the sense that this is a combinatorially harder region in which to obtain relatively good solutions. Having a large number of good solutions is especially important because experimentally it may be that some are better suited or some provide additional advantages.
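The success criterion used above (infidelity within four times the lowest infidelity obtained) is simple to compute. In the sketch below the function name and the optional `best` argument are assumptions; `best` lets two algorithms be scored against a shared reference infidelity rather than each against its own minimum.

```python
def success_fraction(infidelities, best=None, factor=4.0):
    """Fraction of solutions whose infidelity is within `factor` times the best one."""
    best = min(infidelities) if best is None else best
    hits = sum(1 for x in infidelities if x <= factor * best)
    return hits / len(infidelities)

# Two of these four solutions lie within 4 x 1e-3.
example = success_fraction([1e-3, 3e-3, 5e-3, 1e-2])
```

Whether the reference is the best infidelity per algorithm or the best overall changes the fraction, so the choice should be stated when comparing curves.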
To further investigate the differences between the two algorithms, we compare the exploration of the control parameter landscape using a two-dimensional embedding provided by the t-SNE visualization method [11,61].
We do a single t-SNE analysis at 60 ns, plotted in Fig. 5, which we have separated for clarity into different figures for GRAPE (a), the Hybrid before optimization (b), and the Hybrid after optimization (c). Here the color scale depicts the infidelity. Strikingly, the two algorithms seem to prefer entirely different portions of the landscape. GRAPE mostly finds solutions to the left in the t-SNE representation, but its high-performing solutions are actually to the right. Interestingly, AlphaZero primarily finds solutions in the right region, which implies that AlphaZero has identified an underlying generic structure of good solutions. When all the AlphaZero solutions are optimized, this leads to a large quantity of high-performing solutions that inhabit the same region in the t-SNE representation.
We also see that the hybrid solutions naturally cluster towards some general basins of attraction. This suggests that AlphaZero has not converged on a single class but multiple different classes of solutions with different underlying physics. Some pulses from different clusters are depicted, showing some resemblance to typical bang-bang sequences. The different clustering that occurs demonstrates that a global exploration has indeed taken place, effectively finding different classes of solutions in different parts of the landscape.
We further test the hypothesis that AlphaZero has found underlying structure that supersedes a shallow heuristic search.
Note that the solutions seem to have at least some symmetry with respect to a reflection around the center of the time axis. In fact, this symmetry already exists in the control problem. Since the Hamiltonian is real and the target is its own transpose, the fidelity is unchanged if the pulse sequence is reversed, i.e. $F[\Omega(t)] = F[\Omega(T - t)]$. However, it is not a priori clear that satisfying this symmetry is a good control strategy. We quantify the degree of time-asymmetry in the pulses via a measure C, where C = 0 implies pulses that are completely palindromic, i.e. symmetric with respect to reversal of the sequence. We plot in Fig. 6 the infidelity and the asymmetry for the two algorithms, i.e. for GRAPE using random seeding (a) and the Hybrid, i.e. GRAPE using the AlphaZero solutions (b). Here the color scale depicts the iteration number. The first thing to notice is that high-fidelity solutions tend to maintain this symmetry. The second feature is that GRAPE often only partially satisfies this symmetry. In contrast, AlphaZero learns over its training to increasingly prefer this symmetry, moving towards the bottom left of the plot. After post-optimization using GRAPE, the solutions improve significantly in infidelity and move even further to the bottom left, emphasizing this trend. We conclude that AlphaZero has identified this underlying symmetry specific to the problem instance we have chosen. Naturally, hard-coding such heuristics would not only be inefficient, but for many problems finding symmetries is nontrivial. Using deep learning, AlphaZero is able to learn these hidden symmetries without the need for human intervention. We therefore expect that AlphaZero's ability to learn hidden problem structures generalizes to other problems as well.
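The time-reversal symmetry stated above can be checked numerically on a toy model. The sketch below uses a single qubit with a real symmetric Hamiltonian and a symmetric target, not the two-transmon system of this work; the fidelity formula is an assumed overlap form, and `asymmetry` is a hypothetical stand-in for the measure C (it only shares the property of vanishing for palindromic pulses and is undefined for an all-zero pulse).

```python
import numpy as np

def propagator(pulse, h_drift, h_ctrl, dt):
    """Time-ordered product of exp(-i (H_d + a H_c) dt) over a piecewise constant pulse."""
    u = np.eye(h_drift.shape[0], dtype=complex)
    for amp in pulse:
        evals, evecs = np.linalg.eigh(h_drift + amp * h_ctrl)
        u = (evecs * np.exp(-1j * evals * dt)) @ evecs.conj().T @ u
    return u

def fidelity(u, target):
    """Assumed overlap fidelity |Tr(target^dagger U) / d|^2."""
    return abs(np.trace(target.conj().T @ u) / u.shape[0]) ** 2

def asymmetry(pulse):
    """Hypothetical asymmetry measure: 0 for palindromic pulses, larger otherwise."""
    p = np.asarray(pulse, dtype=float)
    return np.abs(p - p[::-1]).sum() / (2 * np.abs(p).sum())

sz = np.array([[1.0, 0.0], [0.0, -1.0]])
sx = np.array([[0.0, 1.0], [1.0, 0.0]])
hadamard = (sx + sz) / np.sqrt(2)                  # real, symmetric target
pulse = np.random.default_rng(1).uniform(0, 1, 12)
f_forward = fidelity(propagator(pulse, 0.5 * sz, sx, 0.3), hadamard)
f_reverse = fidelity(propagator(pulse[::-1], 0.5 * sz, sx, 0.3), hadamard)
```

The two fidelities agree because each step unitary is a symmetric matrix, so reversing the order of the product transposes the full propagator, and the trace overlap with a symmetric target is invariant under that transpose.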

II. DISCUSSION
From our three examples, we conclude that the AlphaZero methodology of combining neural network and guided tree search reinforces global information about good solutions that can also mark a significant algorithmic advantage for quantum optimization. This is true for specific problems, but especially when comparing across a range of problems. None of the other algorithms we have considered is able to do well on all three problems, be it with heuristic, machine learning, or domain-specialized approaches.
The three problems considered marked different optimization tasks, but AlphaZero is able to find high-fidelity solutions with a single set of algorithmic hyperparameters. This suggests that learning the control landscape can be performed with minimal expert knowledge about the physical problem.
This conclusion is further reinforced by the realization that hidden symmetries in the dynamics can be effectively learned by AlphaZero during its training. Such unexpected symmetries are not trivial to find for many Hamiltonians and would require significant human intervention even where they can be found. Moreover, hard-coding such heuristics into optimization algorithms can have many pitfalls, limiting broad exploration and potentially leading to suboptimal trapping in the optimization landscape.
Nonetheless, because the deep exploration methodology is by design agnostic to expert knowledge, it is most powerful when combined with specialized knowledge about locally exploiting promising seeds, leveraging the vast body of literature about local quantum optimization. This tradeoff between exploitation and exploration is a common trend in reinforcement learning and optimization in general. For example, in AlphaZero's chess matches with its competing AI, Stockfish [62], the latter was trained with sophisticated domain knowledge and thus was generally acknowledged as outperforming in the final moves of games. Combining the domain-agnostic exploration of the former with the domain-specialized exploitation of the latter seems like a common-sense solution, as we have done here in the quantum dynamics case. An even tighter integration of the two approaches that examines the tradeoffs during different learning stages may also be promising. Alternatively, one could also relax the tabula rasa character of the learning to enhance the exploration abilities using specialized knowledge. Supervised learning can in principle speed up the initial learning phase, perhaps most seamlessly when integrated with other broad exploration strategies, for instance crowd sourcing [11,63].
In this work we have considered digital, constrained, and underconstrained optimization of controlled quantum dynamics in the context of the design and execution of physical quantum-mechanical devices. This choice was deliberately made because the most advanced algorithms exist in this field owing to half a century of dedicated research. That being said, many of the more abstract and potentially groundbreaking dynamics algorithms, including those used in the design of digital sequences of quantum circuits or for analog evolutions in annealers and variational eigensolvers, can be seen as direct analogues of the algorithmic framework illustrated here.

Reinforcement Learning
A general RL setup consists of an environment and an agent. At each time step t, the environment is characterized by a state $s_t$. Given $s_t$, the agent selects an action $a_t$ that changes the environment to a new state $s_{t+1}$. Based on this change the agent receives a feedback signal called a reward, $r_{t+1} \in \mathbb{R}$. The agent must learn how to maximize the sum of rewards it receives during an episode. This is done by implementing a policy π, which is a mapping from all states of the environment to probabilities of selecting possible actions, $\Pr(a|s) = p_a(s)$. The state-value function $v_\pi(s)$ describes the quality of a given policy: it is simply the expected sum of future rewards starting from state s and subsequently following the policy π. Given two policies π and π′, we say that π ≥ π′ if $v_\pi(s) \ge v_{\pi'}(s)$ for all states s.
The task considered here is to construct a pulse sequence which realizes a target unitary. At each time step, the agent must select an action that updates the unitary representing the state of the system. At each time step, the reward is zero except at the last step, where it is simply the fidelity given by equation (1).

AlphaZero implementation
AlphaZero is a policy improvement algorithm that combines a neural network with a Monte Carlo Tree Search (MCTS) as depicted in Fig. 1b and c [46,47]. The neural network maps from states to policies $p = (p_1, p_2, \ldots)$ and values v. The MCTS, guided by the neural network, also computes a policy π that the actions are drawn from. At each time step, the policy π is stored in a replay buffer. At the end of an episode, the final score $z = \sum_t r_t$ is also stored in the buffer. Training of the neural network uses data drawn uniformly at random from the replay buffer in order to let the network predictions (p, v) approach the stored data (π, z). This is done by minimizing a loss function whose last term is an L2 regularization with respect to the network parameters θ.
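For orientation, the training objective can be sketched for a single buffer sample. The explicit form below is an assumption: it is the loss commonly quoted for AlphaZero-type methods (squared value error, policy cross-entropy, and L2 penalty with weight c), which is consistent with the description above but is not confirmed by it. The function name and argument layout are hypothetical.

```python
import math

def alphazero_loss(p, v, pi, z, theta, c=0.001):
    """Assumed per-sample loss: (z - v)^2 - pi . log(p) + c * ||theta||^2."""
    value_term = (z - v) ** 2                                   # v should approach z
    policy_term = -sum(t * math.log(q) for t, q in zip(pi, p) if t > 0)  # p should approach pi
    reg_term = c * sum(w * w for w in theta)                    # L2 penalty on the weights
    return value_term + policy_term + reg_term

# One sample: network predicts a uniform policy and v = 0.5; the search
# preferred action 0 and the episode ended with score z = 1.
loss = alphazero_loss(p=[0.5, 0.5], v=0.5, pi=[1.0, 0.0], z=1.0, theta=[1.0, 2.0])
```

In practice the loss would be averaged over a minibatch drawn uniformly from the replay buffer and minimized with the optimizer described in the text.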
An MCTS is a way of looking several steps ahead by only visiting a small subset of possible future states. The tree is built from nodes (states) connected to each other by edges (state-action pairs). Each edge has four numbers associated with it: the number of visits N(s, a), the total action value W(s, a), the mean action value Q(s, a), and a prior probability of selecting that edge P(s, a). Starting from the root node (initial state), a single tree search moves through the tree by selecting actions according to $a_t = \arg\max_a (Q(s_t, a) + U(s_t, a))$, where $U(s_t, a)$ denotes an uncertainty term whose strength is set by $c_\mathrm{puct}$, a parameter determining the level of exploration. If a terminal node or a leaf node (i.e. a not-previously-visited state) is encountered, the search stops.
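The selection rule can be sketched as follows. The explicit uncertainty term used here, $U(s,a) = c_\mathrm{puct} P(s,a) \sqrt{\sum_b N(s,b)} / (1 + N(s,a))$, is an assumption taken from the form commonly used in AlphaZero-type searches; the edge representation as a dictionary of the four statistics is likewise hypothetical.

```python
import math

def select_action(edges, c_puct=1.0):
    """Pick argmax_a Q(s, a) + U(s, a) over the outgoing edges of one node.

    edges: dict mapping action -> {"N": visits, "W": total value,
                                   "Q": mean value, "P": prior}.
    """
    total_visits = sum(e["N"] for e in edges.values())

    def score(action):
        e = edges[action]
        # Assumed PUCT form: large for high-prior, rarely visited edges.
        u = c_puct * e["P"] * math.sqrt(total_visits) / (1 + e["N"])
        return e["Q"] + u

    return max(edges, key=score)

edges = {
    0: {"N": 10, "W": 5.0, "Q": 0.5, "P": 0.5},   # well explored, decent value
    1: {"N": 0, "W": 0.0, "Q": 0.0, "P": 0.5},    # never visited
}
chosen = select_action(edges, c_puct=1.0)
```

With $c_\mathrm{puct} = 0$ the rule reduces to greedy selection on Q, while larger values push the search toward unvisited edges, as in the example where the unvisited action wins.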
The tree is expanded in the latter case by adding the node and initializing its edges as N(s, a) = W(s, a) = Q(s, a) = 0 and P(s, a) = p_a, where p_a is given by the neural network. The rest of the tree is updated by using the state-value v in a backwards pass through all the edges visited since the root node according to N(s, a) ← N(s, a) + 1, W(s, a) ← W(s, a) + v, and Q(s, a) ← W(s, a)/N(s, a). After a pre-set number of such searches have been conducted, an actual policy π is calculated at the root state s_0, with a parameter τ controlling the level of exploration, which is annealed during the simulations. The action is drawn from the policy and the rest of the tree is reused for subsequent searches during the episode.
For all tasks presented in this paper we used the same algorithmic parameters. The learning rate was 0.01, c_puct = 1.0, and τ was hyperbolically annealed from 1.0 using an annealing rate of 0.001. After τ was annealed below a value of 0.90 we switched to deterministic policies by setting the largest policy value to one and the others to zero. The neural network was a simple feed-forward network with four hidden layers. Each layer contained 400 nodes followed by batch normalization and a rectified linear unit. Both the policy head and the value head of the neural network consisted of a single hidden layer as well, where the policy head ended in a sigmoid layer with the same dimension as the action space and the value head ended in a single linear node. The L2 regularization parameter was c = 0.001 and we used stochastic gradient descent (SGD) for training the network. Similar to the AlphaZero paper [46], we achieve more exploration by adding Dirichlet noise to the search probabilities for the root nodes, P(s, a) = (1 − ε)p_a + εη_a, where η ∼ Dir(0.03) and ε = 0.25.
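The backward pass, the search policy and the root noise can be sketched as below. The update rules and the noise mixing follow the text; the form of the search policy, π(a|s_0) ∝ N(s_0, a)^(1/τ), is an assumption taken from the original AlphaZero formulation [46], and the dictionary-of-arrays storage and all function names are hypothetical.

```python
import numpy as np

def backup(path, v, N, W, Q):
    """Backward pass: update every edge (s, a) visited since the root with value v."""
    for s, a in path:
        N[s][a] += 1
        W[s][a] += v
        Q[s][a] = W[s][a] / N[s][a]

def search_policy(root_visits, tau):
    """Assumed AlphaZero form: pi(a|s_0) proportional to N(s_0, a)^(1/tau)."""
    scaled = np.asarray(root_visits, dtype=float) ** (1.0 / tau)
    return scaled / scaled.sum()

def make_deterministic(pi):
    """Set the largest policy value to one and all others to zero."""
    out = np.zeros_like(pi)
    out[np.argmax(pi)] = 1.0
    return out

def add_root_noise(p, rng, eps=0.25, alpha=0.03):
    """P(s, a) = (1 - eps) * p_a + eps * eta_a with eta ~ Dir(alpha)."""
    eta = rng.dirichlet(alpha * np.ones(len(p)))
    return (1.0 - eps) * np.asarray(p, dtype=float) + eps * eta
```

Lowering τ sharpens the policy towards the most visited action; for instance, visit counts (1, 3) give probabilities (0.25, 0.75) at τ = 1 and (0.1, 0.9) at τ = 0.5.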

GA implementation
A genetic algorithm (GA) works by iteratively updating a population of solutions, which are bit strings [56,57]. A GA generates new solutions from the old population via processes inspired by biological evolution, namely crossover and mutation, which respectively combine two parent solutions and flip individual bits at random. If any improved solutions are found, these replace the worst ones in the population. Similar to Ref. [44], we used a population size of 70 and a mutation probability of 0.001. At each iteration we selected 2 × 30 parent solutions.
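One iteration of such a GA might look like the sketch below. The population size, mutation probability and number of parent pairs follow the text, whereas uniform parent selection, single-point crossover, one child per pair, and the toy fitness (fraction of ones) are assumptions made purely for illustration; in the actual problem the fitness would be the gate fidelity.

```python
import numpy as np

def genetic_step(pop, fitness, rng, n_pairs=30, p_mut=0.001):
    """One GA iteration: crossover, mutation, and replacement of the worst members."""
    scores = np.array([fitness(ind) for ind in pop], dtype=float)
    children = []
    for _ in range(n_pairs):
        i, j = rng.choice(len(pop), size=2, replace=False)  # uniform selection (assumption)
        cut = rng.integers(1, pop.shape[1])                 # single-point crossover (assumption)
        child = np.concatenate([pop[i, :cut], pop[j, cut:]])
        flip = rng.random(child.size) < p_mut               # flip individual bits at random
        children.append(np.where(flip, 1 - child, child))
    for child in children:
        worst = int(np.argmin(scores))
        f = fitness(child)
        if f > scores[worst]:                               # improved solutions replace the worst
            pop[worst] = child
            scores[worst] = f
    return pop, scores

# Toy run with a hypothetical fitness: the fraction of ones in a 20-bit string.
rng = np.random.default_rng(0)
population = rng.integers(0, 2, size=(70, 20))
count_ones = lambda bits: bits.mean()
best_before = max(count_ones(ind) for ind in population)
for _ in range(50):
    population, scores = genetic_step(population, count_ones, rng)
```

Because only the worst members are ever replaced, the best score in the population cannot decrease from one iteration to the next.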

Q-learning implementation
Similar to equation (5), one can define an action-value function, which is the expected reward if we choose action a from state s and then follow the policy π [30]. Q-learning is a tabular RL algorithm that approximates the optimal action-value function, i.e. the action-value function for the optimal policy π_opt = arg max_π v_π(s). The approximation Q(s, a) is initialized at random and subsequently updated with a learning rate α. Similar to Ref. [24], we choose our state to be a tuple of time and control, s = (t, Ω). The learning rate was α = 0.001 and we followed an epsilon-greedy strategy with linear annealing of epsilon [30].
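A minimal tabular sketch is given below. It assumes the textbook Q-learning update [30], Q(s, a) ← Q(s, a) + α[r + γ max_a′ Q(s′, a′) − Q(s, a)], with a discount factor γ = 1 chosen here for the episodic setting; the discount and all function names are assumptions of this example.

```python
import numpy as np
from collections import defaultdict

def make_q_table(n_actions, rng):
    """Table keyed by the state tuple s = (t, Omega), initialized at random."""
    return defaultdict(lambda: rng.random(n_actions))

def epsilon_greedy(q, s, eps, rng):
    """Explore with probability eps, otherwise act greedily on Q."""
    if rng.random() < eps:
        return int(rng.integers(len(q[s])))
    return int(np.argmax(q[s]))

def q_update(q, s, a, r, s_next, terminal, alpha=0.001, gamma=1.0):
    """Assumed textbook update; gamma = 1 is an assumption for episodic tasks."""
    target = r if terminal else r + gamma * np.max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])
```

In the setting described above, `r` would be zero for all but the final step of an episode, so the fidelity propagates backwards through the table over repeated episodes.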

Cross Resonance Gate
The cross resonance (CR) gate [50,64,65] is currently the standard fixed-frequency qubit entangling gate used on transmon systems. Its main advantage is avoiding the overhead associated with magnetic (flux) tuning of the frequency [66,67], which can be a leading cause of dephasing. As illustrated in Fig. 1a, the physical setup we optimize includes two fixed-frequency qubits that are coupled to each other via a transmission line resonator. The transmons [67] may be modelled as anharmonically spaced Duffing oscillators [68], resulting in an extended Jaynes-Cummings model Hamiltonian, where b̂†_{1,2} (b̂_{1,2}) and â† (â) are the transmon and cavity creation (annihilation) operators, respectively. Here ω_1 ≠ ω_2 are the transmon resonance frequencies, δ_{1,2} denotes the anharmonicities, ω_r denotes the cavity resonance, and g_{1,2} the transmon-cavity couplings. The transmons are directly driven by external control parameters Ω(t), increasing the controllability compared to earlier architectures that drive through the common cavity. The transition of the second qubit is then driven resonantly through the control line of the first [60]. This model may be significantly simplified using the method in Ref. [50]. After adiabatic elimination of the cavity and block diagonalization into the qubit subspace, the authors derive an equivalent equation (Eq. 3.3), which is the same as our Eq. (2).
To see that the natural gate produced from this Hamiltonian is a √ZX gate, a (Schrieffer-Wolff) perturbative expansion [65] gives the leading coefficients of the effective driving terms, in which Z and X are Pauli matrices acting on the respective qubits, I is the identity, and m is a hand-tuned crosstalk parameter. The single-qubit terms and higher-order terms (not shown) must be decoupled in the control optimization in order to correctly implement the CR gate.
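For orientation, the target gate can be written down numerically. The sketch below assumes the convention √ZX = exp(−i(π/4) Z⊗X); sign and global-phase conventions differ between references, so this is an illustrative choice rather than the definition used for the fidelity in this work.

```python
import numpy as np

Z = np.array([[1, 0], [0, -1]], dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
ZX = np.kron(Z, X)  # Z on the first qubit, X on the second

def zx_rotation(theta):
    # (ZX)^2 = I, so exp(-i theta ZX) = cos(theta) I - i sin(theta) ZX.
    return np.cos(theta) * np.eye(4) - 1j * np.sin(theta) * ZX

# Assumed convention for the square root of ZX (up to a global phase).
sqrt_zx = zx_rotation(np.pi / 4)
```

Applying `sqrt_zx` twice yields ZX up to a global phase of −i, which is what justifies the name.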

Digital pulses
For each time step, the evolution of the system is governed by one of two unitaries, Û_0 and Û_1, which respectively correspond to the amplitude being zero or not [69]. We calculate these unitaries in advance by solving the Schrödinger equation numerically. The entire pulse sequence can be encoded as a bit string, as illustrated to the left in Fig. 2a, and the corresponding unitary can be calculated as Û(T) = [...] = 0) linearly from one (i = 1) to zero (i = 60). The 60 unitaries constitute the action space and the unitary is now calculated as Û(t) = ∏_{t′≤t} Û(a_{t′}). The 60 actions allow us to use the same neural network architecture as for piecewise constant and filtered pulses, which have the same input space dimension.
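The bit-string encoding of a digital pulse sequence can be sketched as follows. The two precomputed unitaries are replaced by arbitrary single-qubit stand-ins, and the ordering convention (later time steps multiply from the left) is an assumption of this example.

```python
import numpy as np

def pulse_unitary(bits, u0, u1):
    """Total unitary of a digital pulse train: u1 where the bit is 1, u0 where it is 0."""
    u = np.eye(u0.shape[0], dtype=complex)
    for b in bits:
        u = (u1 if b else u0) @ u  # later time steps multiply from the left
    return u

# Hypothetical stand-ins for the precomputed unitaries (chosen not to commute).
X = np.array([[0, 1], [1, 0]], dtype=complex)
u0 = np.diag([1.0, 1.0j])                             # no pulse during the time step
u1 = np.cos(0.3) * np.eye(2) - 1j * np.sin(0.3) * X   # pulse applied during the time step
u_total = pulse_unitary([1, 0, 1], u0, u1)
```

Since the per-step unitaries are computed once in advance, evaluating a candidate bit string only requires a sequence of matrix products.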

Figure 1. a Circuit-QED architecture consisting of two qubits (colored boxes) mounted on either side of a transmission line resonator. The first qubit is directly driven at the resonance frequency of the second one for a cross-resonance gate. An example of a piecewise constant pulse is depicted below the setup. b The schematics of a Monte Carlo tree search. Here the nodes are depicted as pulse sequences and the edges as lines. A single search consists of a forward propagation, expansion, and a back-pass (see text). c The neural network architecture used in AlphaZero. The network takes the state (unitary) as input and outputs probabilities for selecting individual actions p = {p_1, p_2, ...} and an estimate of the final score (fidelity) v.

Figure 2. a Illustration of how an SFQ pulse train can be encoded into a bit string, along with a zoom-in that depicts the exact shape of a single pulse. b Infidelity (1 − F) as a function of gate duration for different discrete optimization algorithms. c A comparison between the infidelities obtained by AlphaZero and the GA at 60 ns. For AlphaZero, each dot represents the infidelity obtained at the end of a unique episode, while for the GA each dot represents the highest scoring member in the population after each iteration.

Figure 3. a A piecewise constant pulse (dark blue) convolved with a Gaussian filter (light orange). Here σ = 0.7 ns. b The error of the unitary as a function of its resolution. c Comparison between AlphaZero and GRAPE on the cross resonance gate using Gaussian filtered pulses.

Figure 4. a An equal wall-time comparison between the various algorithms. The AlphaZero (here abbreviated AZ) Hybrid is presented in the text. b The fraction of successful solutions found by AlphaZero Hybrid and the GRAPE algorithm.

Figure 5. Two-dimensional representation of the final pulse vectors at 60 ns using the t-SNE algorithm. The color scale shows the infidelity of the pulses. a GRAPE with random seeding, b AlphaZero, c the Hybrid, i.e. AlphaZero solutions after being optimized with GRAPE. In the latter case, some example high-fidelity pulses are shown.

Figure 6. The seeds and the GRAPE optimization at 60 ns for a randomly generated seeds and b AlphaZero's solutions, i.e. the Hybrid. The figures plot the infidelity (1 − F) as a function of the asymmetry measure (4). The color scale depicts the iteration of the algorithm.