Coherent transport of quantum states by deep reinforcement learning

Some problems in physics can be handled only after a suitable ansatz solution has been guessed, and such solutions prove resistant to generalization. The coherent transport of a quantum state by adiabatic passage through an array of semiconductor quantum dots is an excellent example of such a problem, where it is necessary to introduce a so-called counterintuitive control sequence. The deep reinforcement learning (DRL) technique, instead, has proven able to solve very complex sequential decision-making problems, despite a lack of prior knowledge. We show that DRL discovers a control sequence that outperforms the counterintuitive control sequence. DRL can even discover novel strategies when realistic disturbances affect an ideal system, such as detuning, or when dephasing or losses are added to the master equation. DRL is effective in controlling the dynamics of quantum states and, more generally, whenever an ansatz solution is unknown or insufficient to effectively treat the problem.

Some problems in physics are solved thanks to the discovery of an ansatz solution, namely a successful trial guess, but unfortunately there is no general method to generate one. Recently, machine learning has increasingly proved to be a viable tool to model hidden features and effective rules in complex systems. Among the classes of machine learning algorithms, deep reinforcement learning (DRL) [1] is providing some of the most spectacular results for its ability to identify strategies to achieve a goal in a complex space of solutions without prior knowledge [2]. Contrary to supervised learning, which has already been applied to quantum systems, such as the determination of high-fidelity gates and the optimization of quantum memories by dynamical decoupling, DRL has been proposed only very recently to maintain a physical system in its equilibrium condition [3]. To show the power of DRL, we apply it to the problem of coherent transport by adiabatic passage (CTAP) of a quantum state, encoded by an electron, through an array of quantum dots, whose ansatz solution is notoriously called counterintuitive because of its non-obvious barrier control gate pulse sequence. During the coherent adiabatic passage, the electron spends no time in the central quantum dot thanks to the simultaneous modulation of the couplings between the dots, which suitably drives the trajectory through the Hilbert space [4][5][6][7]. The system moves from an initial equilibrium condition to a different one, represented by the population of the last dot of the array.
By exploiting such an ansatz solution, namely pulsing the barrier control gates between the dots in a "reversed order" with respect to what intuition would naturally suggest, the process displays truly quantum mechanical behavior, provided that the array consists of an odd number of dots. We have previously explored silicon-based quantum information processing architectures [8,9], including the coherent transport by adiabatic passage of multiple-spin qubits in double quantum dots [10], heuristic search methods, such as genetic algorithms, to find a universal set of quantum logic gates [11,12], and the application of deep reinforcement learning to classical systems [13][14][15]. Here we demonstrate that DRL implemented in a compact neural network can, first of all, autonomously discover the analogue of the counterintuitive gate pulse sequence without any prior knowledge, therefore finding a control path in a problem whose solution is far from the equilibrium of the initial conditions.
More importantly, such a method outperforms the analytical solutions proposed in the past in terms of process speed, and it remains applicable when the system deviates from ideal conditions because of detuning between the quantum dots, dephasing, and losses. Under such conditions, no analytical approach exists, to the best of our knowledge. Here we exploit Trust Region Policy Optimization (TRPO) [16] to handle the CTAP problem. First, we compare the results discovered by the artificial intelligence algorithm with the ansatz solution known from the literature. Next, we apply the method to solve the system when the ground states of the quantum dots are detuned, and when the system is perturbed by the interaction with the uncontrollable degrees of freedom of the surrounding environment, resulting in dephasing and loss terms in the master equation describing the system, for which there is no analytical method. As in the case of artificial intelligence learning in the classical Atari environment [2], the agent here interacts with a QuTiP [17] simulation providing the environment of the CTAP, implementing the master equation of the system and exploiting the information retrieved from the feedback in terms of the temporal evolution of the populations of the dots. The use of a pre-trained neural network as a starting point to identify the solution of a modified master equation may further reduce the computation time by one order of magnitude. As a further advantage of this approach, a 2-time-slice Bayesian network (2TBN) analysis allows one to identify which parameters of the system most influence the process. Our investigation indicates that a key factor is the appropriate definition of the reward function, which deters the system from occupying the central quantum dot and rewards the occupation of the last quantum dot.

I. DEEP REINFORCEMENT LEARNING OF COHERENT TRANSPORT OF A QUANTUM STATE
Reinforcement learning (RL) is a set of techniques used to learn how to behave in sequential decision-making problems when no prior knowledge about the system dynamics is available, or the control problem is too complex for classical optimal-control algorithms.
RL methods can be roughly classified into three main categories: value-based, policy-based, and actor-critic methods [1]. Recently, actor-critic methods have proven to be successful in solving complex continuous control problems [18].
The idea behind actor-critic methods is to use two parametric models (e.g., neural networks) to represent both the policy (actor) and the value function (critic). The actor decides, in each state of the system, which action to execute, while the critic learns the value (utility) of taking each action in each state. Following the critic's advice, the actor modifies the parameters of its policy to improve its performance. Among the many actor-critic methods available in the literature, we selected the Trust Region Policy Optimization (TRPO) algorithm [16] to find an optimal policy of control pulses to achieve CTAP in a linear array of quantum dots. The choice of TRPO is motivated both by its excellent performance on a wide variety of tasks and by the relative simplicity of tuning its hyper-parameters [16] (see S.I.).
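The agent–environment interaction underlying this scheme can be sketched with a minimal loop. The classes, names, and placeholder dynamics below are illustrative assumptions only, not the actual QuTiP environment or the TRPO update:

```python
import random

class ToyCTAPEnv:
    """Illustrative stand-in for the simulated environment E.

    It exposes the reset/step interface the agent A interacts with;
    the internal dynamics are a placeholder, not the real master equation.
    """

    def __init__(self, n_steps=100):
        self.n_steps = n_steps
        self.reset()

    def reset(self):
        self.t = 0
        self.rho33 = 0.0  # population of the last dot (placeholder)
        return (self.t, self.rho33)

    def step(self, action):
        omega12, omega23 = action
        # Placeholder dynamics: transfer improves when Omega_23 leads Omega_12,
        # loosely mimicking the counterintuitive ordering.
        self.rho33 = min(1.0, self.rho33 + 0.01 * max(0.0, omega23 - omega12))
        self.t += 1
        reward = self.rho33 - 1.0  # non-positive; 0 only at complete transfer
        done = self.t >= self.n_steps
        return (self.t, self.rho33), reward, done

def random_policy(observation):
    """The actor: TRPO would replace this with a trained neural network."""
    return (random.random(), random.random())

def run_episode(env, policy):
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        observation, reward, done = env.step(policy(observation))
        total_reward += reward
    return total_reward
```

In the actual setup the environment wraps a master-equation simulation and the policy parameters are updated episode by episode under TRPO's trust-region constraint.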
Coherent transport by adiabatic passage is the solid-state version of a method developed for Stimulated Raman Adiabatic Passage (STIRAP) [7,19], relevant, for instance, in those quantum information processing architectures that require shuttling a qubit from one location to another while minimizing the information loss during transport.
In solid-state quantum devices based on either silicon [20] or gallium arsenide [21], the qubit can be encoded, for instance, into spin states of either excess electron(s) or hole(s) in quantum dots [9]. CTAP was originally developed for single-electron states in singly-occupied quantum dots, but it can also be extended to more complex spin states, such as hybrid qubits based on triplets of spins [10]. If one imagines employing, for instance, an array of dopants in silicon [22], a reasonable inter-dopant spacing is of the order of 20 nm, with a hopping time of 100 ps [23]. The adiabatic passage needs control pulses with a bandwidth one or two orders of magnitude below the inverse hopping time, which can be managed by conventional electronics [24]. To demonstrate the exploitation of deep reinforcement learning, we start from the simplest case of CTAP across a chain of three identical quantum dots. The deep reinforcement learning architecture is depicted in Figure 1a. The simulation of the physical system that supports the CTAP consists of the environment E, which receives as input the updated values of the parameters (in our case the coupling terms Ω i,i+1, with i = 1, 2, between adjacent dots i and i+1) that reflect the action on the control gates as calculated by the agent A according to the policy π. In turn, the environment E computes the new system state (here expressed in terms of the density matrix of the triple quantum dot device) and provides feedback to the agent A. The agent A calculates the next input parameters after evaluating the effects of the previous input according to a reward function r t , which is expressed in terms of the system state.
More explicitly, the ground state of each quantum dot is tuned with respect to the others by external top metal gates (not shown for simplicity in the sketch of the environment E in Figure 1a), while the coupling between two neighboring quantum dots is in turn controlled by additional barrier control gates. The idealized physical system is prepared so that the ground states of the three quantum dots have the same energy. As the reference energy is arbitrary, we can set E 1 = E 2 = E 3 = 0 without any loss of generality. The Hamiltonian therefore reads:

H(t) = ℏ/2 [Ω 12 (t)(|1⟩⟨2| + |2⟩⟨1|) + Ω 23 (t)(|2⟩⟨3| + |3⟩⟨2|)]   (1)

One of the three eigenstates is special in that it is expressed as a function of the states of the first and third dots only and reads

|D 0 ⟩ = cos θ 1 |1⟩ − sin θ 1 |3⟩,

where θ 1 = arctan(Ω 12 /Ω 23 ). A suitable time evolution of the values of the coupling terms Ω 12 (t) and Ω 23 (t) between the dots allows one to transform |D 0 ⟩ from |1⟩ at t = 0 to |3⟩ at t = t max . If the system is prepared in |D 0 ⟩ at t = 0 and the evolution is adiabatic, it will remain in the same eigenstate at all times; the eigenstates are explicitly expressed in the S.I. The effectiveness of the pulse sequence of the barrier control gates, which is reflected in the coupling terms Ω i,i+1 , is assessed in terms of the maximum coupling Ω max . The remarkable fact is that the two pulses must be applied in the so-called counterintuitive sequence, meaning that the first gate controlling Ω 12 (t) is operated second in time, while the second gate acting on Ω 23 (t) is operated first. Such a pulse sequence drives the occupation of the first dot ρ 11 (t) to zero and that of the last dot ρ 33 (t) to 1, while keeping the central dot empty (ρ 22 (t) = 0) [4]. It is worth mentioning that a different ansatz combination of pulse shapes has recently been proposed to speed up the process [25]. Generally speaking, there is no analytical method to optimize the pulses, so further improvements rely on still undiscovered ideas. Here is where the power of deep reinforcement learning comes into play.
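The counterintuitive ordering can be checked by direct numerical integration of the Schrödinger equation for the three-dot Hamiltonian. The sketch below uses Gaussian pulses with illustrative parameters (Ω max, t max, σ, and the pulse centres are assumptions, not the values used in the paper) and a plain fourth-order Runge–Kutta stepper:

```python
import math

OMEGA_MAX, T_MAX, SIGMA = 20.0, 10.0, 1.25  # illustrative units, hbar = 1

def omega12(t):
    # first barrier gate, pulsed SECOND (counterintuitive order)
    return OMEGA_MAX * math.exp(-((t - (T_MAX / 2 + SIGMA)) ** 2) / (2 * SIGMA ** 2))

def omega23(t):
    # second barrier gate, pulsed FIRST
    return OMEGA_MAX * math.exp(-((t - (T_MAX / 2 - SIGMA)) ** 2) / (2 * SIGMA ** 2))

def deriv(t, psi):
    """d|psi>/dt = -i H(t)|psi>, with only nearest-neighbour couplings
    Omega_12/2 and Omega_23/2 and degenerate on-site energies E_i = 0."""
    o12, o23 = omega12(t), omega23(t)
    h_psi = [0.5 * o12 * psi[1],
             0.5 * (o12 * psi[0] + o23 * psi[2]),
             0.5 * o23 * psi[1]]
    return [-1j * h for h in h_psi]

def simulate(steps=4000):
    """RK4 integration; returns the final populations rho_11, rho_22, rho_33."""
    dt = T_MAX / steps
    psi = [1.0 + 0j, 0j, 0j]  # electron prepared in the first dot
    for n in range(steps):
        t = n * dt
        k1 = deriv(t, psi)
        k2 = deriv(t + dt / 2, [psi[i] + dt / 2 * k1[i] for i in range(3)])
        k3 = deriv(t + dt / 2, [psi[i] + dt / 2 * k2[i] for i in range(3)])
        k4 = deriv(t + dt, [psi[i] + dt * k3[i] for i in range(3)])
        psi = [psi[i] + dt / 6 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i])
               for i in range(3)]
    return [abs(a) ** 2 for a in psi]
```

With these parameters the evolution is well inside the adiabatic regime (Ω max σ = 25), so the final population of the third dot approaches unity while the intermediate occupation of the central dot stays small; swapping the two pulse centres (the "intuitive" order) is markedly less robust.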
Thanks to a robust definition of the reward function, which allows the agent to judge its own performance, the neural network evolves so as to obtain the best temporal evolution of the coupling parameters Ω i,i+1 , ensuring the desired evolution of the electron population across the device. The definition of the best reward function is certainly the most delicate choice in the whole model. The key features of the reward can be summarized by two properties: its generality, so it should not contain specific information on the characteristics of the two pulses, and its expression as a function of the desired final state, which in our case consists of maximizing ρ 33 at the end of the temporal evolution. The reward functions used in this research are fully accounted for in the S.I. The reward function used in most simulations is

r t = α (ρ 33 − ρ 22 − 1) − (e^{βρ 22} − 1),

with α, β > 0. The sum of the first three terms is non-positive at each time step, so the agent will try to bring it to 0 by minimizing ρ 22 and by maximizing ρ 33 . Subtracting the exponential term (i.e., punishing the electronic occupation of site 2) improves the convergence of the learning.
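A minimal implementation of such a per-step reward may look as follows; the values of α and β, and the exact offset of the exponential penalty, are illustrative assumptions rather than the constants used in the paper:

```python
import math

def reward(rho22, rho33, alpha=1.0, beta=3.0):
    """Sketch of the per-step reward described in the text.

    Both pieces are non-positive: the linear part vanishes only when
    rho33 = 1 and rho22 = 0, while the exponential part punishes any
    occupation of the central dot.  alpha and beta are assumed values.
    """
    linear_part = alpha * (rho33 - rho22 - 1.0)   # three non-positive terms
    exp_penalty = math.exp(beta * rho22) - 1.0    # 0 when the central dot is empty
    return linear_part - exp_penalty
```

With this form, a "perfect" trajectory (ρ 22 = 0 and ρ 33 = 1 at every step) accumulates exactly zero reward, matching the property noted in the S.I.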
Furthermore, in some specific cases, we stop the agent at an intermediate episode if ρ 33 is greater than an arbitrary threshold ρ th 33 for a certain number of time steps. This choice can help to find fast pulses that achieve high-fidelity quantum information transport, at the cost of a higher ρ max 22 with respect to the analytic pulses (more details are given in the S.I.). After about four times more training, the agent achieves a high-fidelity CTAP. Notice that the pulse sequence is reminiscent of the ansatz Gaussian counterintuitive pulse sequence, as the second gate acting on Ω 23 is operated first, but the shape of the two pulses differs. It is remarkable that the reward function implicitly asks to achieve the result as quickly as possible, resulting in a pulse sequence that is significantly faster than the analytical Gaussian case, and comparable to the recent proposal of Ref. [25]. The high point is that the agent achieves such results irrespective of the actual terms of the Hamiltonian contributing to the master equation.
Therefore, DRL can be applied straightforwardly to more complex cases, for which there is no generalization of the ansatz solutions found for the ideal case, and to which we devote the next section.

II. DEEP REINFORCEMENT LEARNING TO OVERCOME DISTURBANCES
We now turn our attention to the behavior of our learning strategy applied to the nonideal scenario in which realistic conditions, typical of semiconductor quantum dots, are considered. In particular, we discuss the results produced by DRL when the dot array is affected by detuning between the ground-state energies of the dots, dephasing, and losses. These three effects exist, to different degrees, in any practical attempt to implement CTAP of the electron spin in quantum dots: the first is typically due to manufacturing defects [26], while the last two emerge from the interaction of the dots with the surrounding environment [27,28], involving charge and spin fluctuations [29,30] and noise on the magnetic field [31].
Under such disturbances, neither analytical nor ansatz solutions are available. On the other hand, the robustness and generality of the reinforcement-learning approach can be exploited naturally since, from the point of view of the algorithm, the problem does not differ from the ideal case discussed above. As we could determine the pulse sequence only by adding a second hidden layer H2 to the neural network, we refer from now on to deep reinforcement learning.
Let us consider a system of N = 3 quantum dots with different energies E i , i = 1, 2, 3. We describe its evolution by a master equation in the Lindblad form

ρ̇ = −(i/ℏ)[H, ρ] + Σ k Γ k (A k ρ A k † − ½{A k † A k , ρ}),

where A k are the Lindblad operators, Γ k the associated rates, and {A, B} = AB + BA is the anticommutator.
We first considered each single dot affected by dephasing and assumed the dephasing rate Γ d to be equal for each dot. With A k = |k⟩⟨k|, k = 1, 2, 3, the master equation can be rewritten as

ρ̇ = −(i/ℏ)[H, ρ] + Γ d Σ k (|k⟩⟨k| ρ |k⟩⟨k| − ½{|k⟩⟨k|, ρ}).

We now consider the inclusion of a loss term in the master equation. The loss refers to the information carried by the electron, encoded in its spin. While the loss of an electron is a very rare event in quantum dots, requiring the sudden capture by a slow trap [32,33], the presence of any random magnetic field and spin fluctuations around the quantum dots can modify the state of the transferred spin. Losses are described by the operators A k = |0⟩⟨k|, k = 1, 2, 3, with rate Γ l , modeling the transfer of amplitude from the k-th dot to an auxiliary vacuum state |0⟩. Figure 3d shows the superior performance of DRL versus the analytic Gaussian pulses. Because of the reward function, DRL minimizes the transfer time so as to minimize the effect of losses. As a further generalization, we applied DRL to the passage through more than one intervening dot, an extension called the straddling CTAP scheme (SCTAP) [4,34]. To exemplify a generalization to odd N > 3, we set N = 5 according to the sketch depicted in Figure 4a; no disturbance is considered for simplicity.
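The dephasing and loss terms can be assembled into a single right-hand-side function for the master equation. The sketch below works on a four-level space (the three dots plus the auxiliary vacuum state |0⟩) with plain nested-list matrices; the rates and couplings are illustrative, not those used in the figures:

```python
N = 4  # basis ordering: |0> (vacuum), |1>, |2>, |3>

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

def dagger(a):
    return [[a[j][i].conjugate() for j in range(N)] for i in range(N)]

def ketbra(i, j):
    m = [[0j] * N for _ in range(N)]
    m[i][j] = 1.0 + 0j
    return m

def dissipator(a_op, rho, rate):
    """rate * (A rho A^dag - 1/2 {A^dag A, rho})."""
    ad = dagger(a_op)
    ada = mat_mul(ad, a_op)
    sandwich = mat_mul(mat_mul(a_op, rho), ad)
    anti1 = mat_mul(ada, rho)
    anti2 = mat_mul(rho, ada)
    return [[rate * (sandwich[i][j] - 0.5 * (anti1[i][j] + anti2[i][j]))
             for j in range(N)] for i in range(N)]

def rhs(h, rho, gamma_d, gamma_l):
    """rho_dot = -i[H, rho] + dephasing + loss terms (hbar = 1)."""
    hr = mat_mul(h, rho)
    rh = mat_mul(rho, h)
    out = [[-1j * (hr[i][j] - rh[i][j]) for j in range(N)] for i in range(N)]
    for k in (1, 2, 3):
        for d in (dissipator(ketbra(k, k), rho, gamma_d),   # dephasing on dot k
                  dissipator(ketbra(0, k), rho, gamma_l)):  # loss |0><k|
            for i in range(N):
                for j in range(N):
                    out[i][j] += d[i][j]
    return out
```

Since every term is in Lindblad form, the right-hand side is traceless, so the total probability (dots plus vacuum) is conserved even in the presence of losses.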

IV. CONCLUSIONS
Neural networks can discover control sequences of a tuneable quantum system whose time evolution starts far from its final equilibrium, without any prior knowledge. Contrary to the employment of special ansatz solutions, deep reinforcement learning discovers novel sequences of control operations to achieve a target state, regardless of possible deviations from the ideal conditions. The method does not depend on previously discovered ansatz solutions, and it applies even when none is known. As a practical quantum system known for its counterintuitive control, we adopted quantum state transport across arrays of quantum dots, including sources of disturbance such as energy level detuning, dephasing, and losses. In all cases, the solutions found by the agent outperform the known solution in terms of speed, fidelity, or both. Both the pre-training of the network and the 2TBN analysis contribute to speeding up the learning process by reducing the computational effort.
In general, we have shown that neural-network-based deep reinforcement learning provides a general tool for controlling physical systems, circumventing the issue of finding ansatz solutions when neither a straightforward method nor a brute-force approach is possible.

A. Master equation for Hamiltonian including pure dephasing
Pure dephasing [39] is the simplest model that accounts for the environmental interaction of an otherwise closed system. The corresponding Lindblad operators are diagonal, of the form A k = |k⟩⟨k| with rates γ k . In the case γ 1 = γ 2 = γ 3 = Γ, the master equation becomes

ρ̇ = −(i/ℏ)[H, ρ] + Γ (diag(ρ) − ρ),

where diag(ρ) denotes the diagonal part of ρ. The dephasing term gives no contribution to ρ̇ ii ; hence, if the Hamiltonian is time-independent and diagonal in the same basis, ρ̇ ii = 0 for every i, so the electronic population remains unchanged in the presence of pure dephasing.
This implies [40] that the energy of the system remains unchanged during the evolution, since the environment cannot change it.
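The population-preserving property can be verified directly: with equal rates, the diagonal-operator dissipator reduces to Γ(diag(ρ) − ρ), which decays the coherences while leaving every ρ ii untouched. A small numerical check under these assumptions (illustrative rate, no Hamiltonian term):

```python
GAMMA = 0.2  # illustrative dephasing rate

def dephasing_rhs(rho):
    """Gamma * (diag(rho) - rho): equal-rate pure dephasing, H = 0."""
    n = len(rho)
    return [[GAMMA * ((rho[i][j] if i == j else 0j) - rho[i][j])
             for j in range(n)] for i in range(n)]

def evolve(rho, t, steps=1000):
    """Simple Euler integration of rho_dot = dephasing_rhs(rho)."""
    dt = t / steps
    n = len(rho)
    for _ in range(steps):
        d = dephasing_rhs(rho)
        rho = [[rho[i][j] + dt * d[i][j] for j in range(n)] for i in range(n)]
    return rho
```

Under this map the off-diagonal elements decay as ρ jk (t) = ρ jk (0) e^{−Γt}, while the populations stay exactly constant.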

B. Neural network
The neural network consists of a multi-layered feed-forward network.

C. Reward

Among those tested, the most effective reward functions r t have, in general, two characteristics:
• A punishing term −(1 − ρ 33 + ρ 22 ): this subtractive term is 0 if ρ 33 = 1 and ρ 22 = 0, and is maximal in magnitude if ρ 33 = 0 and ρ 22 = 1, so the agent will try to minimize the difference between ρ 33 and 1 and the difference between ρ 22 and 0.
• A punishing term proportional to e βρ 22 , where β > 0. This term exponentially punishes increments of ρ 22 . Typical values of β are of the order of unity.
Note that both terms are non-positive, so a "perfect" simulation can have a maximum cumulative reward equal to 0. The two terms are also differentiable, which can help the agent converge to a global maximum, because the algorithm used in this work involves derivatives of the reward function.
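A forward pass through such a compact policy network can be sketched as follows; the layer sizes, the tanh activations, the random initialization, and the output scaling to [0, Ω max ] are illustrative assumptions, not the architecture of the paper:

```python
import math
import random

def init_layers(sizes, seed=0):
    """Random weights and zero biases for a feed-forward net, e.g.
    sizes = [3, 16, 16, 2]: an observation of 3 numbers, two hidden
    layers, and two outputs (the couplings Omega_12 and Omega_23)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        b = [0.0] * n_out
        layers.append((w, b))
    return layers

def forward(x, layers, omega_max=1.0):
    """tanh hidden layers; outputs squashed into [0, omega_max]."""
    for w, b in layers[:-1]:
        x = [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + bi)
             for row, bi in zip(w, b)]
    w, b = layers[-1]
    out = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]
    return [omega_max * 0.5 * (math.tanh(o) + 1.0) for o in out]
```

The policy network maps the current observation to the two couplings; the reinforcement-learning algorithm then updates the weights episode by episode.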
The reward functions used for each case are reported in the S.I. In the case of the SCTAP across an array of N = 5 quantum dots, the reward function used in Figure 4 of the main text is a straightforward generalization of the three-dot one, rewarding the occupation ρ 55 of the last dot at the end of the evolution. Despite its simplicity, the agent was able to keep ρ 22 = ρ 44 = 0 and ρ 33 close to zero during most of the population transfer.

D. Software
The QuTiP routine has been run in parallel by using GNU Parallel [42].

VII. DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.