Self-Correcting Quantum Many-Body Control using Reinforcement Learning with Tensor Networks

Quantum many-body control is a central milestone en route to harnessing quantum technologies. However, the exponential growth of the Hilbert space dimension with the number of qubits makes it challenging to classically simulate quantum many-body systems and, consequently, to devise reliable and robust optimal control protocols. Here, we present a novel framework for efficiently controlling quantum many-body systems based on reinforcement learning (RL). We tackle the quantum control problem by leveraging matrix product states (i) for representing the many-body state and (ii) as part of the trainable machine learning architecture for our RL agent, an approach we term QMPS. The framework is applied to prepare ground states of the quantum Ising chain, including states in the critical region. It allows us to control systems far larger than neural-network-only architectures permit, while retaining the advantages of deep learning algorithms, such as generalizability and trainable robustness to noise. In particular, we demonstrate that RL agents are capable of finding universal controls, of learning how to optimally steer previously unseen many-body states, and of adapting control protocols on-the-fly when the quantum dynamics is subject to stochastic perturbations. Furthermore, we map the QMPS framework to a hybrid quantum-classical algorithm that can be performed on noisy intermediate-scale quantum devices, and test it in the presence of experimentally relevant sources of noise.


I. INTRODUCTION
Quantum many-body control is an essential prerequisite for the reliable operation of modern quantum technologies which are based on harnessing quantum correlations. For example, quantum computing often involves high-fidelity state manipulation as a necessary component of most quantum algorithms [1,2]; in quantum simulation, the underlying AMO platforms require preparing the system in a desired state before its properties can be measured and studied [3][4][5]; and quantum metrology relies on the controlled engineering of (critical) states to maximize the sensitivity to physical parameters [6,7]. Controlling many-body systems can also be considered in its own right, as a numerical tool which offers insights into concepts such as quantum phases and phase transitions [8]. Moreover, it can reveal novel theoretical phenomena such as phase transitions in the control landscape [9], and bears a direct relation to our understanding of quantum complexity [10].
Compared to single- and few-particle physics, working in the quantum many-body domain introduces the formidable difficulty of dealing with an exponentially large Hilbert space. A specific manifestation is the accurate description and manipulation of quantum entanglement shared between many degrees of freedom. This poses a limitation for classical simulation methods, since memory and compute time resources scale exponentially with the system size.
Fortunately, there exists a powerful framework to simulate the physics of one-dimensional (1d) quantum many-body systems, based on matrix product states (MPS) [11][12][13][14]. MPS provide a compressed representation of many-body wave functions and allow for efficient computation with resources scaling only linearly in the system size for area-law entangled states [15,16].
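The elementary step behind MPS compression is a truncated singular value decomposition across a bipartition of the chain. The following minimal sketch (plain dense state vectors rather than an actual MPS library; all names are ours) illustrates how keeping only the largest Schmidt values compresses a state, and why area-law states such as GHZ survive with a tiny bond dimension:

```python
import numpy as np

def schmidt_truncate(psi, n_left, chi):
    """Compress an n-qubit state across one bipartition by keeping the
    chi largest Schmidt values -- the elementary step behind MPS
    compression (a full MPS applies this across every cut)."""
    M = psi.reshape(2 ** n_left, -1)
    U, s, Vh = np.linalg.svd(M, full_matrices=False)
    discarded = float(np.sum(s[chi:] ** 2))        # truncation error
    psi_t = (U[:, :chi] * s[:chi]) @ Vh[:chi]
    psi_t = psi_t.reshape(-1)
    psi_t /= np.linalg.norm(psi_t)                 # renormalize
    return psi_t, discarded

# A 4-qubit GHZ state has Schmidt rank 2 across the middle cut:
ghz = np.zeros(16)
ghz[0] = ghz[15] = 1 / np.sqrt(2)
_, err_chi2 = schmidt_truncate(ghz, 2, chi=2)      # rank-2 cut: no error
_, err_chi1 = schmidt_truncate(ghz, 2, chi=1)      # half the weight discarded
```

For the GHZ state, bond dimension 2 is lossless while bond dimension 1 discards half of the state's weight; volume-law states, by contrast, need a bond dimension growing exponentially with system size.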
In this work, we develop a new deep RL framework for quantum many-body control, based on MPS in two complementary ways. First, we adopt the MPS description of quantum states: this allows us to control large interacting 1d systems, whose quantum dynamics we simulate within the RL environment. Second, representing the RL state in the form of an MPS naturally suggests the use of tensor networks as (part of) the deep learning architecture for the RL agent, e.g., instead of a conventional neural network (NN) ansatz. Therefore, inspired by earlier examples of tensor-network-based machine learning [58][59][60], we approximate the RL agent as a hybrid MPS-NN network, called QMPS. With these innovations at hand, the required computational resources scale linearly with the system size, in contrast to learning from the full many-body wave function. Ultimately, this allows us to train an RL agent to control a larger number of interacting quantum particles, as required by present-day quantum simulators.

Figure 1. Many-body control studies in the ground state phase diagram of the quantum Ising model, analyzed in this work: an RL agent is trained to prepare a ground state of the transverse field Ising model from random initial states (Ctrl Study A, magenta), the z-polarized product state from a class of paramagnetic ground states (Ctrl Study B, green), and a ground state in the critical region of the mixed field Ising model from paramagnetic ground states of opposite interaction strength (Ctrl Study C, cyan). The optimized agent outputs a control protocol as a sequence of operators Â_j, which time evolve the initial spin state into the desired target state (marked by a star).
As a concrete example, we consider the problem of state preparation and present three case studies in which we prepare different ground states of the paradigmatic mixed field Ising chain [Fig. 1]. We train QMPS agents to prepare target states from a class of initial (ground) states, and devise universal controls with respect to experimentally relevant sets of initial states. In contrast to conventional quantum control algorithms (such as CRAB or GRAPE [17,18,61]), once the optimization is complete, RL agents retain information gathered during training in the form of a policy or a value function. When enhanced with a deep learning architecture, the learned control policy generalizes to states not seen during training. We demonstrate how this singular feature of deep RL allows our agents to efficiently control quantum Ising chains (i) starting from various initial states that the RL agent has never encountered, and (ii) in the presence of faulty or noisy controls and stochastic dynamics. Thus, even in analytically intractable many-body regimes, an online RL agent produces particularly robust control protocols.

II. QUANTUM MANY-BODY CONTROL
Consider a quantum many-body system in the initial state |ψ_i⟩. Our objective is to find optimal protocols that evolve the system into a desired target state |ψ*⟩. We construct these protocols as a sequence of q consecutive unitary operators U(τ) = ∏_{j=1}^{q} U_{τ_j}, where U_{τ_j} ∈ A are chosen from a set A. To assess the quality of a given protocol, we compute the fidelity of the evolved state w.r.t. the target state, F = |⟨ψ*| U(τ) |ψ_i⟩|².
Throughout the study, we focus on spin-1/2 chains of size N with open boundary conditions. The system on lattice site j is described using the Pauli matrices X_j, Y_j, Z_j. As initial and target states we select area-law states, e.g., ground states of the quantum Ising model [Sec. III]. In order to control chains composed of many interacting spins, we obtain the target ground state using DMRG [11,13], and represent the quantum state as an MPS throughout the entire time evolution [Supplemental Material, Sec. S.2 A].
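The fidelity bookkeeping above can be sketched on a single qubit (the paper works with N-spin MPS; this toy dense-vector version, with generators and step sizes chosen purely for illustration, only shows how a protocol is scored):

```python
import numpy as np

# Pauli matrices
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def gate(A, dt):
    """U = exp(-i * dt * A) for a Hermitian generator A."""
    vals, vecs = np.linalg.eigh(A)
    return vecs @ np.diag(np.exp(-1j * dt * vals)) @ vecs.conj().T

def fidelity(target, psi):
    """F = |<target|psi>|^2."""
    return abs(np.vdot(target, psi)) ** 2

psi_i = np.array([1, 0], dtype=complex)                  # initial state |0>
psi_star = np.array([1, 1], dtype=complex) / np.sqrt(2)  # target state |+>

# A protocol U(tau) = U_2 U_1 as a sequence of fixed-angle unitaries:
protocol = [gate(Z, np.pi / 8), gate(X, np.pi / 4)]
psi = psi_i
for U in protocol:
    psi = U @ psi
F = fidelity(psi_star, psi)
```

Each candidate protocol is scored by the final fidelity F; the RL agent's job is to pick the sequence maximizing it.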
We choose a set of experimentally relevant control unitaries A which contains uniform nearest-neighbor spin-spin interactions and global rotations, i.e., unitaries of the form exp(±iδt_± Â_j), where the generators Â_j are global sums of single-site Pauli operators or of nearest-neighbor two-site products. Two-qubit unitaries are capable of controlling entanglement in the state. Note that MPS-based time evolution is particularly efficient for such locally applied operators, and the resulting protocols can be considered as a series of quantum gates. The time duration (or angle) δt_± of all unitary operators is fixed, slightly different in magnitude for positive and negative generators Â_j, and kept constant throughout the time evolution. Hence, the problem of finding an optimal sequence reduces to a discrete combinatorial optimization in the exponentially large space of all possible sequences: for a fixed sequence length q, the number of distinct sequences is |A|^q; therefore, a brute-force search quickly becomes infeasible, and more sophisticated algorithms, such as RL, are needed. By fixing both q and δt_± prior to the optimization, we may in general not be able to come arbitrarily close to the target state, but these constraints can easily be relaxed.
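The |A|^q growth of the search space can be made concrete with a toy single-qubit exhaustive search (generators, step sizes, and states here are illustrative stand-ins, not the paper's action set):

```python
import itertools
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def gate(A, dt):
    """U = exp(-i * dt * A)."""
    vals, vecs = np.linalg.eigh(A)
    return vecs @ np.diag(np.exp(-1j * dt * vals)) @ vecs.conj().T

# Fixed, asymmetric step sizes for +/- generators (values are illustrative)
dt_plus, dt_minus = np.pi / 8, np.pi / 13
actions = [gate(G, s) for G in (X, Y, Z) for s in (dt_plus, -dt_minus)]

q = 3                                            # protocol length
psi_i = np.array([1, 0], dtype=complex)
psi_star = np.array([1, 1], dtype=complex) / np.sqrt(2)

def run(seq):
    psi = psi_i
    for U in seq:
        psi = U @ psi
    return abs(np.vdot(psi_star, psi)) ** 2

# Brute force: |A|**q candidate sequences -- already 6**3 = 216 here,
# and utterly infeasible for the q ~ 50 protocols used in the paper.
sequences = list(itertools.product(actions, repeat=q))
best_F = max(run(seq) for seq in sequences)
```

Even this toy case enumerates 216 sequences; at q = 50 the count exceeds 10^38, which is why RL replaces exhaustive search.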

III. STATE-INFORMED MANY-BODY CONTROL
Our MPS-based RL framework is specifically designed for preparing low-entangled states in 1d, such as ground states of local gapped Hamiltonians. Hence, in the subsequent case studies we consider ground states of the 1d mixed field Ising model as an exemplary system,

H = J ∑_{j=1}^{N-1} Z_j Z_{j+1} + g_x ∑_{j=1}^{N} X_j + g_z ∑_{j=1}^{N} Z_j,

where g_x (g_z) denotes a transverse (longitudinal) field. In the case of negative interaction strength J < 0 and in the absence of a longitudinal field, g_z = 0, the system is integrable, and has a critical point at g_x = 1 in the thermodynamic limit, separating a paramagnetic (PM) from a ferromagnetic (FM) phase [Fig. 1]. For g_z > 0, the model has no known closed-form expressions for its eigenstates and eigenenergies. In addition, for positive interactions, the phase diagram features a critical line from (g_x, g_z) = (1, 0) to (g_x, g_z) = (0, 2), exhibiting a transition from a paramagnetic to an antiferromagnetic (AFM) phase [see Supplemental Material, Sec. S.1 for a brief introduction to quantum many-body physics and phase transitions].
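For small chains, this Hamiltonian can be built as a dense matrix and diagonalized directly; a sketch follows (the sign and normalization conventions are our assumption, and `embed`/`ising_hamiltonian` are our own helper names, not the paper's code):

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]], dtype=float)
Z = np.array([[1, 0], [0, -1]], dtype=float)

def embed(ops, N):
    """Tensor product acting with ops[j] on site j and identity elsewhere."""
    return reduce(np.kron, [ops.get(j, I2) for j in range(N)])

def ising_hamiltonian(N, J, gx, gz):
    """Mixed field Ising chain with open boundaries:
    H = J * sum Z_j Z_{j+1} + gx * sum X_j + gz * sum Z_j."""
    H = sum(J * embed({j: Z, j + 1: Z}, N) for j in range(N - 1))
    H = H + sum(gx * embed({j: X}, N) + gz * embed({j: Z}, N)
                for j in range(N))
    return H

# Target of Ctrl Study A: transverse field Ising model at the critical point
H = ising_hamiltonian(N=4, J=-1.0, gx=1.0, gz=0.0)
energies, states = np.linalg.eigh(H)
ground_state = states[:, 0]
```

For the many-body regimes of Secs. III B and III C (N = 16, 32), this dense construction is replaced by DMRG on an MPS, as described above.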
In the rest of this section we analyze three different control scenarios involving ground states of the mixed field Ising model. In Sec. III A we consider the problem of universal state preparation for N = 4 spins and train a QMPS agent to prepare a specific target ground state starting from arbitrary initial quantum states; this example serves as a first benchmark of the QMPS framework. In Sec. III B we use the QMPS algorithm to prepare a spin-polarized state starting from a class of paramagnetic ground states, which shows that our approach produces reliable protocols in the many-body regime. Finally, in Sec. III C we consider N = 16 spins and target a state in the critical region of the mixed field Ising model, demonstrating that the QMPS framework can also be employed for highly non-trivial control tasks such as critical state preparation. Furthermore, we show that the obtained QMPS agent has the ability to self-correct protocols in the presence of noisy time evolution.
A. Universal ground state preparation from arbitrary initial quantum states for N = 4 spins

In the noninteracting limit, J = 0, the QMPS agent readily learns how to control a large number of spins [Supplemental Material, Sec. S.3 A 1]. Instead, as a nontrivial benchmark of the QMPS framework, here we teach an agent to prepare the ground state of the 4-spin transverse field Ising model at (J = −1, g_x = 1, g_z = 0), starting from randomly drawn initial states. While this control setup can be solved using the full wave function and a conventional neural network ansatz [see Supplemental Material, Sec. S.3 A 2], the uniform initial state distribution over the entire continuous Hilbert space creates a highly non-trivial learning problem and presents a first benchmark for our QMPS framework. Moreover, system sizes of N ∼ 4 spins already fall within the relevant regime of most present-day studies using quantum computers, where gate errors and decoherence currently prevent exact simulations at larger scales [2,62,63].
We first train an agent (QMPS-1) to prepare the target ground state within 50 protocol steps or less, setting a many-body fidelity threshold of F* ≈ 0.85. The initial states during training are chosen to be (with probability p = 0.25) random polarized product states, or (with probability p = 0.75) random reflection-symmetric states drawn from the full 2^4 = 16-dimensional Hilbert space [64]. In this way, the QMPS-1 agent has to learn both to disentangle highly entangled states to prepare the Ising ground state, and to appropriately entangle product states to reach the entangled target [the learning curves of the QMPS-1 agent are shown in Supplemental Material, Sec. S.3 A]. After this training stage, we test the QMPS-1 agent on a set of 10^3 random initial states and find that in ∼ 99.8% of the cases the fidelity threshold is successfully reached within the 50 allowed steps. A (much) better fidelity cannot be achieved by the QMPS-1 agent alone, due to the discreteness of the action space and the constant step size used, rather than limitations intrinsic to the algorithm. Note that, when following a conventional approach of training a neural network directly on the quantum wave function, we were not able to match the performance of the QMPS agent given the same number of parameters and training episodes [see Supplemental Material, Sec. S.3 A 2 for details]. This suggests that the QMPS architecture has a more natural structure for extracting relevant features from quantum state data and can already be advantageous for small system sizes.
To improve the fidelity between the final and the target state, we now train a second, independent agent (QMPS-2) with a tighter many-body fidelity threshold of F* ≈ 0.97. The initial states are again sampled randomly as described above; however, we first use the already optimized QMPS-1 agent to reach the vicinity of the target state within F > 0.85, and then take the resulting states as initial states for the training of the QMPS-2 agent.
This two-stage learning schedule can in principle be continued to increase the fidelity threshold even further. The learning curves of the QMPS-2 optimization are shown in Fig. 2(a). In Fig. 2(b)-(e) we present the obtained protocols for four exemplary initial states. Overall, the combined two-agent QMPS is able to reach the fidelity threshold of F* ∼ 0.97 for ∼ 93% of the randomly drawn initial states within the 50 episode steps that were imposed during training. We emphasize that this result is already nontrivial, given the restricted discrete action space and the arbitrariness of the initial state.

Figure 2. Universal four-qubit control. (a) Achieved many-body fidelity F̄ between final and target state during training, averaged over 100 training episodes (dark-blue curve). The best and worst fidelity within each episode window is indicated by the light-blue-shaded area. The fidelity threshold, F* ∼ 0.97, is marked by a gray dashed line. The inset shows the mean number of episode steps T̄ during training (averaged over 100 episodes). The maximum number of allowed steps is set to 50. (b)-(e) Two QMPS agents [see text] are trained with fidelity thresholds F* ∼ 0.85, 0.97 (gray dashed lines) to prepare the Ising ground state (J = −1, g_x = 1, g_z = 0), starting from (b) the z-polarized product state, (c) the GHZ state, (d) an Ising antiferromagnetic ground state at (J = +1, g_x = g_z = 0.1), and (e) a random state. The QMPS-2 agent starts from the final state reached by the QMPS-1 agent (purple shaded area). The many-body fidelity F between the instantaneous and the target state is shown in the lower part of each panel. The upper part of each panel shows the control protocol. The colors and shading of each rectangle indicate the applied action Â (see legend); ± stands for the sign of the action generator, i.e., exp(±iδt_± Â). The QMPS-1 and QMPS-2 agents use fixed time steps of δt_+ = π/8, π/16 and δt_− = π/13, π/21, respectively, indicated by action rectangles of different sizes in the protocol. N = 4 spins.
Let us now exhibit two major advantages of RL over conventional quantum control algorithms. (i) After training, we can double the allowed episode length for each agent to 100 steps. Since this allows for longer protocols, we find that the target state can be successfully prepared for 99.5% of the initial states (compared to the previously observed 93%). Note that this feature is a unique advantage of (deep) RL methods, where the policy depends explicitly on the quantum state: during training, the agent learns how to take optimal actions starting from any quantum state and hence, it is able to prepare the target state if it is given sufficient time. Moreover, (ii) in this example we achieve universal quantum state preparation, i.e., the trained RL agent succeeds in preparing the target state irrespective of the initial state. This is not possible with conventional control techniques, where the optimal protocol is usually tailored to a specific initial state, and the optimization has to be rerun when starting from a different state. In Sec. IV, we show how the trained QMPS architecture can be implemented to apply the RL agent on NISQ devices.

In general, a (Haar) random quantum many-body state is volume-law entangled and, hence, cannot be approximated by an MPS of fixed bond dimension. Moreover, it becomes increasingly difficult to disentangle an arbitrary highly entangled state for larger system sizes [65]. Therefore, when working in the truly quantum many-body regime, we have to restrict to initial and target states that are not volume-law entangled.
As an example, here we consider a many-body system of N = 32 spins and learn to prepare the z-polarized state from a class of transverse field Ising ground states (J = −1, g_z = 0). Once high-fidelity protocols are found, they can be inverted to prepare any such Ising ground state from the z-polarized state, which presents a relevant experimental situation [Supplemental Material, Sec. S.3 B]. Many-body ground state preparation is a prerequisite for both analog and digital quantum simulation, and enables the study of a variety of many-body phenomena, such as the properties of equilibrium and nonequilibrium quantum phases and phase transitions.
To train the agent, we randomly sample initial ground states on the paramagnetic side of the critical point: 1.0 < g_x < 1.1. The difficulty of this state preparation task is determined by the parameter g_x defining the initial state: states deeper in the paramagnetic phase are easier to 'rotate' into the product target state, while states close to the critical regime require the agent to learn how to fully disentangle the initial state in order to reach the target. We train a QMPS agent on a system of N = 32 spins, which is infeasible to simulate using the full wave function, and is far out of reach for neural-network-based approaches. We set the single-particle fidelity threshold [cf. Sec. V] to F*_sp = (F*)^{1/N} = 0.99 (corresponding to a many-body fidelity of F* ∼ 0.72) and allow at most 50 steps per protocol. Figure 3(b) shows the successfully reached final fidelity when the trained QMPS agent is tested on unseen initial states, for various values of g_x. First, notice that the agent is able to prepare the target state also for initial states with g_x > 1.1 that lie outside of the training region (dashed vertical lines). Hence, we are able to extrapolate optimal control protocols well beyond the training data distribution, without additional training. Similar generalization capabilities have already been demonstrated for supervised learning tasks such as Hamiltonian parameter estimation [66]. However, this is not true for states inside the critical region, g_x ≈ 1, or in the ferromagnetic phase (g_x < 1); such behavior is not surprising, since these many-body states have very different properties compared to those used for training. Note that, in contrast to the previous control study of Sec. III A, the initial training data states are not i.i.d. over the full 2^32-dimensional Hilbert space, as we only train on PM ground states of the Ising model. Therefore, the agent cannot be expected to generalize to arbitrary initial states in this case.
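The relation between the single-particle and many-body fidelity thresholds is a simple N-th root/power, which reproduces the quoted numbers:

```python
# Single-particle vs. many-body fidelity threshold:
# F*_sp = (F*)^(1/N), equivalently F* = (F*_sp)^N.
N = 32
F_sp = 0.99            # single-particle fidelity threshold
F_many = F_sp ** N     # corresponding many-body threshold, ~ 0.72
```

This rescaling makes thresholds comparable across system sizes: a fixed many-body threshold would become exponentially harder to satisfy as N grows.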
Interestingly, it follows that the onset of criticality can be detected in the structure of control protocols, as the number of required gates (actions) and, in particular, of entangling unitaries, increases rapidly as one approaches the critical point [Fig. 3(c)].
Discontinuities in the achieved fidelity [Fig. 3(b)] arise due to the fixed, constant step size δt_±: we observe distinct jumps in the final fidelity whenever the length of the protocol sequence increases. This is a primary consequence of the discrete control action space. Its physical origin can be traced back to the need for a more frequent use of disentangling two-site unitaries for initial states approaching the critical region. Figure 3(a) shows the optimal protocol at g_x = 1.01: first, the agent concatenates three Ŷ-rotations (δt_+ = π/12) into a global gate, which shows that it has learned the orientations of the initial x-paramagnet and the z-polarized target [yellow shaded region]. This is succeeded by a non-trivial sequence containing two-body operators. A closer inspection [Fig. S10 in Supplemental Material, Sec. S.3 B and Video 1] reveals that the agent discovered a generalization of Euler-angle rotations in the multi-qubit Hilbert space [blue shaded region]. This is remarkable, since it points to the ability of the agent to construct compound rotations, which is a highly non-trivial combinatorial problem for experimentally relevant constrained action spaces. It can be interpreted as a generalization of the dynamical decoupling sequences introduced in state-of-the-art NMR experiments and used nowadays in quantum simulation, optimal quantum sensing, and to protect quantum coherence [67][68][69]. We verified that this protocol is a local minimum of the control landscape.
We also investigated the system size dependence of optimal protocols in this control study. To our surprise, we find that agents trained on the N = 32 spin system produce optimal protocols that perform reasonably well on smaller (N = 8) as well as larger (N = 64) systems. Hence, this control problem admits a certain degree of transferability, which worsens for initial states closer to the finite-size dominated critical region [Supplemental Material, Sec. S.3 B].
The MPS-based control framework enables us to readily analyze the physical entanglement growth during training, via the bond dimension χ_ψ of the quantum state. The protocol exploration mechanism in QMPS causes the agent to act mostly randomly during the initial stages of learning. This translates to random sequences of unitary gates that can lead to an increase of quantum entanglement [Fig. 3(c), inset]. In our simulations, we set the maximum allowed bond dimension to χ_ψ = 16, which is sufficient for the considered initial and target states to be approximated reliably. However, not all states encountered can be represented with such a small bond dimension, as reflected by large truncation errors during training [Supplemental Material, Sec. S.3 B]. Nonetheless, as training progresses, the agent learns to take actions that do not create excessive entanglement [Fig. 3(c)]. Therefore, the truncation error naturally decreases as training nears completion. As a consequence, the final converged protocols visit states that lie within a manifold of low-entangled states. Moreover, increasing χ_ψ does not change these observations. We believe that this mechanism relies on the area-law nature of the initial and target states, and we expect it to open the door towards future control studies deeper in the genuine many-body regime.

States in the critical region possess non-trivial correlations and show strong system-size dependence, which makes manipulating them highly non-trivial. In particular, the time duration required to adiabatically prepare critical states diverges with the number of particles, whereas sweeping through critical points reveals properties of their universality classes [70]. Therefore, finding optimal control strategies away from the adiabatic limit is an important challenge.
Critical state preparation is also of practical relevance for modern quantum metrology, where the enhanced sensitivity of critical states to external fields is leveraged to perform more precise measurements [6].
Our final objective is to prepare a ground state in the critical region of the mixed field Ising chain (J = +1, g_x = 0.5, g_z = 1.5), starting from non-critical paramagnetic ground states of the same model with flipped interaction strength: J = −1, 1.0 < g_x < 1.5, 0 < g_z < 0.5 [Fig. 1]. Hence, the agent has to learn to connect ground states of two distinct Hamiltonians. This scenario is often relevant in typical experimental setups where only a single sign of the interaction strength can be realized: e.g., the initial state comes from the J < 0 Ising model, while the ground state of interest belongs to the antiferromagnetic Ising Hamiltonian. In general, one can use two completely distinct parent Hamiltonians for the initial and target states, one of which is inaccessible on the quantum simulator platform at hand, while the other is the object of interest.
We train our QMPS agent on N = 16 spins with a single-particle fidelity threshold of F*_sp = 0.97 (F* ∼ 0.61), and a maximum episode length of 50. Figure 4(a) shows the achieved fidelity between the target state and the final state, for different initial ground states corresponding to a rectangular region in the (g_x, g_z)-plane. Notice that the agent is able to generalize to unseen initial states lying far outside the training region (white rectangle), and fails only close to the critical point of the transverse field Ising model (g_x = 1, g_z = 0) and for a few isolated initial states well outside of the training region.
We now demonstrate that our QMPS agent shows remarkable generalization capabilities in noisy environments.
In particular, we analyze how robust the trained QMPS agent is to stochastic perturbations in the time evolution of the state, a common problem on noisy intermediate-scale quantum (NISQ) computing devices [71]. In what follows, we consider two different sources of noise independently: (i) At each time step, with probability ε, a random action rather than the selected one is enforced. This type of noise mimics bit- or phase-flip errors, which occur in quantum computing.
(ii) Gaussian random noise with zero mean and standard deviation σ is added to the time duration δt_± of each unitary operator; this can, for instance, result from imperfect controls in the experimental platform.
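The two noise models can be sketched as a simple perturbation of a protocol, given as a list of action indices and time steps (a toy stand-in for the actual environment dynamics; the function name and the concrete numbers are ours):

```python
import numpy as np

def apply_control_noise(actions, dts, eps, sigma, n_actions, rng):
    """Toy model of the two noise sources:
    (i)  with probability eps, an action is replaced by a random one;
    (ii) Gaussian noise (mean 0, std sigma) is added to each time step."""
    noisy_actions = [int(rng.integers(n_actions)) if rng.random() < eps else a
                     for a in actions]
    noisy_dts = [dt + (rng.normal(0.0, sigma) if sigma > 0 else 0.0)
                 for dt in dts]
    return noisy_actions, noisy_dts

rng = np.random.default_rng(7)
actions = [0, 3, 1, 5, 2]              # illustrative 5-step protocol
dts = [np.pi / 8] * 5
a_eps, dt_eps = apply_control_noise(actions, dts, eps=0.02, sigma=0.0,
                                    n_actions=6, rng=rng)
```

In the experiments below, the agent re-selects its next action from the actually realized (noisy) quantum state at every step, which is what enables the self-correction discussed next.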
Noise type (i) is equivalent to using an ε-greedy policy. Hence, the states encountered when acting with such a policy could (in principle) have been visited during training. Due to the generalization capabilities of RL, it is reasonable to expect that an agent will act optimally after non-optimal actions have occurred, attempting to correct the 'mistake'. In Fig. 4(c)-(d) we show the achieved final fidelity [top] and the required number of steps [bottom] for ε = 0.02. Overall, the fidelity threshold can still be reached in the majority of test cases. The randomness typically results in longer protocols, indicating that the agent indeed adapts to the new states encountered. Interestingly, in the noise-free case the agent fails to prepare the target state for a few points outside the training region [orange points in Fig. 4(a)]; this can be attributed to incorrectly estimated Q-values, which have not fully converged to the optimal ones outside of the training interval. However, when adding the perturbation, the agent is able to correct its mistake in one of the shown instances and prepares the target state successfully [Fig. 4(c)].
Recall that we use a different time step δt_± for positive and negative actions. This way the agent is not just able to undo a non-optimal action by performing its inverse; it rather has to adjust the entire sequence of incoming unitaries in a non-trivial way. The ability of the QMPS agent to adapt is demonstrated in Fig. 4(g) where we plot the fidelity during time evolution starting from an arbitrary initial ground state. At time step t = 5, we perturb the protocol by taking 6 different actions, and let the agent act according to the trained policy afterwards; this results in 6 distinct protocol branches. In each of them, the agent tries to maximize the fidelity and successfully reaches the fidelity threshold after a few extra protocol steps [see also Video 2]. In Supplemental Material, Sec. S.3 C we provide further examples showing that this behavior is generic, and can also be observed for different initial states.
In contrast to the ε-noise, adding Gaussian random noise σ, (ii), to the time step duration δt_± results in states that the agent has not seen during training. This source of noise therefore explicitly tests the ability of the agent to generalize beyond the accessible state space, and in particular to interpolate between quantum many-body states. Fig. 4(e)-(f) displays the achieved fidelity and the corresponding protocol length for σ = 0.01. We find that the QMPS agent is also robust to this type of noise. In Fig. 4(h) we plot the fidelity trajectories starting from the same initial state using 5 different random seeds; this illustrates that our agent adapts successfully to previously unencountered many-body states, and steers the protocol on-line to reach beyond the fidelity threshold [see Video 3].
The robustness of QMPS agents to noise and, in general, to stochasticity in the quantum gates, demonstrates yet another advantage of deep RL methods over conventional quantum control techniques. The latter typically perform suboptimally in noisy systems, since the optimization does not take into account the quantum state information during the time evolution, and the optimal protocols are specifically optimized for a fixed trajectory of quantum states [27]. By contrast, QMPS value functions are optimized on a large class of states and, as shown above, can interpolate and extrapolate to new states not encountered during training, as long as the deep learning approximation stays sufficiently accurate. Therefore, unlike conventional quantum control algorithms, QMPS agents have the ability to automatically self-correct their protocols on-the-fly, i.e., while the system is being time evolved.

IV. IMPLEMENTATION ON NISQ DEVICES
The present QMPS framework requires the quantum state to be accessible at each time step for both training and inference purposes; yet, quantum states are not observable in experiments without performing expensive quantum state tomography. On the other hand, MPS-based quantum state tomography presents a possible and efficient way of retrieving the quantum state in a form that can be straightforwardly integrated into the QMPS framework [72][73][74]. Alternatively, there already exist efficient encoding strategies that map MPS into quantum circuits [75][76][77][78][79][80][81]. Moreover, several proposals were recently developed in which MPS are harnessed for quantum machine learning tasks, for example as part of hybrid classical-quantum algorithms [82][83][84] or as classical pre-training methods [85,86].
Similar ideas can be applied to the QMPS architecture by mapping the trainable MPS to a parametrized quantum circuit, thus directly integrating the QMPS framework into quantum computations with NISQ devices and, hence, eliminating the need for quantum state tomography. This allows us to perform the expensive training routine on readily available classical computers, while the inexpensive inference step can be performed on quantum hardware.

Figure 5. QMPS circuit framework as a hybrid quantum-classical algorithm. On the quantum device, we first prepare the initial state and apply the already inferred protocol actions as gates. In the example above, the initial state is the fully z-polarized state, i.e., |0⟩^⊗4, and two actions are performed: a global rotation around X̂ followed by a global two-qubit ŶŶ rotation. The resulting state |ψ⟩ represents the input to the QMPS network. The QMPS tensors θ_Q = A^[1] ··· A^[N] can be mapped to unitary gates on a quantum circuit. To compute the Q-values, we first apply the inverse of the QMPS circuit unitary U†_θ and measure the output in the computational basis. The fraction of all-zero measurement outcomes is an approximation to the fidelity |⟨θ_Q|ψ⟩|². Note that this denotes the fidelity with respect to the Q-value network state |θ_Q⟩, and not the target quantum state, which is not required during protocol inference. The fidelity estimates are then fed into the neural network on a classical computer. From the resulting Q-values we can infer the next action and repeat these steps until the target state is reached.
The mapping of the QMPS to a quantum circuit is described in detail in Supplemental Material, Sec. S.4 A. Figure 5 shows the resulting QMPS circuit framework for the case of N = 4 spins/qubits in which the original QMPS state |θ_Q⟩ is represented as unitary gates (purple boxes). To calculate the Q-values Q_θ(ψ, a) given an input quantum state |ψ⟩, we first compute the fidelity between the input and the QMPS state, which can be obtained via sampling on a quantum computer. Alternatively, the overlap can also be accessed by performing a SWAP test, albeit requiring additional ancilla qubits and non-local gates [87,88]. The computed fidelities for each QMPS circuit are then fed into the classical neural network, giving rise to a hybrid quantum-classical machine learning architecture as shown in Fig. 5. If necessary, the parameters of the QMPS circuit U_θ can be fine-tuned by performing some additional optimization.
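The all-zero-outcome fidelity estimator can be sketched with a minimal statevector simulation (the random unitary below stands in for the compiled QMPS circuit U_θ, and all names and sizes are illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(seed=7)
n_qubits, shots = 4, 4096
dim = 2 ** n_qubits

# Random input state |psi> (stand-in for the state after the protocol gates).
psi = rng.normal(size=dim) + 1j * rng.normal(size=dim)
psi /= np.linalg.norm(psi)

# Random unitary U_theta (stand-in for the QMPS circuit); |theta> = U_theta |0...0>.
u_theta, _ = np.linalg.qr(rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim)))

# Apply U_theta^dagger and sample measurement outcomes in the computational basis.
amplitudes = u_theta.conj().T @ psi
probs = np.abs(amplitudes) ** 2
outcomes = rng.choice(dim, size=shots, p=probs)

# The fraction of all-zero outcomes (index 0) estimates the fidelity |<theta|psi>|^2.
fidelity_estimate = np.mean(outcomes == 0)
fidelity_exact = probs[0]
```

With 4096 shots the sampling error of such a Bernoulli estimator is at the percent level, consistent with the shot numbers quoted in the text.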
We test the QMPS circuit framework on the universal ground state preparation task of Section III A. In what follows we only report results for the QMPS-1 agent trained on a fidelity threshold of F * ∼ 0.85; the generalization to include the QMPS-2 agent is straightforward. We translate the optimized QMPS to the corresponding quantum circuit and investigate the effects of noise in the quantum computation on the QMPS framework.
To simulate incoherent errors, we consider a depolarizing noise channel E and apply it after each action and QMPS gate,

E(ρ) = (1 − λ) ρ + λ I/2^k,    (5)

where ρ denotes the quantum state density matrix, k is the number of qubits the gate acts on, and λ is the depolarizing noise parameter, which is set to λ_1 = 10^−4 for all single-qubit gates. We plot the success rate as a function of the two- and three-qubit gate errors λ_{2/3} for 1000 randomly sampled initial states in Fig. 6(a) (purple line). For error rates λ_{2/3} < 10^−3 the QMPS agent is able to successfully prepare the target state in almost all runs. However, the performance deteriorates with increasing error parameter λ_{2/3}. Let us note that we have used the same error rate for both two- and three-qubit gates. On a physical quantum device, the three-qubit gate will be decomposed into a sequence of two-qubit gates and hence the introduced noise will be amplified. However, the decomposition, the resulting circuit depth, and the sources of noise vary with the hardware type [89]. Thus, we chose the simplified noise model of Eq. (5) in order to study the QMPS circuits in a hardware-agnostic way. We also report the results obtained when truncating the bond dimension χ = 4 QMPS to a χ = 2 QMPS, which gives rise to at most two-qubit gates in the final circuit. In this case the success probabilities [orange dash-dotted line in Fig. 6(a)] do not reach 100% even for small error rates. This indicates that a bond dimension of χ = 4 is indeed required to faithfully represent the QMPS state.

Figure 6. Universal four-qubit control. (a) We sample 1000 random initial states and apply the QMPS circuit framework in the presence of depolarizing noise with error parameter λ_{2/3} for all two- and three-qubit gates. Note that the noise parameter for all single-qubit gates is always fixed to λ_1 = 10^−4. We set the number of measurement shots to 4096 and plot the percentage of runs in which the target state is successfully reached against the noise strength. The success rate under exact, noise-free computation (without sampling) is shown as a black dashed line. The success probability when acting completely at random is always zero. We provide both the results for the full χ = 4 QMPS circuit (purple solid line) and the truncated χ = 2 QMPS (orange dash-dotted line). Insets: the corresponding average number of protocol steps T as a function of the noise strength λ. The standard deviation is indicated by the shaded areas. (b) Same as in (a) except that we start each protocol from a fixed initial state: the fully z-polarized state (green solid line), the GHZ state (blue dashed line), and a ground state of the mixed-field Ising model at J = +1, g_x = g_z = 0.1 (red dash-dotted line). To compute the success rates and the average protocol length we average over 500 different runs. Note that the x-axis scale is shifted by one order of magnitude compared to (a).
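A depolarizing channel of this kind can be sketched in a few lines (a generic textbook definition, not the paper's simulation code; the example state and λ value are illustrative):

```python
import numpy as np

def depolarize(rho: np.ndarray, lam: float) -> np.ndarray:
    """Apply E(rho) = (1 - lam) * rho + lam * I/d to a density matrix."""
    d = rho.shape[0]
    return (1.0 - lam) * rho + lam * np.eye(d) / d

# Pure single-qubit state |0><0| subjected to depolarizing noise.
rho = np.array([[1.0, 0.0], [0.0, 0.0]], dtype=complex)
noisy = depolarize(rho, lam=0.1)

purity_before = np.trace(rho @ rho).real    # 1.0 for a pure state
purity_after = np.trace(noisy @ noisy).real # < 1.0: the state has become mixed
```

The channel is trace-preserving, so the state stays normalized while its purity decreases, which is the mechanism degrading the success rates at large λ.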
Finally, in Fig. 6(b) we show the success rates when starting from three different initial states: the fully z-polarized state (green), the GHZ state (blue), and a ground state of the mixed-field Ising model at J = +1, g_x = g_z = 0.1 (red). The success probability of unity can be maintained for error rates λ one order of magnitude larger than in the random initial state case. Physical states such as the GHZ or ground states possess only a small amount of entanglement and hence allow us to prepare the target state using a relatively small number of two-qubit gates [cf. Fig. 2(b)-(c)]. Thus, the resulting protocols are automatically more robust to two-qubit gate errors.
The effect of other decoherence channels (amplitude and phase damping) on the QMPS circuit framework leads to qualitatively similar results and is further discussed in Supplemental Material, Sec. S.4. Furthermore, we also analyze the self-correcting property of the agent in the presence of coherent gate errors.
Let us briefly discuss the bottlenecks of the current QMPS circuit framework. The QMPS circuit depicted in Fig. 5 is composed of a three-qubit gate; more generally, an MPS with bond dimension χ = 2^n will naturally give rise to (n + 1)-qubit gates. While three-qubit gates will likely be implemented in near-term quantum computers, the gates realized in current NISQ devices commonly act on at most two qubits. Hence, any gates acting on more than two qubits would need to be decomposed into two-qubit gates, which usually gives rise to deep circuits. Instead, we can use alternative MPS-to-circuit mappings that would lead to at most two-qubit gates in the final circuit [see Supplemental Material, Sec. S.4 A for a detailed discussion] [75][76][77][78][79][80][81]. Finally, the sampling of the fidelity in Eq. (4) requires a number of measurement shots that could in principle grow exponentially in the system size. One possible solution is to choose a different Q-value network ansatz such as a matrix product operator [see Supplemental Material, Sec. S.2 D]. The resulting computation can then be interpreted as measuring an observable, which can be performed efficiently on larger systems. We find that for the specific N = 4 QMPS example, the number of measurements required for successfully preparing the target state can be chosen relatively small, i.e., ∼500 shots suffice to reach success rates close to unity [see Fig. S18(a) in Supplemental Material, Sec. S.4 and the corresponding discussion].

V. DISCUSSION/OUTLOOK
In this work we introduced a tensor-network-based Q-learning framework to control quantum many-body systems. Incorporating an MPS into the deep learning architecture allows part of the Q-value computation to be efficiently expressed as an overlap between two MPS wave functions. As a result, larger system sizes can be reached compared to learning with the full wave function. We emphasize that standard RL algorithms with conventional neural network architectures cannot handle quantum many-body states, whose number of components scales exponentially with the number of spins: e.g., for N = 32 spins, there are 2^32 ≈ 4 × 10^9 wave-function components to store, which is prohibitively expensive. By contrast, our MPS learning architecture requires only linear scaling with the system size N. Furthermore, we found that the hyperparameters of the optimization and, in particular, the number of training episodes do not require fine-tuning with the system size, and stayed roughly constant [Supplemental Material, Sec. S.2 C]. In summary, QMPS employs a tensor-network variational ansatz inspired by quantum many-body physics as a novel RL learning architecture.
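The memory argument can be made concrete with a quick parameter count (the formula N·d·χ² assumes a uniform bond dimension and ignores the smaller boundary tensors, following the standard convention):

```python
# Number of stored components: full wave function vs. MPS ansatz.
def full_wavefunction_params(n_spins: int, d: int = 2) -> int:
    return d ** n_spins          # exponential in N

def mps_params(n_spins: int, chi: int, d: int = 2) -> int:
    return n_spins * d * chi**2  # linear in N for fixed bond dimension

n = 32
full = full_wavefunction_params(n)  # 2**32 components
mps = mps_params(n, chi=4)          # 32 * 2 * 16 = 1024 parameters
```

For N = 32 and χ = 4 the MPS stores roughly six orders of magnitude fewer numbers than the full wave function.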
QMPS-based RL is designed for solving the quantum many-body control problem by learning a value function that explicitly depends on the quantum state. Therefore, a successfully trained QMPS agent is capable of devising optimal protocols for a continuous set of initial states, and selects actions on-the-fly according to the current state visited. As a result, a QMPS agent has the ability to self-correct mistakes in the protocols when the dynamics is stochastic, even before the protocols have come to an end. Moreover, we illustrated that the agent can interpolate and extrapolate to new quantum states not seen during training. Remarkably, we observed this behavior over test regions several times the size of the training region. To the best of our knowledge, no existing quantum control algorithm exhibits these desirable features, which stem from deep learning capabilities: conventional quantum control algorithms require re-running the optimization whenever the initial state changes, and thus lack any learning capabilities.
The generalization capabilities, the robustness to noise, and the feasibility of universal state preparation (for small system sizes) are advantages of the QMPS framework over competing optimal control algorithms. These features are especially relevant for experiments and modern quantum technologies that heavily rely on quantum many-body control, and in particular for NISQ devices. Moreover, we demonstrated that the present QMPS framework can be integrated in quantum device simulations by mapping the optimized MPS ansatz to gates in a quantum circuit. The resulting hybrid quantum-classical algorithm allows us to control quantum states directly on the device without the need to perform expensive quantum state tomography. Thus, unlike neural networks, using an MPS learning architecture also facilitates the use of RL agents on NISQ devices.
Our work opens up the door to further research on tensor-network-based RL algorithms for quantum (many-body) control. Due to the modular structure of the architecture, the QMPS can be replaced by various tensor networks, such as tree tensor networks (TTN) [90] or the multi-scale entanglement renormalization ansatz (MERA) [91]; these would allow different classes of states to be represented efficiently, and affect the expressivity of the ansatz. Moreover, the infinite-system-size description of iMPS can be used to devise control strategies in the thermodynamic limit, for which an efficient mapping to quantum circuits exists as well [80]. Similarly, systems with periodic boundary conditions can be studied [81]. Furthermore, tensor networks come with a comprehensive toolbox for analyzing their properties, such as the entanglement structure and correlations. Hence, tensor-network-based reinforcement learning will enable us to study the data, the training, and the expressivity of the ansatz using well-understood concepts from quantum many-body physics [92,93].
Finally, we mention that it is straightforward to use RL algorithms other than Q-learning in conjunction with our MPS-based ansatz. While we chose the DQN framework since it is off-policy and, therefore, more data-efficient compared to policy gradient methods [Sec. S.2 B], the latter would more naturally allow for continuous action spaces. In turn, with continuous controls, target states can be reached with higher fidelity [31]. One can also adapt the reward function and, for instance, consider the energy density or various distance measures beyond the fidelity.

Acknowledgments - We wish to thank Thomas Busch and Pankaj Mehta for stimulating discussions. M.B. was supported by the Marie Sklodowska-Curie grant agreement No 890711, and the Bulgarian National Science Fund within National Science Program VIHREN, contract number KP-06-DV-5 (until 25.06.2021). The authors are pleased to acknowledge that the computational work reported on in this paper was performed on the MPI PKS and OIST HPC clusters. We are grateful for the help and support provided by the Scientific Computing and Data Analysis section of Research Support Division at OIST. For the tensor network simulations we used the tensornetwork library [94]. For performance enhancements via just-in-time compilation we employed jax [95]; tenpy [96] and quspin [97] were used for benchmarking and testing.

Data availability - The data used in the figures are available upon reasonable request. The trained model parameters and corresponding learning curve data are available on GitHub [98].

Code availability - The software code can be accessed on GitHub [98].

VIDEOS

Video 1 - Transverse-field Ising control. Bloch sphere trajectory of the reduced density matrix of a single spin in the bulk (i = 15) when starting from the initial ground state at g_x = 1.01 and acting according to the optimal QMPS protocol.
Also shown are the single-particle and many-body fidelities, the half-chain von Neumann entanglement entropy, the local magnetizations ⟨σ_i⟩, and the local spin-spin correlations ⟨σ_i σ_{i+1}⟩ during each step of the protocol. The protocol can be divided into three segments, involving an initial single-particle rotation and a final many-body Euler-angle-like rotation which reduces unwanted correlations [99].
Video 2 - Self-correcting Ising control. Bloch sphere trajectories of the reduced density matrix of a single spin at the edge (i = 0) and in the bulk (i = 15) when starting from the initial ground state at J = −1, g_x = 1.2, g_z = 0.2 and preparing the critical-region target state. The red arrow/curve denotes the trajectory obtained via acting with the optimal QMPS protocol. The blue arrow/curve shows a suboptimal trajectory where at time step t = 5 an action different from the optimal one is chosen. Afterwards, the system is again evolved according to the optimally acting QMPS agent. We show two examples of perturbed trajectories consecutively. In both cases the agent is able to successfully prepare the target state despite the perturbation, thus illustrating the ability of the agent to generalize and adapt its protocols on-the-fly [99].
Video 3 - Self-correcting Ising control. Same as Video 2. However, in this case the blue arrow/curve displays the Bloch sphere trajectory subject to noisy dynamics. Specifically, at each time step we add white Gaussian noise with standard deviation σ = 0.05 to the time step duration δt±. We show two examples of noisy trajectories with different random seeds consecutively. In both cases the agent is again able to successfully prepare the target state, which illustrates the robustness of the QMPS agent to noisy dynamics [99].

Reinforcement learning (RL) framework
In RL, a control problem is defined within the framework of an environment that encompasses the physical system to be controlled, and a trainable agent which chooses control actions to be applied to the system [Fig. 7] [100]. The environment is described by a state space S and a set of physical laws that govern the dynamics of the system. We consider episodic training, and reset the environment after a maximum number T of time steps. At each time step t during the episode, the agent observes the current state s_t ∈ S and receives a scalar reward signal r_t. Depending on the current state s_t, the agent chooses the next action a_{t+1} from a set of allowed actions A; this, in turn, alters the state to s_{t+1}. The feedback loop between agent and environment is known as a Markov decision process. The goal of the agent is to find an optimal policy (a function mapping states to actions) that maximizes the expected cumulative reward in any state [Sec. S.2 B].
States - In our quantum many-body control setting, the RL state space S comprises all quantum states |ψ⟩ of the 2^N-dimensional many-body Hilbert space. Here, we consider states in the form of an MPS with a fixed bond dimension χ_ψ: if χ_ψ is smaller than the maximum bond dimension χ_max = 2^{N/2}, long-range entanglement cannot be fully captured, and the resulting MPS becomes a controlled approximation to the true quantum state [Sec. S.2 A]. Hence, state preparation of volume-law entangled states is restricted to intermediate system sizes when using MPS. On the other hand, for large system sizes, the control problems of interest typically involve initial and target states that are only weakly entangled, such as ground states of local many-body Hamiltonians. In these cases, the optimal protocol may not create excessive entanglement, suggesting that the system follows the ground state of a family of local effective Hamiltonians [101,102], similar to shortcuts-to-adiabaticity control [103], and thus justifying an MPS-based description.
Actions -If not specified otherwise, the set of available actions A contains local spin-spin interactions and single-particle rotations, as defined in Eq. (2).
Rewards - Since our goal is to prepare a specific target state, a natural figure of merit to maximize is the fidelity F_t = |⟨ψ_t|ψ*⟩|² between the current state |ψ_t⟩ and the target state |ψ*⟩. To avoid a sparse-reward problem caused by exponentially small overlaps in many-body systems, we choose the log-fidelity per spin at each time step as a reward: r_t = N^{−1} log(F_t). Moreover, we set a fidelity threshold F*, which the agent has to reach for an episode to be terminated successfully. Note that the agent receives a negative reward at each step; this provides an incentive to reach the fidelity threshold in as few steps as possible, in order to avoid accruing a large negative return R = Σ_{t=1}^{T} r_t, thus leading to short optimal protocols. For assessing the performance of the QMPS agent to prepare the target state, we show the final single-particle fidelity F_sp = F^{1/N} (the N-th root of the many-body fidelity F), as it represents a more intuitive quantity than the related log fidelity used in quantum simulation experiments. A detailed comparison of the control study results in terms of the achieved single- and many-body fidelities can be found in the Supplemental Material, Sec. S.3.
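These reward definitions can be sketched directly from the formulas above (a toy statevector example; the particular states are illustrative):

```python
import numpy as np

def reward(psi_t: np.ndarray, psi_target: np.ndarray, n_spins: int) -> float:
    """Log-fidelity-per-spin reward r_t = log(F_t)/N with F_t = |<psi_t|psi*>|^2."""
    fidelity = np.abs(np.vdot(psi_t, psi_target)) ** 2
    return np.log(fidelity) / n_spins

def single_particle_fidelity(fidelity: float, n_spins: int) -> float:
    """F_sp = F**(1/N), the N-th root of the many-body fidelity."""
    return fidelity ** (1.0 / n_spins)

n = 4
dim = 2 ** n
# Target: the all-zero basis state; current state: a slightly rotated copy.
target = np.zeros(dim); target[0] = 1.0
current = np.zeros(dim); current[0], current[1] = np.cos(0.1), np.sin(0.1)

f = np.abs(np.vdot(current, target)) ** 2
r = reward(current, target, n)            # negative; approaches 0 as F -> 1
f_sp = single_particle_fidelity(f, n)     # always larger than F for F < 1
```

The per-spin logarithm keeps the reward at an O(1) scale even when the many-body overlap itself is exponentially small in N.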
In the case where the target state is the ground state of a Hamiltonian H, we can also define the reward in terms of the energy expectation value E_t = ⟨ψ_t|H|ψ_t⟩. Specifically, we can choose

r_t = N^{−1} (E_0 − E_t),

where E_0 denotes the ground-state energy of H. Similarly to the log-fidelity, this reward is always negative and becomes zero when evaluated on the target ground state. If the target state, and therefore also its energy, is a priori not known, one can alternatively replace E_0 with a large negative baseline, which ensures that the rewards are always negative during training. Another advantage of the energy reward is the fact that expectation values can be efficiently calculated on a quantum device (in contrast to the fidelity). We also report results obtained with this reward.

Training - Each training episode starts by sampling an initial state, followed by taking environment (state evolution) steps. An episode terminates once the fidelity threshold is reached. After every environment step an optimization step is performed [see Sec. S.2 B for a detailed explanation of the algorithm].
We note in passing that we do not fix the length of an episode (the number of protocol steps) beforehand, and the agent always tries to find the shortest possible protocol to prepare the target state. However, we terminate each episode after a maximum number of allowed steps even if the target state has not been successfully prepared yet; otherwise episodes, especially at the beginning of training, can become exceedingly long, leading to infeasible training times.

Matrix product state ansatz for Q-learning (QMPS)
We choose Q-learning to train our RL agent [Sec. S.2 B], since it is off-policy and, thus, more data-efficient compared to policy gradient methods. The optimal Q-function Q*(ψ, a) defines the total expected return starting from the state |ψ⟩, selecting the action a, and then following the optimal protocol afterwards. Intuitively, the optimal action in a given state maximizes Q*(ψ, a). Hence, if we know Q*(ψ, a) for every state-action pair, we can solve the control task. In Q-learning this is achieved indirectly, by first finding Q*. This approach offers the advantage of re-using the information stored in Q* even after training is complete.
Since the state space is continuous, it becomes infeasible to learn the exact Q*-values for each state. Therefore, we approximate Q* ≈ Q*_θ using a function parametrized by variational parameters θ, and employ the DQN algorithm to train the RL agent [104]. In this work, we introduce a novel architecture for the Q*-function, based on a combination of an MPS and an NN, called QMPS, which is specifically tailored for quantum many-body states that can be expressed as an MPS [Fig. 7]. We emphasize that the QMPS is independent of the MPS representation of the quantum state, and has its own bond dimension χ_Q.
To calculate Q_θ(ψ, a) for each possible action a in a quantum state |ψ⟩, we first compute the overlap between the quantum state MPS and the QMPS. The contraction of two MPS can be performed efficiently and scales only linearly in the system size for fixed bond dimensions. The output vector of the contraction, corresponding to the dangling leg of the central QMPS tensor, is then interpreted as a feature vector of dimension d_f, which is used as an input to a small fully-connected neural network [Fig. 7]. Adding an NN additionally enhances the expressivity of the Q*_θ ansatz by making it nonlinear. The final NN output contains the Q*-values for each different action.
The QMPS feature vector can be naturally written as an overlap between the quantum state MPS |ψ⟩ and the QMPS |θ_Q⟩. Thus, the Q*-value can be expressed as

Q_θ(ψ, a) = f_θ( N^{−1} log ⟨θ_Q|ψ⟩ ),    (4)

where f_θ(·) denotes the neural network and ⟨θ_Q|ψ⟩ is the d_f-dimensional feature vector. We apply the logarithm and divide by the number of spins N in order to scale the QMPS framework to a larger number of particles. Note also that the QMPS does not represent a physical wave function (it is not normalized); however, for ease of notation, we still express it using the bra-ket formalism. Thus, the trainable parameters θ of the Q*-function contain the N + 1 complex-valued QMPS tensors of |θ_Q⟩, plus the real-valued weights and biases of the subsequent NN. The QMPS feature dimension d_f and the QMPS bond dimension χ_Q are hyperparameters of the optimization, which determine the number of variational parameters of the MPS, in analogy to the hidden dimension of neural networks. An advantage of the MPS architecture is that we can open up the black box of the ansatz and training using well-understood concepts from the MPS toolbox. For example, it allows us to analyze the correlations in the quantum state and the QMPS by studying their entanglement properties.
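A minimal sketch of this Q-value evaluation with plain numpy (real-valued random tensors; the complex parameters and the log rescaling of the overlap are omitted for brevity, and all shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
N, d, chi, d_f, n_actions = 4, 2, 2, 3, 5

# State MPS |psi>: tensors of shape (left_bond, physical, right_bond).
bond = [1, chi, chi, chi, 1]
psi = [rng.normal(size=(bond[i], d, bond[i + 1])) for i in range(N)]

# QMPS |theta_Q>: same layout, but the central tensor carries an extra
# "feature" leg of dimension d_f (the dangling leg in Fig. 7).
center = 1
theta = [rng.normal(size=(bond[i], d, bond[i + 1])) for i in range(N)]
theta[center] = rng.normal(size=(bond[center], d, d_f, bond[center + 1]))

# Contract <theta_Q|psi> site by site; cost is linear in N for fixed bonds.
left = np.ones((1, 1))
for i in range(center):
    left = np.einsum('ac,ajb,cjd->bd', left, psi[i], theta[i])
right = np.ones((1, 1))
for i in range(N - 1, center, -1):
    right = np.einsum('ajb,cjd,bd->ac', psi[i], theta[i], right)
feature = np.einsum('ac,ajb,cjfd,bd->f', left, psi[center], theta[center], right)

# Small fully-connected network mapping the feature vector to Q-values.
w1, b1 = rng.normal(size=(8, d_f)), rng.normal(size=8)
w2, b2 = rng.normal(size=(n_actions, 8)), rng.normal(size=n_actions)
q_values = w2 @ np.tanh(w1 @ feature + b1) + b2
action = int(np.argmax(q_values))   # greedy action selection
```

The left/right environment contractions are the standard linear-in-N MPS overlap; only the central tensor's extra leg survives, yielding the d_f-dimensional NN input.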
Alternatively to the MPS ansatz, we can also represent the parameters of the Q-value network in terms of a matrix product operator (MPO) θ̂_Q. The Q-value computation then amounts to computing the expectation value of the MPO ansatz with respect to the input quantum state, i.e., Q_θ(ψ, a) = f_θ(⟨ψ|θ̂_Q|ψ⟩). For a detailed explanation of this architecture, we refer to the Supplemental Material, Sec. S.2 D. Furthermore, in Supplemental Material, Sec. S.3 A 2 we provide a performance comparison of different QMPS/NN architecture choices for the N = 4 qubit problem discussed in Sec. III A.
Note that the resources (time and memory) for training the QMPS framework scale at worst polynomially in any of the parameters of the system and the ansatz, such as the QMPS bond dimension χ_Q, the feature dimension d_f, and the local Hilbert space dimension d = 2. Furthermore, QMPS reduces an exponential scaling of the resources with the system size N to a linear scaling in N, therefore allowing efficient training on large spin systems.

Quantum many-body physics deals with the behavior of large collections of interacting quantum particles, such as electrons, atoms, or spins [105,106]. Hence, quantum many-body physics is used to study materials at the atomic scale, which can provide insights into their macroscopic features, such as conductivity, polarizability, or magnetism. Quantum many-body systems can exhibit a wide range of phenomena, from the emergence of novel phases of matter to the occurrence of quantum phase transitions and collective behavior, including exotic electronic and magnetic properties that are relevant to the development of novel quantum technologies.
Quantum simulators provide a versatile testbed for studying many-body physics by mimicking the behavior of these quantum materials in a controlled environment [107]. By engineering the interactions between the constituent particles of the quantum simulator, one can create a system that displays properties similar to the target system, which would otherwise be difficult or impossible to realize in real materials. Examples of quantum simulators include ultracold atoms in optical lattices [3], nitrogen-vacancy centres [5], trapped ions [4], or photonic systems [108].
Paradigmatic systems that are often realized in quantum simulator platforms and that allow us to study a range of interesting quantum many-body phenomena are quantum spin models like the Ising model considered in the main text [109]. The quantum Ising model describes interactions between spins on a lattice. The quantum state of the spin system is governed by the Ising Hamiltonian [see Eq. (3)], which includes a coupling term between neighboring spins that tends to (anti-)align the spins depending on the sign of the coupling constant, and a magnetic field term that tries to align each spin along the direction of the field. The competition between the spin interaction and the magnetic field gives rise to different phases of matter and thus to a quantum phase transition.
Unlike classical phase transitions (such as the liquid-gas transition in water), a quantum phase transition occurs at zero temperature by varying a non-thermal parameter of the system like the magnetic field [110]. At the phase transition the properties of the system abruptly change, which leads to non-analyticities in the ground state properties such as diverging susceptibilities and correlation lengths. The latter are also typical characteristics of continuous phase transitions, which feature a power-law scaling of correlations and the emergence of universal behavior that is independent of the specific details of the system. For example, the transverse-field Ising model exhibits a critical point at J = g_x where the system transitions from a disordered paramagnetic phase, where spins are aligned along the magnetic field in the x-direction, to an ordered ferromagnetic phase, where spins are uniformly aligned in either the positive or negative z-direction.
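This competition between field and interaction can be checked with a small exact diagonalization (a generic textbook sketch, independent of the paper's MPS code; the chain length N = 8 and field values are illustrative):

```python
import numpy as np

SX = np.array([[0.0, 1.0], [1.0, 0.0]])
SZ = np.array([[1.0, 0.0], [0.0, -1.0]])

def site_op(single: np.ndarray, site: int, n: int) -> np.ndarray:
    """Embed a single-spin operator at `site` in an n-spin Hilbert space."""
    out = np.array([[1.0]])
    for i in range(n):
        out = np.kron(out, single if i == site else np.eye(2))
    return out

def tfim_ground_state(n: int, j: float, g: float) -> np.ndarray:
    """Ground state of H = -J sum_i sz_i sz_{i+1} - g sum_i sx_i (open chain)."""
    h = np.zeros((2**n, 2**n))
    for i in range(n - 1):
        h -= j * site_op(SZ, i, n) @ site_op(SZ, i + 1, n)
    for i in range(n):
        h -= g * site_op(SX, i, n)
    _, vecs = np.linalg.eigh(h)
    return vecs[:, 0]

n = 8

def end_to_end_zz(psi: np.ndarray) -> float:
    """Correlation <sz_0 sz_{N-1}> as an order-parameter proxy."""
    return float(psi @ site_op(SZ, 0, n) @ site_op(SZ, n - 1, n) @ psi)

c_ordered = end_to_end_zz(tfim_ground_state(n, j=1.0, g=0.2))     # close to 1
c_disordered = end_to_end_zz(tfim_ground_state(n, j=1.0, g=2.0))  # close to 0
```

Deep in the ferromagnetic phase the end-to-end correlation stays near unity, while in the paramagnetic phase it decays essentially to zero across the chain.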
Quantum many-body ground states can also be studied and classified according to their entanglement properties [111]. To that end, the von Neumann entanglement entropy can be used as a measure of the amount of entanglement between a subset of particles and the rest of the system. For any bipartition A/B of the system, it is defined in terms of the reduced density matrix ρ_A = Tr_B |ψ⟩⟨ψ| as

S_A = −Tr(ρ_A log ρ_A).

The entanglement entropy can be regarded as a measure of how correlated a system is and thus, it is usually large for systems that are highly correlated. For ground states of local, gapped Hamiltonians it has been shown that the entanglement entropy scales proportionally to the boundary area of the partition and hence, these states are commonly referred to as area-law entangled states. On the other hand, gapless critical states result in a logarithmic scaling of the entanglement entropy. And finally, for systems with long-range interactions, the entanglement is usually spread over a large region of space. Hence, the entanglement entropy typically scales with the volume of the partition and the corresponding states are said to be volume-law entangled. Note that most quantum states of a many-body Hilbert space display a volume law of entanglement. Nonetheless, area-law entangled states have received widespread attention since they can be efficiently simulated using the matrix product state formalism [15,16].
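The definition can be checked on a two-qubit Bell state, whose entanglement entropy is exactly log 2 (a generic textbook example):

```python
import numpy as np

def entanglement_entropy(psi: np.ndarray, dim_a: int) -> float:
    """Von Neumann entropy S = -Tr(rho_A log rho_A) for a bipartition A/B."""
    dim_b = psi.size // dim_a
    m = psi.reshape(dim_a, dim_b)
    rho_a = m @ m.conj().T           # reduced density matrix rho_A = Tr_B |psi><psi|
    evals = np.linalg.eigvalsh(rho_a)
    evals = evals[evals > 1e-12]     # drop numerical zeros before taking the log
    return float(-np.sum(evals * np.log(evals)))

# Bell state (|00> + |11>)/sqrt(2): maximally entangled two-qubit state.
bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2.0)
s_bell = entanglement_entropy(bell, dim_a=2)

# Product state |00>: no entanglement at all.
product = np.array([1.0, 0.0, 0.0, 0.0])
s_product = entanglement_entropy(product, dim_a=2)
```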

S.2. COMPUTATIONAL METHODS
A. Matrix product states (MPS)

MPS in quantum physics
Matrix product states (MPS) define a class of tensor networks (TN), a computational tool developed to simulate quantum many-body systems [13,14]. Consider a lattice of N sites with local on-site Hilbert space dimension d_j on site j [for spin-1/2 systems, d_j = 2]. In general, the amplitudes ψ_{j_1,...,j_N} of a many-body wave function can be represented as a rank-N tensor, where j_i ∈ {1, . . . , d_i}. A TN constitutes a decomposition of the rank-N tensor into a product of lower-rank tensors. In particular, an MPS represents the wave function as a contraction over N rank-3 tensors, cf. Fig. S1:

|ψ⟩ = Σ_{j_1,...,j_N} A^{[1]}_{j_1} A^{[2]}_{j_2} ··· A^{[N]}_{j_N} |j_1, j_2, . . . , j_N⟩,

where |j_1, j_2, . . . , j_N⟩ denotes the basis states. The physical indices j_i ∈ {1, . . . , d_i} run over the local Hilbert space dimension. The non-physical indices, denoted by α, are referred to as bond indices; the corresponding bond dimension χ_i determines the number of parameters in the ansatz, which scales as Nχ²d for a uniformly fixed bond dimension.
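Any state vector can be brought into this form by sequential SVDs (a generic from-scratch sketch, not the tensornetwork-library code used in the paper; no truncation is applied, so the decomposition is exact):

```python
import numpy as np

def to_mps(psi: np.ndarray, n: int, d: int = 2) -> list:
    """Exact MPS decomposition of an n-site state via sequential SVDs."""
    tensors, rest, chi_left = [], psi.reshape(1, -1), 1
    for _ in range(n - 1):
        m = rest.reshape(chi_left * d, -1)
        u, s, vh = np.linalg.svd(m, full_matrices=False)
        chi_right = s.size
        tensors.append(u.reshape(chi_left, d, chi_right))  # rank-3 site tensor
        rest = np.diag(s) @ vh       # push the remaining weight to the right
        chi_left = chi_right
    tensors.append(rest.reshape(chi_left, d, 1))
    return tensors

def contract(tensors: list) -> np.ndarray:
    """Recover the full state vector from the MPS tensors."""
    out = tensors[0]                 # shape (1, d, chi)
    for t in tensors[1:]:
        out = np.einsum('aib,bjc->aijc', out, t)
        out = out.reshape(out.shape[0], -1, out.shape[-1])
    return out.reshape(-1)

n = 6
rng = np.random.default_rng(1)
psi = rng.normal(size=2**n)
psi /= np.linalg.norm(psi)
mps = to_mps(psi, n)
reconstructed = contract(mps)
```

Without truncation the bond dimensions grow toward the chain center (up to 2^{N/2}); truncating the singular values is what makes the ansatz efficient for area-law states.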
It has been shown that MPS define an efficient representation of states obeying the area law of entanglement in one dimension, i.e., states whose entanglement entropy at any bond of the lattice is independent of the system size N [15,16]. For such states, the bond dimension χ becomes system-size independent and all MPS operations, such as computing overlaps, scale only linearly in N [as compared to the exponential scaling for generic states].

Figure S1. (a) Diagrammatic notation of a rank-3 tensor. (b) Diagrammatic representation of an MPS, where d denotes the local Hilbert space dimension and χ the bond dimension of the virtual legs that are contracted over.
A class of states that automatically fulfills the area law are ground states of local, gapped Hamiltonians. In contrast, states for which the entanglement entropy depends on the system size, e.g., linearly (volume-law states) or logarithmically (critical states), will inevitably require a bond dimension that increases with the number of spins N for an exact representation; this results in computational resources scaling, at worst, exponentially with N. The latter class includes quantum states of systems taken far from equilibrium, which feature a ballistic growth of the entanglement entropy [112,113]. Hence, the applicability of MPS is usually restricted to short-time dynamics or to time evolution generated by local Hamiltonians with times scaling at worst polynomially in the number of spins [65]. These conditions are satisfied by our proposed control setup and thus justify a description using MPS. In Sec. III B and Sec. S.3 B we also analyze the entanglement entropies of the quantum states during time evolution, which show that they can indeed be faithfully approximated by MPS.
MPS come with a well-developed toolbox of algorithms for calculating ground states (e.g., DMRG) [11], computing time evolution (e.g., TEBD), and with efficient algorithms for computing overlaps and expectation values of local observables [13,14]. In our simulations, we use the two-site density matrix renormalization group (DMRG) algorithm provided in the TensorNetwork library [94] to compute ground states, and a custom simplified version of time-evolving block decimation (TEBD) to time-evolve quantum states. Since we are only employing unitaries generated by a sum of commuting operators, the time evolution can be simplified, leading to a considerable computational speed-up. Each time evolution step is carried out by applying a sequence of single/two-site operators along the spin chain. After every application of a two-site operator, a singular value decomposition is performed to maintain the MPS form and to reduce the bond dimension by truncating the matrix containing the singular values. The error introduced in this process can be quantified by the truncation weight ε, which is calculated from the norm of the discarded singular values λ_α and represents an upper bound for the total truncation error after one full TEBD sweep. As a measure of entanglement we use the von Neumann entanglement entropy; it can be expressed in terms of the singular values λ_α for any bipartition of the system at bond i:

S_i = −Σ_α λ_α² log λ_α².
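The truncation step after a two-site update can be sketched as follows (a generic TEBD-style sketch; the block sizes and bond-dimension cap are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, chi, chi_max = 2, 8, 4

# Normalized two-site block theta, legs grouped as (left_bond*d) x (d*right_bond).
theta = rng.normal(size=(chi * d, d * chi))
theta /= np.linalg.norm(theta)

# SVD splits the block back into two site tensors.
u, s, vh = np.linalg.svd(theta, full_matrices=False)

# Keep only the chi_max largest singular values; the truncation weight
# (sum of discarded lambda^2) quantifies the error of this step.
truncation_weight = float(np.sum(s[chi_max:] ** 2))
approx = u[:, :chi_max] @ np.diag(s[:chi_max]) @ vh[:chi_max, :]

# Renormalized kept spectrum and the von Neumann entropy of the bond.
lam = s[:chi_max] / np.linalg.norm(s[:chi_max])
entropy = float(-np.sum(lam**2 * np.log(lam**2)))
```

By the Eckart-Young theorem the squared Frobenius error of the truncated block equals the truncation weight, which is why the discarded spectrum bounds the error of the sweep.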

MPS in machine learning
While MPS were originally developed within the condensed matter community to study the low-energy properties of quantum many-body systems, the variational expressivity of MPS and, more generally, tensor networks has recently been harnessed to solve machine learning problems [58-60,92,114-127]. In one of the first examples of MPS-based machine learning, an MPS ansatz is used to compress the weight matrix in a linear classification task [58]. In this approach the classical data is first mapped to a "spin" state defined on a higher-dimensional space, which is then contracted with the weight MPS. The latter features one tensor (usually in the center) containing a dangling leg representing the resulting prediction of the model, e.g., the class probabilities. So far, most applications of MPS-based machine learning have involved supervised learning tasks such as classification [58,92,114-116], or unsupervised learning tasks such as generative modeling [59,117,118], sequence modeling [119,120], and anomaly detection [121]. Recently, MPS have also been utilized as a feature map for classical data in a quantum reinforcement learning (RL) framework [52].
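A minimal sketch of this construction follows; the local feature map, tensor shapes, and random weights are our own illustrative choices, not the architecture of Ref. [58].

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, chi, n_classes = 6, 2, 4, 3

def phi(x):
    """Local 'spin' feature map for a scalar input x in [0, 1] (assumed encoding)."""
    return np.array([np.cos(np.pi * x / 2), np.sin(np.pi * x / 2)])

# Random weight MPS; the center tensor carries the dangling output (class) leg.
left = [rng.normal(scale=0.5, size=(chi, d, chi)) for _ in range(N // 2)]
right = [rng.normal(scale=0.5, size=(chi, d, chi)) for _ in range(N - N // 2)]
center = rng.normal(scale=0.5, size=(chi, n_classes, chi))

def predict(xs):
    """Contract the encoded product state with the weight MPS -> class scores."""
    v = np.ones(chi)
    for A, x in zip(left, xs[: N // 2]):
        v = np.einsum('a,abc,b->c', v, A, phi(x))
    v = np.einsum('a,aoc->oc', v, center)  # keep the class leg open
    w = np.ones(chi)
    for A, x in zip(reversed(right), reversed(xs[N // 2 :])):
        w = np.einsum('abc,b,c->a', A, phi(x), w)
    return v @ w  # one score per class

scores = predict(rng.random(N))
```

In a real classifier these scores would be fed into a loss (e.g., cross-entropy) and the tensors optimized by gradient descent or a DMRG-style sweep.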
The MPS architecture can be optimized via conventional gradient descent, and the gradients can be obtained through backpropagation analogous to the optimization of neural network parameters. Alternatively, one can use a DMRG-style routine for MPS that only locally optimizes one or two tensors at a time, while sweeping back and forth through the MPS [58]. This algorithm has the advantage that the bond dimension can be adapted dynamically during the optimization and has also shown better stability for large system sizes (N > 200) where backpropagation fails due to exponentially vanishing gradients [128,129]. Since we are dealing with intermediate system sizes N < 200 in this study, we used backpropagation for calculating the gradients of the MPS tensors.

B. Q-learning and DQN
Within the RL framework the agent chooses actions according to a strategy, called policy π(a|s), a function that assigns a probability to every action a depending on the current state s of the environment [100]. The goal is to find the optimal policy π*(a|s), i.e., the optimal action to take in any state s, maximizing the expected return R = E_π[Σ_{t=0}^{T} r_t | s_0 = s] starting from state s and following the policy π. In this work, we consider episodic tasks involving a termination condition (e.g., a fidelity threshold) which the agent has to reach within a fixed number of steps T. Once the termination condition is satisfied, the episode is over and the environment is reset. This is in contrast to non-episodic tasks, which continue indefinitely.
Q-learning is a model-free RL algorithm in which the agent learns an optimal policy π*(a|s) solely via observing environment transitions, i.e., without knowing or building a representation of the environment dynamics, and without access to any prior information about the system [130]. To every fixed policy π (optimal or suboptimal), we can assign a Q-function, defined as the expected return starting from state s, taking action a, and following the policy π afterwards: Q^π(s, a) = E_π[Σ_{t=0}^{T} γ^t r_t | s_0 = s, a_0 = a]. The discount factor γ ∈ (0, 1] gives a higher importance to immediate rewards and therefore ensures stability for continuing, non-episodic RL tasks. In Q-learning, the optimal policy π* is found indirectly through learning the optimal Q-value function Q*(s, a) that gives the maximum expected cumulative discounted reward: Q*(s, a) = max_π Q^π(s, a). (S.7) Once the optimal Q-values are known, the optimal policy is deterministic: π*(s) = arg max_a Q*(s, a), i.e., it is given by greedily taking actions according to the maximum optimal Q-value in each state.

Tabular Q-Learning
When the state space is discrete, the optimal Q-function can be learned using tabular Q-learning through an iterative update rule derived from the Bellman optimality equation [130]: Q_{k+1}(s, a) ← Q_k(s, a) + α δ_k, where k denotes the iteration step of the algorithm, α ∈ (0, 1] is the learning rate, and δ_k = r + γ max_{a'} Q_k(s', a') − Q_k(s, a) is the temporal difference error. Note that Q-learning only requires isolated tuples (s, a, r, s'), known as transitions, and not complete trajectories. Moreover, due to the presence of the max function in the update rule above, the algorithm is off-policy: this means that the transitions can come from any policy (also old ones), and yet the updated Q-function approaches the optimal Q*. Convergence is guaranteed if each possible state-action pair (s, a) can, in principle, be visited infinitely often. To fulfill this condition the agent has to explore sufficiently many different state-action pairs. At the same time, the agent should also exploit the high-reward transitions, especially towards the end of training when the Q-value estimates have mostly converged to their true values. A common choice of behavior policy to follow during training that balances this exploration-exploitation dilemma is the ε-greedy policy: a = {random action, with probability ε; arg max_{a'} Q^π(s, a'), otherwise}, (S.9) i.e., the agent chooses a random action with some small probability ε and the greedy action maximizing the Q-value otherwise. The hyperparameter ε can be decreased, e.g., exponentially, starting from a value close to 1 (exploration-dominated regime) at the beginning of training, to a small value, e.g., ε = 0.01, leading to less exploration and more exploitation as training progresses.

Deep Q-Learning
For large or continuous state spaces, such as Hilbert spaces, the tabular Q-learning algorithm described above is inapplicable. In such cases, it is only possible to learn an approximation to the optimal Q-values, Q_θ(s, a) ≈ Q*(s, a), given by a parameterized function, e.g., a neural network [104]. The parameters θ of the variational ansatz are then optimized by minimizing the expected mean-square temporal difference error L_k(θ_k) = E_{(s,a,r,s')∼R}[(y_k − Q_{θ_k}(s, a))²]. (S.10) The minibatch of transitions (s, a, r, s'), used in each optimization step k, is uniformly sampled from a fixed-size replay buffer R that contains previously collected transitions from agent-environment interactions. Since Q-learning is an off-policy algorithm, the transitions used for updating the Q-values do not have to coincide with the target policy, allowing the use of experience replay. Thus, the subroutine of collecting environment transitions can be run independently and, if necessary, in parallel to the optimization subroutine, thus speeding up training. Therefore, the use of a replay buffer makes Q-learning more data-efficient than policy gradient methods.
Note that the RL loss function L_k in Eq. (S.10) is different from the loss in supervised learning, in that the regression target y_k = r + γ max_{a'} Q_{θ̄_k}(s', a') itself depends on the parameterized Q-values that have to be learned; therefore, the target (i.e., the label) changes in the course of training. This running target makes DQN different from ordinary gradient descent, and is the reason for the lack of convergence guarantees in DQN. To stabilize deep Q-learning, a second target Q-value network Q_{θ̄_k}(s, a) is introduced, whose parameters θ̄ are held fixed during the optimization step. The optimized parameters θ are periodically copied to the target network, θ̄ ← θ.
Finally, we also employ double Q-learning to reduce overestimation errors in the Q-values [131]. Here, the regression target is replaced by y_k = r + γ Q_{θ̄_k}(s', arg max_{a'} Q_{θ_k}(s', a')). (S.11) The full training algorithm is called DQN and we show the corresponding pseudocode in Algorithm 1.
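The difference between the standard DQN target and the double-Q target of Eq. (S.11) can be sketched as follows (the minibatch values are purely illustrative):

```python
import numpy as np

def dqn_targets(r, q_next_target, done, gamma=0.99):
    """Standard DQN regression target: r + gamma * max_a' Q_target(s', a')."""
    return r + gamma * (1.0 - done) * q_next_target.max(axis=1)

def double_dqn_targets(r, q_next_online, q_next_target, done, gamma=0.99):
    """Double DQN: the online network selects the action, the target network
    evaluates it, which reduces the overestimation bias of the max operator."""
    a_star = np.argmax(q_next_online, axis=1)
    q_eval = q_next_target[np.arange(len(r)), a_star]
    return r + gamma * (1.0 - done) * q_eval

# Minibatch of 3 transitions (illustrative numbers).
r = np.array([0.0, 1.0, 0.5])
done = np.array([0.0, 1.0, 0.0])
q_no = np.array([[0.2, 0.9], [0.1, 0.4], [0.8, 0.3]])  # online net at s'
q_nt = np.array([[0.3, 0.5], [0.2, 0.2], [0.6, 0.7]])  # target net at s'

y_double = double_dqn_targets(r, q_no, q_nt, done)
y_plain = dqn_targets(r, q_nt, done)
```

Whenever the online and target networks disagree on the best next action, the double-Q target is no larger than the plain max-based target, which is the mechanism behind the reduced overestimation.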

C. Details of the QMPS architecture and training
In all examples discussed in this manuscript the QMPS tensors are initialized as identity matrices with Gaussian noise (σ = 0.2) added to all components, both for the real and the imaginary parts. The tensors are additionally scaled by a factor of 0.25. The neural network weights and biases are initialized with real Gaussian random numbers (σ = 0.1). All parameter values of the QMPS framework are summarized in Tab. I. The values of the hyperparameters, including the time evolution step sizes δt±, are obtained by performing a coarse grid search, i.e., we trained on a few different parameter values and selected the ones which yield the best performance. Note that we adopt slightly different values for δt+ and δt−. This choice prevents the agent from simply undoing an action by evolving with the inverse operator, which helped stabilize training.
As mentioned in the main text, a single QMPS agent is not able to reach arbitrarily high fidelities due to the discreteness of the action space, the constant step size, and the fixed maximum episode length. An additional challenge is posed by the large deviation in the expected return values for states at the beginning and the end of an episode: the QMPS network is not able to resolve small differences in the reward, which is, however, required close to the target state where the log fidelities approach zero. Therefore, we introduce a multi-stage learning scheme in Section III A where successive agents with tighter fidelity thresholds are trained, starting from states which are pre-prepared by agents optimized on smaller thresholds. This training strategy also allows the step size to be chosen separately for each agent.

Optimization
The gradients of the neural network and the QMPS parameters can be computed via conventional backpropagation and, in principle, any automatic differentiation library can be employed for this task. However, we obtained a considerable speed-up (a factor of ∼10) by implementing the gradient computation from scratch. The neural network takes as input the real-valued QMPS feature vector and therefore its parameters are chosen to be real-valued as well. On the other hand, restricting the QMPS tensors to real numbers greatly limited the expressivity of the ansatz. Hence, each tensor component comprises both a real and an imaginary parameter. Since the overall QMPS ansatz is not holomorphic (the absolute value is not complex differentiable), the real and imaginary parameters have to be updated independently by computing the gradient with respect to each of them separately.
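A toy illustration of optimizing a non-holomorphic objective by treating real and imaginary parts as independent real parameters: here we maximize a fidelity |⟨θ|ψ⟩|² with finite-difference gradients standing in for backpropagation (the target state and sizes are arbitrary choices, not the paper's setup).

```python
import numpy as np

rng = np.random.default_rng(2)

# Random normalized complex target state of dimension 4.
psi = rng.normal(size=4) + 1j * rng.normal(size=4)
psi /= np.linalg.norm(psi)

def fidelity(params):
    """|<theta|psi>|^2 with theta assembled from separate real/imag parameters.
    The objective is real-valued and hence not holomorphic in theta, but it is
    an ordinary smooth function of the 8 real parameters."""
    theta = params[:4] + 1j * params[4:]
    theta = theta / np.linalg.norm(theta)
    return abs(np.vdot(theta, psi)) ** 2

# Gradient ascent using central finite differences on each real parameter.
params = rng.normal(scale=0.5, size=8)
for _ in range(300):
    grad = np.zeros(8)
    for i in range(8):
        e = np.zeros(8)
        e[i] = 1e-6
        grad[i] = (fidelity(params + e) - fidelity(params - e)) / 2e-6
    params += 0.5 * grad

final = fidelity(params)
```

Since the objective is a Rayleigh quotient of a rank-one projector, gradient ascent converges to unit fidelity from generic initializations.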

Compute resources
For a system of size N, local Hilbert space dimension d, and uniformly fixed bond dimension χ, the number of MPS parameters scales as Nχ²d. The quantum state MPS time evolution (based on SVD and matrix multiplication) as well as the QMPS optimization (based only on matrix multiplication) scale linearly in N and at worst polynomially in χ and d. We have not fixed the bond dimensions of the MPS and QMPS to be uniform on all sites, but rather let both of them grow exponentially from the boundary up to a maximum uniform bond dimension in the middle of the MPS.
Note that the linear complexity scaling of the QMPS framework stands in contrast to the exponential scaling expected when using a conventional neural network architecture and training on the full wave function. Taking the N = 32 case study as an example and assuming a batch size of 64, one would need to load 64 × 2^32 ≈ 3 × 10^11 parameters onto the CPU/GPU, which would result in at least 1 TB of data. Additionally, we would have to store the parameters of the neural network, which takes as input the full wave function, and take into account the replay buffer, which stores an extra ∼8000 states. Hence, from the viewpoint of memory resources alone, a tensor-network approach is required for the simulation of quantum many-body systems and the training thereof. Merging the TN architectures with RL is what enables the ML-based control of 32 and more spins discussed in the main text.
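A back-of-the-envelope version of this estimate (assuming single-precision storage; counting real and imaginary parts separately would double the totals):

```python
# Memory needed to feed full N = 32 qubit wave functions, batch size 64,
# into a neural network.
N, batch = 32, 64
amplitudes = batch * 2 ** N      # ~2.7e11 numbers
bytes_f32 = amplitudes * 4       # ~1.1e12 bytes, i.e. roughly 1 TB in float32

# Compare with the MPS parameter count N * chi^2 * d for, e.g., chi = 16, d = 2:
chi, d = 16, 2
mps_params = N * chi ** 2 * d    # 16384 parameters, trivially small in memory
```

The seven-orders-of-magnitude gap between the two counts is the memory argument made above.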
For the hyperparameters chosen in this study, most time was spent in the optimization step, which requires two forward passes and one backward pass on a batch of input states. Overall, one full episode of training (including 50 environment and optimization steps) for N = 32, χ_Q = 32, χ_ψ = 16, d_f = 32, and a batch size of 64 took 8.7 sec on an Intel Xeon Gold 6230 CPU and 1.5 sec on an NVIDIA Tesla P100 SXM2 GPU. Reducing the QMPS bond dimension to χ_Q = 16 leads to runtimes of 4.3 sec (CPU) and 1.5 sec (GPU), respectively. Let us note that the code has not been optimized for a GPU and with some modifications an even larger speedup can be expected. Therefore, larger system sizes should also be within reach in the near future.

D. QMPO architecture
The QMPS architecture makes use of an MPS ansatz to extract relevant features from the input quantum state. Hence, we can interpret the QMPS as a set of quantum states (up to normalization); calculating the feature vector then amounts to computing the fidelity between the input state and each QMPS state. However, rather than learning parameterized "quantum states" |θ_Q⟩ and evaluating inner products, we could also learn parameterized operators (or observables) θ̂_Q that act on the evolved physical quantum state instead. If the operators are further restricted to be hermitian, we can express the feature vector as an expectation value of the operator θ̂_Q, and hence the Q-values are given by f_θ(⟨ψ|θ̂_Q|ψ⟩), (S.13) where f_θ denotes the subsequent NN that the feature vectors are fed through.
Similarly to the MPS representation of quantum states, we can decompose a hermitian operator θ̂ into a product of local rank-4 tensors, called a matrix product operator (MPO) [132]. By inserting an additional tensor with a dangling leg at the center of the MPO, we introduce an ansatz analogous to the QMPS that maps an input quantum state to a feature vector via computing expectation values [see Fig. S2].
The QMPO tensors are initialized as identity matrices with Gaussian noise (σ = 0.5) added both to the real and imaginary parts of each parameter. When training, we work with real parameters only. Therefore, to guarantee hermiticity, each local operator is written as a product of an unconstrained operator X̂ with its hermitian conjugate, Ô = X̂X̂†, (S.14) and a Cholesky decomposition can be used to retrieve the X̂ operators from a given Ô. We then define the optimizable parameters to be the real and imaginary parts of the X̂ operators. Training works analogously to the QMPS optimization, that is, via gradient descent and backpropagation. The only difference is that, due to the hermiticity requirement on the operators Ô, we always need to perform the additional matrix multiplication step of Eq. (S.14).
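A small NumPy sketch of this parameterization (dimensions and random values are illustrative): assembling a hermitian operator from an unconstrained complex matrix X guarantees real expectation values by construction.

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 4

# Unconstrained complex matrix stored as independent real and imaginary parts,
# which are the actual optimizable parameters.
x_re = rng.normal(size=(dim, dim))
x_im = rng.normal(size=(dim, dim))

def hermitian_op(x_re, x_im):
    """Assemble O = X X^dagger, which is hermitian (indeed positive
    semidefinite) for any value of the real parameters."""
    X = x_re + 1j * x_im
    return X @ X.conj().T

O = hermitian_op(x_re, x_im)

# Expectation values of O in any normalized state are therefore real:
psi = rng.normal(size=dim) + 1j * rng.normal(size=dim)
psi /= np.linalg.norm(psi)
exp_val = psi.conj() @ O @ psi
```

The extra matrix multiplication X X† is the overhead mentioned in the text; in the full QMPO this happens tensor-by-tensor rather than on a dense matrix.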
The QMPO framework has some advantages over QMPS. First, the Q-value computation can now be interpreted as the measurement of an observable, which can be performed efficiently on NISQ devices if we restrict the observable to be local. Second, the expectation value does not vanish (or explode) exponentially in the system size, as is the case for fidelities; hence, we expect the training of this ansatz to be more stable for larger system sizes and to require less hyperparameter tuning (especially with respect to the parameter initialization). However, the QMPO ansatz is also computationally more demanding: e.g., computing the expectation value in Eq. (S.13) scales as χ^4, whereas the calculation of the fidelity in the QMPS only scales as χ^3 (assuming that the bond dimensions of the MPS and MPO are equal and uniform).
In Sec. S.3 A 2 we report results obtained with the QMPO ansatz and compare it with other architecture choices.

S.3. DETAILS OF THE CONTROL STUDIES
Parameters related to the RL environment and the spin systems of each control study can be found in Tab. II.
A. Universal state preparation from arbitrary initial quantum states

Single-particle control
To provide another benchmark of the QMPS framework, we test it in a single-particle control setting. Our goal is to prepare a specific state (here chosen to be the spin-up state) from any other single-particle state. We translate this setup to the many-body regime by considering N = 64 spins uniformly polarized in one direction on the Bloch sphere, which can be represented exactly by an MPS of bond dimension χ_ψ = 1. Note that an arbitrary single-particle spin state can always be expressed as |ψ⟩ = cos(θ/2)|0⟩ + e^{iφ} sin(θ/2)|1⟩, where 0 ≤ θ ≤ π and 0 ≤ φ < 2π.
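Uniform sampling over the Bloch sphere and the product-state fidelity relation F = F_sp^N can be sketched as follows (the sampling routine is our own minimal implementation); note that it also reproduces the quoted correspondence F*_sp = 0.9995 ↔ F* ≈ 0.97 for N = 64:

```python
import numpy as np

rng = np.random.default_rng(4)

def random_bloch_state():
    """Uniformly sample a single-qubit state on the Bloch sphere:
    |psi> = cos(theta/2)|0> + e^{i phi} sin(theta/2)|1>."""
    phi = rng.uniform(0.0, 2 * np.pi)
    theta = np.arccos(rng.uniform(-1.0, 1.0))  # uniform in cos(theta)
    return np.array([np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)])

def sp_fidelity(a, b):
    return abs(np.vdot(a, b)) ** 2

target = np.array([1.0, 0.0])  # spin-up target state (theta = 0)
psi = random_bloch_state()

# Many-body fidelity of N = 64 uniformly polarized, uncorrelated spins:
F_mb = sp_fidelity(psi, target) ** 64
```

Sampling cos(θ) uniformly, rather than θ itself, is what makes the distribution uniform on the sphere.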
We train a QMPS agent starting each episode from a uniformly sampled state on the Bloch sphere, with a fixed single-particle fidelity threshold of F*_sp = 0.9995 (many-body fidelity F* ∼ 0.97), and an action set that is composed only of single-particle rotations A = {X̂, −X̂, Ŷ, −Ŷ, Ẑ, −Ẑ}. Figure S3 shows the achieved fidelities between the final and target state for different initial states, represented by the angles θ and φ. We find in all cases that the QMPS agent is able to successfully reach the fidelity threshold and hence is capable of performing universal single-particle state preparation. We also plot the protocol length starting from each initial state, which, as expected, increases as the distance between the initial and the z-polarized target state (θ = 0) becomes larger [Fig. S3(a)].

Figure S3. Universal single-particle control - Upper panel: Single-particle fidelity between the final and z-polarized target state (θ = 0) as a function of the initial state parameters θ and φ.

Universal ground state preparation for N = 4 spins
In Fig. S4(a) we provide the learning curves of the first QMPS-1 agent, which was trained to prepare the Ising ground state with a many-body fidelity threshold of F* ∼ 0.85 [Sec. III A]. When testing the trained QMPS-1 agent on a set of 10^3 random initial states, in 99.8% of instances the fidelity threshold is attained within the allowed 50 steps. In Fig. S4(b) we show the corresponding learning curves of a QMPO agent [see Sec. S.2 C for details on the architecture and optimization procedure] trained on the same problem. For 10^3 randomly sampled initial states, we reach a success rate of 100%, which suggests that the QMPO provides an alternative, expressive ansatz for the state preparation scenario at hand.
Next, we investigate how the trainability and the final success rates of the QMPS-1 agent depend on the architecture choice. For comparison, the success probability of 99.8% and the learning curves of Fig. S4 were obtained with a hybrid MPS+NN architecture where the NN contained two hidden layers, each of dimension 100. Figure S5(a) instead shows the learning curves for a hybrid MPS+NN architecture with a single hidden layer of dimension 100. In this case, the agent is only able to successfully prepare the target state for 97.9% of the random initial states, which suggests that a deeper NN can indeed increase the expressiveness of the overall ansatz. Figure S5(b) displays the learning curves of an agent composed of an MPS followed by a linear layer. The average fidelity converges to a value below the fidelity threshold and the final success rate is only 35.9%. This indicates that, for the task of universal four-qubit control, the non-linearities in the NN greatly increase the expressiveness and are required for successful training. Finally, in Fig. S5(c) we plot the learning curves of a QMPS architecture consisting only of an MPS. We found that training becomes unstable across all considered hyperparameters and random seeds of the optimization; the final success probability is 1.6%. Note that under certain simplifications of the control problem (e.g., a smaller fidelity threshold, or a restriction to real initial wave functions), we were able to successfully train an MPS-only architecture. However, the average episode lengths were in general larger and hence less optimal than the ones we obtain from the hybrid MPS+NN ansatz.
The small Hilbert space dimension of the N = 4 qubit control problem has the advantage that we can compare the QMPS framework to an RL training scheme in which we optimize a conventional feed-forward NN on the full wave function data. To that end, we simulate the quantum state exactly, i.e., without making use of the MPS formalism, and feed the full 2^4-component wave function vector into a NN with two hidden layers of dimension 100 each. Since the wave function components are complex-valued, we double the size of the NN input vector (2 × 2^4 = 32) to account for the real and imaginary parts of the wave function. Figure S5(d) shows the resulting learning curves of the best performing Q-network, which we choose after a coarse grid-like hyperparameter search and out of three different random seeds of the optimization. The final success rate is evaluated to 94.3% and is thus substantially lower than the success probability of 99.8% we obtained from the hybrid QMPS architecture that involves a NN of the same size. Doubling the NN hidden dimensions of the Q-network to 200 gave rise to a success rate of 96.8% instead. Hence, even for small system sizes, the QMPS ansatz can lead to a performance enhancement compared to the more traditional NN-only architecture. Note that the QMPS ansatz for the N = 4 qubit problem represents only a small computational overhead over the NN-only ansatz; in fact, the number of optimizable MPS parameters for this problem computes to 592, while the number of parameters contained in the NN is 14612 (hidden dimension of 100) or 49212 (hidden dimension of 200). This suggests that the MPS ansatz indeed represents an efficient and natural architecture when learning from quantum states and can be beneficial even in the limit of small system sizes where an MPS description is usually not necessary.
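The quoted NN parameter counts can be reproduced with a one-line counting function, assuming a fully connected network with biases and 12 output Q-values (the number of actions is our inference from the counts, not stated explicitly here):

```python
def dense_params(sizes):
    """Weights + biases of a fully connected network with the given layer sizes."""
    return sum(m * n + n for m, n in zip(sizes[:-1], sizes[1:]))

n_in = 2 * 2 ** 4  # real + imaginary parts of the 4-qubit wave function
small = dense_params([n_in, 100, 100, 12])  # 14612 parameters
large = dense_params([n_in, 200, 200, 12])  # 49212 parameters
```

Both values match the numbers quoted in the text, confirming the assumed layer layout.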
Finally, we provide additional results obtained when training a QMPO agent with a modified reward function: instead of using the log-fidelity between the target |ψ*⟩ and the current state |ψ_t⟩, we compute the energy, i.e., the expectation value of the Ising Hamiltonian H with respect to the current state, E_t = ⟨ψ_t|H|ψ_t⟩. The reward is then defined as r_t = (E_0 − E_t)/N, where E_0 is the true ground state energy of H. Note that this reward definition might be advantageous in situations where the target state is unknown or the training is performed directly on a quantum device. Figure S6 shows the learning curves obtained for two different energy thresholds (E_0 − E*)/N of −0.1 (a) and −0.15 (b), respectively. To benchmark the final performance of the QMPO agents, we again sample M = 10^3 random initial states and find that for 98.9% (a) and 99.7% (b) of those instances the energy threshold is surpassed when acting with the optimized QMPO protocols. To compare these results with the ones attained for the log-fidelity reward, we also compute the average final fidelity F̄ = M^{-1} Σ_{i=1}^{M} |⟨ψ*|ψ_T^{(i)}⟩|² at the end of each episode, which yields F̄ = 0.88 (a) and F̄ = 0.79 (b). For comparison, the fidelity threshold used in Fig. S4 was set to F* ∼ 0.85.
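A minimal sketch of this energy-based reward for a two-spin transverse-field Ising Hamiltonian (toy parameters J = −1, g_x = 1; not the paper's N = 4 setup):

```python
import numpy as np

# Two-spin transverse-field Ising Hamiltonian H = J ZZ + gx (X1 + X2).
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.array([[1.0, 0.0], [0.0, -1.0]])
I2 = np.eye(2)
J, gx, N = -1.0, 1.0, 2
H = J * np.kron(Z, Z) + gx * (np.kron(X, I2) + np.kron(I2, X))

E0 = np.linalg.eigvalsh(H)[0]  # true ground-state energy (= -sqrt(5) here)

def reward(psi):
    """Energy-based reward r = (E0 - E)/N: non-positive, and zero
    exactly at the ground state."""
    E = np.real(psi.conj() @ H @ psi)
    return (E0 - E) / N

ground = np.linalg.eigh(H)[1][:, 0]
```

Unlike the log-fidelity reward, this quantity needs no knowledge of the target wave function, only measurable energies, which is what makes it suitable for training directly on hardware.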

B. Preparation of a polarized product state from paramagnetic ground states for N = 32 spins
This section provides further details on the case study presented in Sec. III B.
To speed up training, we restrict the action space for this control setup to the 7 actions A = {Ŷ, Ẑ, −Ẑ, X̂X̂, −X̂X̂, ŶŶ, −ŶŶ}. This action set was determined by first training on a smaller system size (N = 8) using all actions and then selecting only those that appear in the final optimal protocols.
To get an intuition for the difficulty of this control setup, we proceed as follows: (i) we demonstrate that the initial states are sufficiently far (in the Hilbert space distance) from the target state, e.g., by computing the fidelity as a function of g_x, cf. Fig. S7 [dashed lines]; this corresponds to a protocol where the agent does not take any action. (ii) An alternative protocol can be produced by noticing that the initial states are paramagnetic, while the target is a z-polarized state; thus a π/2-rotation about the y-axis presents a good candidate for the optimal protocol [it is indeed optimal in the limit g_x → ∞]. The corresponding fidelities are shown in Fig. S7 [solid lines]. Compared to these fidelities, the threshold for the RL agent is given by the horizontal dashed line; it gives a lower bound on the performance of the QMPS protocols. Notice that the QMPS agents are able to considerably improve on the initial fidelity and outperform the trivial Y-rotations for the considered range of transverse field values.
The QMPS agent was trained to prepare a specific target state (the z-polarized state) starting from a class of Ising ground states. In principle, the obtained protocols can be inverted to achieve the opposite, i.e., to prepare any paramagnetic Ising ground state from the z-polarized state. This is often the objective in quantum computing or simulation tasks, where the system starts out in a simple product state and is then brought into the state that has to be investigated or that encodes the solution to a problem. With a single trained QMPS agent one can generate optimal controls that, when reversed, prepare a variety of different states that can then be used for computation. Note, however, that the final state reached by the original QMPS protocol does not exactly coincide with the target state, since we do not achieve a perfect fidelity of unity. It is therefore not clear whether the inverse protocol, when starting from the exact target state, prepares the original initial states with an equally high fidelity or whether it does considerably worse. In Fig. S8 we provide the achieved single-particle fidelities when preparing Ising ground states from the z-polarized state by reversing the optimal QMPS protocols, and compare them to the fidelities of the original state preparation routine [cf. Fig. 3 of Sec. III B]. We find that the fidelities do not differ significantly, which shows that inverse state preparation using the QMPS protocols is possible in this particular control scenario.
In Fig. S9 we display three optimal QMPS protocols starting from initial ground states at g_x = 1.01, 1.1, 1.3, respectively. Interestingly, in all three cases the agent learns to initially apply a π/2-rotation about the y-axis, which is decomposed into three consecutive protocol steps since we fix the duration of each applied unitary to be δt+ = π/12. As discussed in the main text, the agent is able to successfully prepare the target state also for initial ground states outside of the training interval, i.e., for g_x > 1.1. Note, however, that the predicted Q-values of the QMPS agent can be quite different from the true return when tested outside of the training interval; yet the policy learned by the agent can still produce meaningful optimal protocols. In the bottom panels of Fig. S9 we show the half-chain von Neumann entanglement entropy S_ent of the encountered states when evolving according to the optimal protocols. The entanglement entropy stays small and hence allows the time-evolved system to be simulated with a relatively small bond dimension χ_ψ (for training we set χ_ψ = 16). Note, however, that the entanglement entropy is not fully reduced to zero at the end of the protocol, which is especially the case for states close to the critical point. This can be attributed to the logarithmic scaling of the entanglement entropy in the initial critical state with the subsystem size N/2; as a consequence [Fig. S9, middle panel], the short protocol is not capable of destroying all long-range correlations, which persist in the final state. The prepared state, therefore, has a finite many-body overlap with the separable target state [see orange curves/axis in Fig. S9 showing the many-body fidelity], which explains the discrepancy in the final entanglement entropies.

Figure S7. Transverse-field Ising control - Single-particle fidelity (blue, left y-axis) and many-body fidelity (orange, right y-axis) between the target z-polarized state and an initial Ising ground state for different values of the transverse field g_x (dotted curves). The solid lines show the respective fidelities of the target state and the initial ground state after applying a single-particle Ŷ-rotation with δt− = π/4. The QMPS agent is able to improve on the trivial rotation and prepare the target state with higher single-particle fidelities.

Figure S8. Transverse-field Ising control - Final single-particle fidelities when starting from Ising ground states with transverse field values g_x and preparing the z-polarized target state (blue curve). The orange points show the fidelity when reversing the state preparation scenario, i.e., one starts from the polarized state and applies the inverse QMPS protocol to reach the corresponding Ising ground state. The fidelities achieved by the reversed protocol are comparably high; therefore they justify that, in this case study, the trained QMPS agent can be employed for the inverse state preparation task as discussed in the main text.

Figure S10. Transverse-field Ising control - Analysis of the optimal QMPS protocol obtained when starting from the initial ground state at g_x = 1.01. Shown are the local spin-spin correlations ⟨σ̂_i σ̂_{i+1}⟩ and the local magnetization ⟨σ̂_i⟩ along each direction at the center of the spin chain (i = 15). The yellow shaded segment indicates the initial Ŷ-rotations which align the spins along the z-axis; the blue shaded area points to a generalized Euler-angle-like many-body rotation which reduces unwanted correlations and disentangles the state. N = 32 spins. See Video 1.
In Fig. S10 we plot the local expectation value of the magnetization along each direction and the local spin-spin correlations at the center of the chain, which reveal the role of each unitary occurring in the protocol sequence corresponding to the state preparation task shown in Fig. S9(a) [g_x = 1.01]. As already mentioned, the first three Ŷ-rotations align the state along the z-axis, bringing the expectation values of the X̂ and Ŷ components to zero. The role of the remaining unitaries is to decrease unwanted correlations and consequently to disentangle the state. Note that the ⟨X̂X̂⟩ and ⟨ŶŶ⟩ correlations approach zero in an intertwined manner, which is caused by an intricate combination of X̂X̂ and ŶŶ disentangling gates and single-particle Ẑ-rotations. The latter are important for finely realigning the state after each disentangling operation, thereby preventing the correlations from diverging [see Video 1].
Next, we analyze how well the optimal protocols perform on systems with a different number of spins. To this end, we test the protocols optimized for N = 32 spins (QMPS_32) on N = 8, 64 spin systems, and vice versa: protocols obtained after training on N = 8 spins (QMPS_8) are tested on the larger N = 32, 64 systems [Fig. S11]. We find that the fidelity threshold can still be reached when applying the QMPS_32 protocols to smaller system sizes. However, the opposite is not true: the QMPS_8 protocols, in general, give rise to fidelities below the threshold when tested on the N = 32, 64 spin systems. This system-size (in)dependence suggests that, for this particular control setup, one can devise suitable pretraining techniques for large system sizes based on the behavior of agents successfully trained on smaller systems. Moreover, the QMPS agent tends to find optimal protocols which appear robust to changes in the system size, and the control task likely admits a solution also in the thermodynamic limit.
Finally let us comment on the quantum state entanglement and the related MPS bond dimension. During the initial exploratory stage of training, random unitaries are applied to the system leading in general to a ballistic growth of the entanglement entropy. In this case the fixed bond dimension of χ ψ = 16 is not always suffi- Figure S13. Self-correcting mixed-field Ising control -Many-body fidelity F (left) and single-particle fidelity Fsp (right) between the critical-region target state and the initial ground states at transverse and longitudinal field values gx, gz. The trained QMPS agent prepares the target state with many-body and single-particle fidelities F > 0.61, Fsp > 0.97 respectively, and therefore considerably improves on the initial fidelity values. N = 16 spins. Figure S14. Self-correcting mixed-field Ising control -Half-chain von Neumann entanglement entropy of final states during training. The dark green curve denotes the average over 200 episodes and the orange dashed line indicates the entanglement entropy of the critical-region target state. The entropies decrease as learning progresses and converge to a value close to that of the target state. Inset: Truncation error averaged over a window of 3000 transitions (dark purple). Training was performed with a quantum state bond dimension of χ ψ = 16; for testing χ ψ = 32 was used. N = 16 spins. cient to capture the evolved quantum states giving rise to the large truncation errors, shown in Fig. S12. These truncation errors, however, naturally decrease as training progresses and the action selection becomes more deterministic while the Q-function converges close to the optimal one. While for training a bond dimension of χ ψ = 16 was used, testing was performed with χ ψ = 32 for which the truncation errors vanished to machine precision. 
This check ensures the stability of the QMPS protocols to changes in the accuracy of the MPS approximation.

This section provides further details on the case study presented in Sec. III C.
To put the fidelities achieved by the QMPS agent, reported in Fig. 4(a), into context, we show in Fig. S13 the fidelities between the initial mixed-field Ising ground states and the critical-region target state before any controls are applied. The QMPS agent is able to reach the target with single-particle fidelities Fsp > 0.97 (corresponding to a many-body fidelity of F ∼ 0.61).
The half-chain von Neumann entanglement entropy of the quantum states during training is displayed in Fig. S14. Similar to the previous case study, during the initial stages of training the encountered states are highly entangled due to the randomness of the action selection. Once the agent learns how to reliably prepare the target state, the entropies decrease and are scattered closely around the target state value [orange dashed line]. We emphasize that critical states possess a logarithmic correction to the area-law of entanglement, which makes their preparation a non-trivial task. For training we used a relatively small bond dimension of χ ψ = 16 which led to the finite truncation errors shown in the inset of Fig. S14. However, training is still successful and all subsequent tests were performed by setting χ ψ = 32 which did not affect the QMPS protocols or the final achieved fidelities.
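The half-chain entanglement diagnostic used above can be reproduced with a few lines of linear algebra. The sketch below computes the von Neumann entropy across the middle cut from the Schmidt values of the bipartition; it uses a dense state vector as a stand-in for the MPS representation of the text, and the function name is ours, not the paper's:

```python
import numpy as np

def half_chain_entropy(psi, n):
    """Von Neumann entanglement entropy across the middle cut of an
    n-qubit state vector `psi` (a dense stand-in for the MPS used in
    the text).  The Schmidt values are the singular values of the
    state reshaped into a 2^(n/2) x 2^(n/2) matrix."""
    half = n // 2
    s = np.linalg.svd(psi.reshape(2**half, 2**(n - half)), compute_uv=False)
    p = s**2
    p = p[p > 1e-12]                 # discard numerical zeros
    return -np.sum(p * np.log(p))    # natural-log convention

# Example: the 4-qubit GHZ state has two equal Schmidt values,
# so its half-chain entropy is ln 2 ~ 0.693.
n = 4
ghz = np.zeros(2**n)
ghz[0] = ghz[-1] = 1 / np.sqrt(2)
print(half_chain_entropy(ghz, n))  # -> ln 2 ~ 0.6931
```

For an actual MPS in canonical form, the same entropy follows directly from the Schmidt values stored on the central bond, without ever building the dense vector.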
Next, we provide further examples showcasing the ability of the QMPS agent to self-correct its protocols on-the-fly when the time evolution is noisy or perturbed. In Fig. S15, we consider two different initial ground states, one within the training region (J = −1, gx = 1.2, gz = 0.2) [(a)-(d)] and one outside (J = −1, gx = 1.0, gz = 1.0) [(e)-(f)], and analyze the success of state preparation subject to different protocols. In the upper panels of each subplot in Fig. S15 we display the actions of the optimal protocol [top] and one exemplary protocol that has been perturbed [bottom]. The lower panel always shows the single-particle fidelities at each step of the protocols.
First, we take the optimal QMPS protocols and modify them at time step t = 5 (t = 15) by taking 5 random suboptimal actions instead. Afterwards, the system is again evolved according to the greedily acting QMPS agent, giving rise to 6 distinct trajectories (the magenta line corresponds to the unperturbed one). In most cases, the agent is able to correct for the mistake by adapting the subsequent protocol and reaches the fidelity threshold nonetheless. However, in one instance [Fig. S15(a)], the resulting QMPS protocol does not converge and the agent fails to prepare the target state. Hence, the agent is not able to generalize to states generated by this particular protocol sequence, and likely predicts wrong Q-values that eventually steer it away from the target state. This is, however, not surprising, since the agent has only been trained on states within a small part of the many-body Hilbert space and therefore cannot be expected to devise successful protocols from arbitrary states.

Figure S16. Self-correcting mixed-field Ising control - Time dependence of the single-particle fidelity when adding Gaussian random noise with standard deviations σ = 0.01 (a)-(b) and σ = 0.05 (c)-(d) to the time step duration δt±, starting from an initial ground state at J = −1, gx = 1.2, gz = 0.2. For each figure, we sampled 100 different trajectories (each corresponding to a different random seed). The original, unperturbed QMPS protocol is always indicated by the magenta line. The single-particle fidelity threshold of F*sp = 0.97 (F* ∼ 0.61) is denoted by a horizontal gray dashed line, and the number of steps required to reach the threshold in the noise-free case is indicated by a vertical black dotted line. In (a),(c) we always evolve according to the fixed, unperturbed QMPS protocol obtained from the noise-free simulation. The percentage of successfully prepared target states within 50 steps is 90% in the case of weak noise (σ = 0.01) and 0% in the case of strong noise (σ = 0.05). In (b),(d) we use the adaptive QMPS agent to generate different protocols for each distinct run (compare to Fig. S15(a)-(d)). The respective success percentages are 100% (σ = 0.01) and 74% (σ = 0.05). Hence, in both instances, the self-correcting agent is able to improve over the fixed, noise-free protocol.
Note that for the initial state outside of the training interval [(e),(f)], the perturbation of the original QMPS protocol gave rise to a shorter protocol, i.e., the fidelity threshold is reached in fewer steps. Hence, in this case the original protocol is not a local minimum of the control landscape. However, this is not surprising, since the QMPS agent has not been trained on this initial state and therefore the predicted Q-values are not guaranteed to have converged to the true optimal values. Finally, we study the robustness of the QMPS agent to a randomized time step duration δt± by adding white Gaussian noise with standard deviation σ = 0.01, 0.05 to it [Fig. S15(c),(d),(g),(h)]. We evolve the system with 5 different random seeds, again giving rise to 6 distinct trajectories (the magenta line corresponds to the unperturbed one). For each of the 5 randomized time evolutions, the QMPS agent has to eventually adapt its protocol by performing a different sequence of actions. It successfully prepares the target state in all but one case, which falls outside the training region [Fig. S15(h)]. Note here as well that in some instances the agent is able to devise shorter control protocols compared to the original unperturbed ones. Hence, the randomized step duration can have a positive effect on the control problem, allowing for faster state preparation protocols.
In Fig. S16 we compare the achieved fidelities in the presence of noise when we evolve with the adapted protocols (bottom) and with the original, noise-free protocol (top) starting from an initial state J = −1, g x = 1.2, g z = 0.2 within the training region. We again consider Gaussian random noise with standard deviations σ = 0.01, 0.05 and repeat the time evolution with 100 different random seeds for each of the two cases. When the noise is weak (σ = 0.01), the fixed, unperturbed protocol gives rise to qualitatively similar fidelity curves regardless of the seed (see Fig. S16(a)). We found that for 90 out of the 100 runs, the protocol successfully prepares the target state, that is, the fidelity threshold is reached within 50 steps. However, in the cases where the fidelity threshold is not surpassed, the final fidelities are close to the target value of F * sp = 0.97. If we instead allow the QMPS agent to adapt to the perturbed states, we obtain the fidelity curves in Fig. S16(b). The resulting protocols give rise to qualitatively different trajectories that diverge more towards later time steps (compare to Fig. S15(a)-(d)). However, in this case the agent successfully prepares the target state for each of the 100 runs. Hence, for sufficiently weak noise strengths, the original unperturbed protocol is expected to give qualitatively similar results to the noise-free dynamics. However, even in this example the self-correcting agent has a measurable advantage over the fixed protocol.
This situation changes when we consider the case of strong noise (σ = 0.05) as shown in Fig. S16(c)-(d). The fixed, unperturbed protocol leads to diverging fidelity curves already after a few steps (top). In fact, the success probability for the simulated 100 runs is 0. In contrast, the adaptive agent prepares the target state successfully for 74 out of the 100 instances within the 50 allowed time steps. Moreover, the agent clearly tries to steer the quantum states towards high fidelity regions. This example therefore demonstrates that the self-correcting agent is able to improve over the original, noise-free protocol when the dynamics is being perturbed.

S.4. NISQ IMPLEMENTATION OF THE QMPS FRAMEWORK
A. QMPS to quantum circuit mapping

In the following, we illustrate the QMPS-to-circuit mapping on the example of an N = 4 spin/qubit system. The QMPS state |θ⟩Q is represented as a matrix product state with tensors A^[i], i = 1, . . . , 4, which additionally carry a feature vector index. Our goal is to rewrite the QMPS state as a quantum circuit |θ⟩Q = Uθ|0⟩, where the state-preparation unitary Uθ is composed of several gates, Uθ = G4 · · · G1.
First, we transform the QMPS into left-canonical form via successive QR or singular value decompositions. In principle, the resulting tensors A^[i] also depend on the feature index after performing the canonicalization.
However, in what follows we omit the feature index and assume that all subsequent steps are performed for each of its values separately. The quantum circuit mapping of |θ⟩Q is depicted in Fig. S17(a). We can interpret the rightmost tensor A^[4] as a single-qubit unitary, i.e., G4 = A^[4]j4_α3, as it satisfies Eq. (S.18). Similarly, we can rewrite the adjacent tensor A^[3]j3_α2α3, with dimensions 4 × 2 × 2, as a two-qubit unitary after reshaping the index α2: G3 = A^[3]j3α3_α2α2′. The next tensor A^[2]j2_α1α2 represents an isometry with input dimension 2 and output dimensions 4 × 2. Hence, we need to extend the columns of A^[2]j2α2α2′_α1,0,0 by padding them with the (2^3 − 2) orthonormal vectors X in the kernel of A^[2]†. The resulting square matrix G2 = [X A^[2]] is then chosen as the three-qubit unitary. Finally, we apply the same steps to the remaining isometry A^[1]j1_α1, i.e., we pad the columns of A^[1]j1α1_0,0 with the 2^2-dimensional kernel of A^[1]† and interpret the resulting matrix as the two-qubit gate G1.

Figure S18. Universal four-qubit control - (a),(b) We sample 1000 random initial states and apply the QMPS circuit framework with a varying number of measurement shots. In (a) we display the percentage of runs in which the target state is successfully reached (i.e., the fidelity threshold of F* ∼ 0.85 is surpassed after at most 50 protocol steps). The success rate under exact computation (without sampling) is shown as a black dashed line. The success probability when acting randomly is zero (gray dotted line). We provide both the results for the full χ = 4 QMPS circuit (purple solid line) and the truncated χ = 2 QMPS (orange dash-dotted line). We find that as few as ∼ 500 shots are sufficient for reaching success rates close to 100%. (b) The corresponding average number of required protocol steps T̄ for reaching the fidelity threshold. The standard deviation is indicated by the shaded areas. The black dashed line again corresponds to the average value computed via exact techniques, and the gray dash-dotted line indicates the maximum number of allowed episode steps (50). (c),(d) The success rate and average protocol length when adding an amplitude and phase damping noise channel with error parameter p after each gate. The noise parameter for all single-qubit gates is always fixed to p1 = 10^−4. The number of measurement shots is set to 4096. For error parameters p < 10^−3 we are able to retain high success probabilities. (e),(f) The success percentage and average protocol length when adding (coherent) Gaussian random noise with standard deviation σ to the time step duration δt± of each action. For comparison, the original, unperturbed time step sizes δt+ = π/8 and δt− = π/13 on which the agent was trained are indicated by the vertical dotted lines. This type of noise tests the ability of the agent to self-correct protocols in an online fashion.
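The kernel-padding step described above (e.g., G2 = [X A^[2]]) can be sketched numerically: given an isometry, we complete it to a unitary by appending an orthonormal basis of the orthogonal complement of its column space, i.e., of the kernel of its adjoint. The sketch below is a minimal real-valued illustration with a helper name of our choosing, not the authors' implementation:

```python
import numpy as np

def complete_to_unitary(V):
    """Complete an isometry V (orthonormal columns, shape d x m with
    m < d) to a d x d unitary G = [X V], where the columns of X form
    an orthonormal basis of the orthogonal complement of col(V),
    i.e. of the kernel of V^dagger."""
    d, m = V.shape
    # QR of [V | I] yields an orthonormal basis whose first m columns
    # span col(V); the remaining d - m columns span its complement.
    Q, _ = np.linalg.qr(np.hstack([V, np.eye(d)]))
    X = Q[:, m:d]
    return np.hstack([X, V])

# Example: a 4 x 2 isometry, as arises from a reshaped MPS tensor.
V = np.array([[1., 0.], [0., 0.], [0., 1.], [0., 0.]])
G = complete_to_unitary(V)
assert np.allclose(G.T @ G, np.eye(4))  # G is unitary (real case)
assert np.allclose(G[:, 2:], V)         # V embedded in the last columns
```

Acting with G on basis states whose padded registers are fixed to |0⟩ then reproduces the action of the original isometry, which is exactly the freedom exploited in the circuit construction above.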
Using the above mapping, an MPS circuit with bond dimension χ = 2^n always contains at least one (n + 1)-qubit gate. Thus, the N = 4 QMPS with bond dimension χ = 4 results in a circuit including a three-qubit gate. However, the native gates realized in most present-day quantum computers contain at most two-qubit unitaries. Therefore, gates acting on more than two qubits first have to be decomposed into two- and single-qubit gates. Performing this decomposition exactly is usually expensive, requires the use of optimization techniques, and often leads to very deep circuits nonetheless. Given the short coherence times and large error rates of current quantum devices, it therefore quickly becomes infeasible to execute MPS circuits of bond dimension χ > 2. Hence, we need alternative circuit mappings that give rise to at most two-qubit gates in the final circuit. The simplest approach is to truncate the given QMPS to a bond dimension χ = 2 MPS (see Fig. S17(b) for the corresponding circuit). However, if the truncation errors are too large, the resulting circuit will no longer be an accurate description of the true quantum state. Several approximate methods have been proposed to bridge this gap and prepare high-fidelity states while restricting to the use of only two-qubit gates [75][76][77][78][79][80][81]. Note that all of these approaches can also be applied to the QMPS ansatz. For the remainder of this work, we will consider the previously described exact mappings of the full χ = 4 and truncated χ = 2 QMPS as shown in Fig. S17.
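Truncating a QMPS to bond dimension χ = 2, as mentioned above, amounts to keeping only the largest Schmidt values on each bond. A minimal single-bond sketch (the helper name and the example spectrum are ours):

```python
import numpy as np

def truncate_bond(theta, chi_max):
    """SVD-truncate one MPS bond: keep the chi_max largest Schmidt
    values of the two-site matrix `theta`, renormalize the state, and
    return the discarded weight (the truncation error)."""
    U, S, Vh = np.linalg.svd(theta, full_matrices=False)
    keep = min(chi_max, len(S))
    discarded = float(np.sum(S[keep:]**2))
    S_kept = S[:keep] / np.linalg.norm(S[:keep])  # renormalize
    return U[:, :keep], S_kept, Vh[:keep, :], discarded

# Example: a bond with Schmidt spectrum (0.8, 0.5, 0.3, 0.1),
# truncated to chi = 2; the discarded weight is 0.3^2 + 0.1^2 = 0.10.
theta = np.diag([0.8, 0.5, 0.3, 0.1])
U, S, Vh, err = truncate_bond(theta, 2)
print(S, err)
```

If the discarded weight is small on every bond, the χ = 2 circuit of Fig. S17(b) remains a faithful approximation; otherwise the more elaborate two-qubit-gate constructions cited above become necessary.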

B. Additional results on the noise-robustness
This section reports further results for the QMPS circuit framework introduced in Sec. IV, tested on the universal ground state preparation task of Sec. III A.
First, we investigate how the number of measurement shots used to sample the fidelity in Eq. (4) affects the performance of the QMPS protocols under an ideal (noise-free) simulator. To that end, we sample 1000 random initial states and compute the percentage of successfully prepared states (i.e., those runs for which the fidelity threshold is reached within 50 protocol steps). Moreover, we also record the corresponding average protocol length T̄ and show the results in Fig. S18(a)-(b). Surprisingly, as few as 500 shots are already sufficient to reach success rates close to unity. There are several reasons for this robust performance: First, in the cases where the agent predicts a wrong action due to sampling noise, it can easily correct for the mistake in subsequent time steps, since it has learned to prepare the target state from any quantum state. Second, although the quantum circuit output is noisy, we find that the subsequent neural network does not amplify the noise and still outputs reasonable values. Finally, since the optimal action is always determined by the argmax of the Q-values, noise in the output does not affect the chosen actions as long as its magnitude is sufficiently small. Hence, we can achieve high success probabilities even in the presence of sampling noise. This stands in contrast to policy gradient techniques, where the network itself outputs the actions, so a noisy output can directly lead to faulty protocols.
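The argmax-robustness argument can be illustrated with a toy numerical experiment: we perturb a set of hypothetical Q-values with binomial sampling noise (treating each value as a probability estimated from a finite number of shots, purely for illustration; the values themselves are invented) and check how often the greedy action changes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical exact Q-values for the 12 actions; the best action
# leads by a finite gap, as expected once the Q-function converges.
q_exact = np.append(np.linspace(0.1, 0.8, 11), 0.9)
best = int(np.argmax(q_exact))

def argmax_stable_fraction(shots, trials=2000):
    """Fraction of trials in which binomial sampling noise on each
    Q-value (estimated from `shots` measurements) leaves the greedy
    action unchanged."""
    q_noisy = rng.binomial(shots, q_exact, size=(trials, 12)) / shots
    return float(np.mean(np.argmax(q_noisy, axis=1) == best))

for shots in (50, 500, 5000):
    print(shots, argmax_stable_fraction(shots))
```

The stable fraction grows rapidly with the shot number: once the statistical error drops below the Q-value gap, the chosen action is essentially always the noise-free one, consistent with the behavior reported in Fig. S18(a).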
The corresponding average protocol length T̄ shown in Fig. S18(b) converges to its value under exact computation (black dashed line) only after about 10^4 shots. Notice that this is still within the feasible regime for many modern quantum devices. Note, moreover, that the performance of the truncated QMPS circuit (orange lines in Fig. S18) is considerably worse, indicating that we indeed require a bond dimension of χ = 4 to faithfully represent the QMPS agent.
Next, we study the effects of a combined amplitude and phase damping noise channel on the performance of the QMPS circuit. Similar to the experiments involving depolarizing noise discussed in the main text, we add the amplitude and phase damping channel after each action gate and QMPS gate, and set the error parameters equal, i.e., p = p_amp = p_phase. Due to the lower error rates of single-qubit gates, we fix the single-qubit gate errors to p1 = 10^−4 and plot the success rate as a function of the two- and three-qubit error parameter p after sampling 1000 different random initial states [Fig. S18(c)]. Figure S18(d) displays the corresponding average number of protocol steps. For error rates p < 10^−3 we are able to reach success probabilities close to unity. However, they quickly deteriorate for larger error parameter values.
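For concreteness, the combined channel can be written in terms of the standard single-qubit Kraus operators of amplitude and phase damping, composed with equal error parameters as in the experiments above (the helper names are ours):

```python
import numpy as np

def amp_phase_damping_kraus(p_amp, p_phase):
    """Kraus operators of a single-qubit amplitude damping channel
    (parameter p_amp) composed with phase damping (p_phase), in
    their standard textbook form."""
    A0 = np.array([[1, 0], [0, np.sqrt(1 - p_amp)]])
    A1 = np.array([[0, np.sqrt(p_amp)], [0, 0]])
    P0 = np.array([[1, 0], [0, np.sqrt(1 - p_phase)]])
    P1 = np.array([[0, 0], [0, np.sqrt(p_phase)]])
    # Apply amplitude damping first, then phase damping.
    return [P @ A for P in (P0, P1) for A in (A0, A1)]

def apply_channel(rho, kraus):
    return sum(K @ rho @ K.conj().T for K in kraus)

p = 1e-3
ks = amp_phase_damping_kraus(p, p)
# Completeness relation: sum_k K^dagger K = 1
assert np.allclose(sum(K.conj().T @ K for K in ks), np.eye(2))
# Acting on |1><1|: a fraction p_amp of the population relaxes to |0><0|.
rho_out = apply_channel(np.diag([0.0, 1.0]), ks)
print(rho_out[0, 0])  # ~ p_amp = 1e-3
```

In the simulations above, this channel is inserted after every two- and three-qubit gate with parameter p, and after every single-qubit gate with the smaller parameter p1.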
Finally, we analyze the robustness of the QMPS agent to coherent gate errors, similar to the discussion in Section III C. Coherent errors arise when the actually executed gate differs from the gate that is supposed to be applied. These errors can often be mitigated by carefully calibrating the device. However, frequent calibration is expensive, and coherent errors therefore usually cannot be fully eliminated. We simulate coherent gate errors for each of the 12 different actions by adding mean-zero Gaussian random noise of standard deviation σ to the time step duration δt±. In contrast to the discussion in Section III C, each action gate is fixed, although the angles of rotation δt are shifted compared to the original step size on which the agent was trained. We again sample 1000 random initial states and show the state preparation success rates for varying standard deviations σ in Fig. S18(e). For each of the 1000 runs, we also use a different random seed when sampling the gate noise. We observe that for standard deviations σ < 0.5, the QMPS agent is still able to self-correct the protocols and reach the target state nonetheless. However, for larger amounts of noise, the agent is no longer capable of reliably preparing the target state for all initial states.
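A toy single-qubit version of this coherent-error model: the rotation angle of a gate is shifted by mean-zero Gaussian noise, and we track the average overlap with the ideal gate. This is an illustrative stand-in for the action gates exp(−i δt H), not the paper's simulation, and the angle π/8 is chosen only because it matches the trained step size δt+:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def rot(theta):
    """Single-qubit rotation exp(-i theta X / 2) about the x axis."""
    return np.cos(theta / 2) * I2 - 1j * np.sin(theta / 2) * X

def avg_overlap(theta, sigma, samples=2000):
    """Mean gate overlap |Tr(U_ideal^dagger U_noisy)| / 2 when the
    rotation angle is jittered by Gaussian noise of std sigma."""
    U = rot(theta)
    eps = rng.normal(0.0, sigma, samples)
    vals = [abs(np.trace(U.conj().T @ rot(theta + e))) / 2 for e in eps]
    return float(np.mean(vals))

theta = np.pi / 8
for sigma in (0.01, 0.05, 0.5):
    print(sigma, avg_overlap(theta, sigma))
```

For rotations about a fixed axis the overlap is simply |cos(ε/2)| in the angle error ε, so small jitter degrades each gate only quadratically; the open-loop protocol nevertheless fails at large σ because these errors accumulate over many steps, which is precisely where the self-correcting agent helps.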
We expect that the performance of the QMPS agent can be further improved by performing additional optimization directly on the quantum device, taking into account the exact noise model and error rates.