Prospects of reinforcement learning for the simultaneous damping of many mechanical modes

We apply adaptive feedback for the partial refrigeration of a mechanical resonator, i.e. with the aim of simultaneously cooling the classical thermal motion of more than one vibrational degree of freedom. The feedback is obtained from a neural network parametrized policy trained via a reinforcement learning strategy to choose the correct sequence of actions from a finite set in order to simultaneously reduce the energy of many modes of vibration. The actions are realized either as optical modulations of the spring constants in the so-called quadratic optomechanical coupling regime or as radiation-pressure-induced momentum kicks in the linear coupling regime. As a proof of principle we numerically illustrate efficient simultaneous cooling of four independent modes with an overall strong reduction of the total system temperature.

The radiation pressure effect of light onto the motion of mechanical resonators has been extensively employed to bring such macroscopic systems towards the quantum ground state [1][2][3][4][5][6][7][8][9][10][11]. In a standard approach, the aim is to isolate a single vibrational mode and bring it to a state where the only relevant motion is given by the zero-point fluctuations. Cold damping is one commonly used technique: one detects motionally induced phase changes in the cavity output and an electronic feedback loop dynamically modifies the cavity drive so as to produce an extra optical damping effect [12][13][14][15][16][17][18][19]. Alternatively, in the good cavity limit where the photon loss rate is smaller than the mechanical frequency, the resolved sideband technique can be implemented by detuning the drive to the cooling sideband [20][21][22][23][24]. As the effect stems from the inherent time delay between the action of the mechanical resonator onto the cavity field and the back-action of light, this can be seen as a sort of automatic cavity-induced feedback. Both techniques were devised for, and have been successfully applied to, single vibrational mode cooling. However, it is interesting to devise an alternative technique that can induce partial to full refrigeration of the mechanical resonator, i.e. simultaneously cool the multitude of vibrational modes into which the thermal energy is distributed. An impediment is that the detected output signal only gives information on a generalized collective quadrature and not on all modes individually. This leads to efficient cooling of some collective mode (for example the center of mass) while other collective modes become dark and remain in a high temperature state. It has been recently pointed out that strategies such as multimode cold damping could in principle lead to sympathetic cooling of many modes via disorder-induced coupling between bright and dark modes 25.
Here, we propose a machine learning approach towards devising a strategy capable of providing refrigeration of the classical motion of a mechanical resonator based on the feedback obtained from the detection of a single optical mode. While the detected optical mode only gives information on a collective generalized quadrature obtained as a linear combination of individual mode displacements, the procedure is optimized such that at any instant in time a compromise is made between efficiently cooling a particular target mode and not affecting the others too much. We provide proof-of-principle multi-mode numerical simulations using a neural network parametrized policy trained by a reinforcement learning algorithm to generate the feedback signal capable of simultaneously extracting thermal energy from four distinct modes of a single mechanical resonator.
Machine learning techniques have recently found various applications in quantum physics, ranging from the identification of phases in many-body systems and the prediction of ground-state energies for electrostatic potentials, to active learning approaches that propose and optimize experimental setup configurations, and to quantum control and quantum error correction [26][27][28][29][30][31][32][33][34]. In particular, a few studies 26,34,35 successfully applied the technique of reinforcement learning with neural networks 36. This approach originates from the idea of letting an intelligent agent that observes its environment choose actions, determined by a given policy, so as to optimize a particular reward and/or minimize a punishment.
We employ such a reinforcement learning technique for optically assisted cooling of the classical thermal state of a multi-mode mechanical resonator system [37][38][39]. The learning technique allows one to acquire a nonlinear function that, taking the fully or partially measured state of the system as an input, chooses a feedback action to be applied to the dynamical system. This function is given by a dense neural network and its training proceeds by trial and error, quantified by an increasing reward that is obtained by successfully reducing the energy of the resonators.
The physical systems considered are depicted in Fig. 1. The mechanical resonator is subject to environmental noise described by a standard Brownian motion stochastic force leading to thermalization at some equilibrium temperature T. The feedback action is implemented via the radiation pressure force, i.e. photon kicks either from one or two sides. The induced damping is straightforward in the two-sided kicking case [illustrated in Fig. 1a]: the read-out of motion is followed by an appropriate kicking action from the side towards which the resonator is moving. However, one-sided kicking [illustrated in Fig. 1b] already suffices, allowing setups such as the cavity optomechanical platform pictured in Fig. 1c. The typically weak free-space photon-phonon interaction can also be drastically enhanced by filtering the action through the high-finesse optical cavity. Such a situation is characterized by a linear coupling of the photon number to the membrane's displacement and has been extensively studied in single mode cooling via cavity time-delayed effects 12 or by implementation of cold damping techniques 12, especially in the bad cavity regime. The membrane-in-the-middle 40-42 scenario in Fig. 1d,e corresponds to a quadratic coupling in displacement, leading to the possibility of optically modulating the mechanical oscillation frequency 42. We describe in Fig. 1e a possible approach for feedback cooling via cavity field detection and neural network assisted feedback.
We will consider the bad cavity case where losses are large compared to the mechanical resonator's vibration frequencies, such that the cavity back-action is negligible. In such a case, the situations described in Fig. 1b,c are physically equivalent, with the difference that in Fig. 1c the action of a single photon is multiplied by a large number roughly proportional to the finesse of the cavity. We also distinguish between a parametric regime with quadratic coupling, implemented in the membrane-in-the-middle setup, and the linear coupling regime, realizable with a single-end-mirror cavity or in free space. First, we analyze the performance of a neural-network-suggested set of actions on the cooling of a single mode via parametric modulation of the oscillation frequency: we describe the shape of the action and numerically show the efficient reduction of energy from the initial thermal distribution. We then apply the technique to the linear cooling of four distinct modes of the resonator and find that a more complex set of actions is required for efficient simultaneous cooling of all four modes (with limitations arising due to the numerical complexity of the simulations).

Model
We consider a membrane resonator with a few modes of oscillation of frequencies ω_j (where j = 1, …, N). We start with a quantum formulation of the system's dynamics aimed at future treatments of cooling in the presence of quantum noise. However, the current formulation aims only at the reduction of classical thermal noise and is therefore obtained by inferring the equivalent classical stochastic equations of motion. The Hamiltonian for the collection of modes is written, in terms of dimensionless position and momentum quadratures q_j and p_j for each independent membrane oscillation mode, as

H = Σ_j (ħω_j/2) (q_j² + p_j²).

The effect of the thermal reservoir can be easily included in a set of equations of motion supplemented with the proper input stochastic noise terms:

q̇_j = ω_j p_j,    ṗ_j = −ω_j q_j − γ_j p_j + ξ_j(t) + F_j(t).

[Figure 1 caption, panel (e): Simultaneous cooling of multiple oscillating modes (inset shows a few drum modes of a dielectric membrane) using feedback generated by a reinforcement-trained policy, encoded in a neural network. While the illustration shows a quadratic membrane-in-the-middle setup, the validity extends to the end-mirror linear setup as well. The outgoing signal from a driven cavity carries information on the collective displacement of all membrane vibration modes. This signal is fed through a neural network and the network's suggested action is implemented as a modulation of the cavity input drive amplitude.]
The parameter γ_j describes the damping of the j-th resonator mode. Its associated zero-averaged Gaussian stochastic noise term, leading to thermalization with the environment, can be fully described by the two-time correlation function

⟨ξ_j(t) ξ_j(t′)⟩ = (γ_j/ω_j) ∫_{−Ω}^{Ω} (dω/2π) e^{−iω(t−t′)} S_th(ω),

where Ω is the frequency cutoff of the reservoir and the thermal noise spectrum is given by S_th(ω) = ω[coth(ħω/(2k_B T)) + 1]. For sufficiently high temperatures k_B T ≫ ħω_j, the correlation function becomes a standard white noise input with delta correlations both in frequency and time. Specifically, one can approximate ⟨ξ_j(t) ξ_j(t′)⟩ ≃ γ_j (2n̄_j + 1) δ(t − t′), where the occupancy of each vibrational mode is given by n̄_j = [exp(ħω_j/(k_B T)) − 1]^{−1}. For numerical simulations we generate a stochastic input noise as a delta-correlated Wiener increment with variance proportional to the integration time-step (see Methods) and follow an approach described in ref. 43. For consistency we check (in the Methods) that the thermal bath indeed correctly describes the expected thermalization of an initially cold oscillator towards the equilibrium temperature T at a rate given by γ.
The momentum kicks selected by the network are encompassed in the action of the force terms F_j(t). This can be realized for example by the radiation pressure effect of a laser beam, modulated by a device like an AOM (acousto-optic modulator). Here, the forces F_j and F_j′ acting on different resonator modes j ≠ j′ differ only by a constant multiplication factor, as they are all obtained from the same quantity (the output field).
To amplify the effect of the action force onto the mechanical resonator one can utilize optical cavities. A cavity also allows control over the coupling by placing the membrane either in a node (quadratic coupling) or an anti-node (linear coupling) of the cavity mode. The Hamiltonian is now modified by the addition of the free cavity mode and of the optomechanical interaction terms. The amplification of the light field amplitude can be seen from the relation E(t) = √(2κ_L P_0(t)/ħω_l) connecting the driving amplitude E(t) to the input laser power P_0(t) through the left mirror with losses at rate κ_L. For high-finesse cavities photons perform many round trips before leaking out through the mirrors, resulting in a large momentum transfer onto the mirror: this can be seen by taking the limit of small κ_L, resulting in a large value of a(t) for a given P_0(t). Notice that we considered a double-sided cavity with left κ_L and right κ_R decay rates adding to the total loss rate κ = κ_L + κ_R. The coefficients g_j^(1) and g_j^(2) are the linear and quadratic per-photon optomechanical coupling rates corresponding to the two situations depicted in Fig. 1c,d, respectively. While the cavity field amplitude inherently depends on the displacement of the mechanical mode, we will assume the unresolved sideband regime where this dependence is weak. Moreover, we are interested in the classical problem, i.e. in simulating the proper set of actions that results in the shrinking of an initially large thermal distribution for the total energy of the oscillator. To this end we only consider the trivial dynamics of the classical cavity field amplitude α(t) = ⟨a(t)⟩, which follows the driving field as α̇(t) = −(κ/2) α(t) + E(t) for a resonant drive, giving α(t) ≈ 2E(t)/κ in the bad cavity limit. We can then reduce the dynamics of the system to that of the mechanical modes alone, the cavity field α(t) playing the role of the action delivering the cooling momentum kicks to the mechanical oscillators via the radiation pressure force F_j(t) = g_j^(1) |α(t)|². As noted before, since the actions are obtained from the same cavity field intensity, they only differ by the multiplicative g_j^(1) factor. Notice also that this configuration strongly resembles a cold damping approach 12-14. In contrast, for a quadratic coupling Hamiltonian, the changes in the momentum are of a very different nature, as the cavity periodically modulates the oscillation frequencies of each mode.
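As an illustration of this reduction, the following minimal Python sketch computes the radiation pressure action in the bad cavity limit; the function names and the adiabatic approximation α(t) ≈ 2E(t)/κ are ours and serve only to make the scaling with g_j^(1) explicit, not to reproduce the exact implementation.

import numpy as np

def cavity_amplitude(drive, kappa):
    # Adiabatic (bad-cavity) classical field amplitude for a resonant drive:
    # alpha(t) ~ 2*E(t)/kappa, obtained from d(alpha)/dt = -(kappa/2)*alpha + E(t).
    return 2.0 * drive / kappa

def radiation_pressure_force(drive, kappa, g1):
    # Force on mode j in the linear-coupling regime, F_j(t) = g_j^(1) * |alpha(t)|^2.
    # All modes share the same intensity and differ only through g_j^(1).
    return g1 * np.abs(cavity_amplitude(drive, kappa)) ** 2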
To provide the neural network feedback onto the motional dynamics, we use the inferred q and q̇ at a given time t − Δt as input values for the neural network [see Fig. 1]. The trained network then selects the appropriate action by choosing the value of E(t) (from a finite number of possible values) to be acted upon the system. The size of the time-step Δt is chosen such that Δt ≪ 1/ω_j for all j, to minimize the error in the numerical integration. For a given drive amplitude, the set of actions on the different modes will be different according to the values of the optomechanical couplings g_j. We then use the fourth-order Runge-Kutta method (RK4) for the numerical integration of the dynamical system, where we iteratively sum for each time step. Additionally, at each time step we inject the measured parameters of the dynamical system as input data into the nonlinear function formed by the neural network to predict the action, and thereby the momentum kick or cavity field strength suitable for the next time step, which is acquired from the output nodes (neurons) of the network.
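A minimal sketch of one such closed-loop rollout for a single mode is given below; the RK4 stepper, the discrete force set and the interface of the policy function are illustrative assumptions rather than the exact implementation, and the thermal noise is added as the Wiener increment discussed in the Methods.

import numpy as np

def rk4_step(y, dt, deriv):
    # One fourth-order Runge-Kutta step for dy/dt = deriv(y).
    k1 = deriv(y)
    k2 = deriv(y + 0.5 * dt * k1)
    k3 = deriv(y + 0.5 * dt * k2)
    k4 = deriv(y + dt * k3)
    return y + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def run_episode(choose_action, q0, p0, omega, gamma, nbar, forces, dt, n_steps, rng):
    # Closed-loop rollout: at each step the policy sees (q, q_dot), picks one of the
    # discrete force values, and the dynamics is advanced by RK4 with an additional
    # thermal Wiener kick on the momentum.
    q, p = q0, p0
    observations, action_indices, energies = [], [], []
    for _ in range(n_steps):
        obs = np.array([q, omega * p])            # inferred position and velocity
        idx = choose_action(obs)                  # index into the discrete action set
        force = forces[idx]
        deriv = lambda y: np.array([omega * y[1],
                                    -omega * y[0] - gamma * y[1] + force])
        q, p = rk4_step(np.array([q, p]), dt, deriv)
        p += np.sqrt(gamma * (2 * nbar + 1) * dt) * rng.normal()   # thermal noise
        observations.append(obs)
        action_indices.append(idx)
        energies.append(0.5 * (q ** 2 + p ** 2))
    return np.array(observations), np.array(action_indices), np.array(energies)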

Reinforcement Learning
The neural network provides a nonlinear function that, for given input data harboring information about the oscillator states at a given time step t, predicts the correct action for the next time step t + Δt that helps to reduce the overall energy of the dynamical system at later times. This function forms the neural network parametrized policy π. To obtain an optimal (or nearly optimal) policy we employ the technique of reinforcement learning 36,44 and in particular a policy gradient approach 45. Such a problem is in general referred to as a Markov decision process (MDP) 46 and described in detail in the Methods section. Here, the network acts as an agent that, by observing parameters of the environment (resonator), improves its probabilistic policy for choosing the right actions a_t at a given time t in order to increase an overall reward R = Σ_t R_t (the full reward over a trajectory) that is connected to the reduction of the energy of the resonator modes. The actions are chosen from a finite set (of values of different amplitudes) and realized as momentum kicks or translated into frequency shifts. As an input to the network we feed information about the state of the environment given by s_t = (q(t), q̇(t)). The network outputs the probabilities π_θ(a_t|s_t) for the actions a_t that could be applied to the dynamical system. Here, the parameter θ encompasses all the weights and biases of the network. We take the action with the highest probability and apply it in the next iteration of Eqs. 1 to 3b. The probabilities π_θ(a_t|s_t) can be optimized with respect to an increased reward return R_t by employing an update rule for the weights and biases of the neural network, following θ ← θ + Δθ with

Δθ = η E[ Σ_t ∇_θ log π_θ(a_t|s_t) (R_t − b) ],

where E denotes the expectation value over all state and action sequences (full trajectories), which here is approximated by averaging over a large enough set of oscillator trajectories (training batch) and their corresponding action sequences, obtained from the iterative summation of the dynamical equations (RK4) and from the predictions of the neural network at each time step for various randomly chosen initial conditions. The learning rate is given by the parameter η and b is a baseline used to suppress fluctuations of the reward gradient 45,47. Here, the baseline is approximated by the reward averaged over the training batch. The neural network, represented by the array θ encompassing all weights and biases, consists of an input layer, an output layer and two hidden layers, whereby the number of input neurons depends on the number of measured variables of the system while the number of output neurons depends on the number of possible output actions (see Methods). The two hidden layers consist of 60 to 100 neurons each. The network is densely connected and we chose "relu" (rectified linear unit) as the nonlinear activation function acting on each neuron in the two hidden layers. The probabilities for each action given by the output layer are obtained by using the "softmax" nonlinear function on the output neurons. From these probabilities the action is chosen by taking the neuron index with the highest probability value in the output.
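A minimal sketch of such a policy network and of the update rule is given below; it uses tf.keras (the original work used Keras with the Theano backend 50), and the layer sizes, learning rate and greedy action selection are illustrative choices. Weighting a categorical cross-entropy loss on one-hot encoded actions by (R_t − b) reproduces the policy gradient written above.

import numpy as np
from tensorflow import keras

def build_policy(n_inputs, n_actions, n_hidden=80, learning_rate=1e-3):
    # Dense policy network: two relu hidden layers and a softmax output over actions.
    model = keras.Sequential([
        keras.layers.Dense(n_hidden, activation="relu", input_shape=(n_inputs,)),
        keras.layers.Dense(n_hidden, activation="relu"),
        keras.layers.Dense(n_actions, activation="softmax"),
    ])
    # Cross-entropy on one-hot actions weighted by (R_t - b) gives the gradient
    # -(R_t - b) * grad_theta log pi_theta(a_t|s_t), i.e. the policy gradient update.
    model.compile(optimizer=keras.optimizers.Adam(learning_rate),
                  loss="categorical_crossentropy")
    return model

def choose_action(model, obs):
    # Greedy selection: take the output neuron with the largest probability.
    probs = model.predict(obs[np.newaxis, :], verbose=0)[0]
    return int(np.argmax(probs))

def policy_gradient_update(model, states, action_indices, rewards, n_actions):
    # One update over a flattened batch of (state, action, reward) triples.
    baseline = rewards.mean()                     # variance-reducing baseline b
    targets = np.eye(n_actions)[action_indices]   # one-hot encoding of chosen actions
    model.fit(states, targets, sample_weight=rewards - baseline,
              batch_size=len(states), epochs=1, verbose=0)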

Results
Single mode cooling. In a first step we numerically simulate the time dynamics of a single oscillating mode of frequency ω, initially in a thermal distribution imposed by its coupling to an environment at some temperature T. This corresponds to a Boltzmann distribution of energies with average given by the total occupation number n̄ and with the normalized energy E = (q² + p²)/2. We then randomly pick an initial energy value E_0 from the thermal distribution by picking a random number s between zero and one. From the equipartition theorem we deduce q(0) and p(0) and train the neural network by recursively injecting sequences of (q(t), q̇(t)), obtained by applying Eqs. 3a, 3c and 4 recursively on the initial values, as training data into the network. A reward at each time step is only given when the action reduced the energy of the resonator at that time step with respect to the previous time step. The reward is defined in terms of the separation between the initial energy E_0 and E_t, where E_t is the total energy at time t: it gets larger when the energy separation between the current and initial energies increases, therefore optimizing the effective cooling rate (see Methods). The application of the reward to the single mode cooling is exemplified for the quadratic coupling configuration illustrated in Fig. 1d. The cooling dynamics is exemplified both as amplitude decreases [Fig. 2a] and in phase space [Fig. 2b] on two trajectories corresponding to two different initial states randomly picked from a thermal initial distribution with average occupancy n̄ = 100. The action sequences of the network in Fig. 2c show a periodic structure matching the frequency of the resonator mode, which is more visible in the zoom-in plot provided in Fig. 2e. There one can follow the time dynamics of the applied action and its effect onto both the position and momentum, which in total over all trajectories results in the reduction of the average energy as presented in Fig. 2d.
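A sketch of a reward consistent with this description (only awarded when the energy decreased in the last step, and growing with the separation from the initial energy) could read as follows; the normalization by E_0 is an illustrative choice, not necessarily the exact form used.

def step_reward(E_init, E_prev, E_curr):
    # Reward only if the last action lowered the energy; its magnitude grows with
    # the separation of the current energy from the initial energy E_init.
    return (E_init - E_curr) / E_init if E_curr < E_prev else 0.0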
While the training of the neural network to produce an optimal policy is done with training batches of 80 trajectories, each with 4000 time steps (already approaching a low-energy steady state as presented in Fig. 2a,b), we test the stability of the cooling policy by applying the trained network to a sample of thousands of trajectories with an extended time range of 20000 time steps. The results for these thousands of sample trajectories are presented as initial and final phase space distributions in Fig. 2f and as a histogram of the energy distributions in Fig. 2g. The injected thermal noise in all plots in Fig. 2 corresponds to a thermal occupation number of n̄ ≈ 100 and a thermalization rate of γ/ω = 4 × 10⁻⁵. The choice of the initial thermal state is however arbitrary and with equal computational power one can also describe the dynamics of oscillators initially populated with more than 10⁵ quanta. These results show that the policy reduces the energy of the mechanical mode and converges to a steady state irrespective of the initial high-energy state derived from the thermal distribution. Additionally, the larger data set does not show any divergent outliers, suggesting that convergence has been reached.
Simultaneous cooling of many modes. We display the generality of this approach by applying the network to find a strategy to cool up to four modes simultaneously. As in the case of single mode cooling, we apply a single force on the mirror: this poses a challenge, as a good cooling strategy for a given mode might actually lead to the heating of the other modes. In general, owing to this challenge, a simultaneous cooling strategy has an increased complexity in the choice of the action sequences, which leads to overall slower cooling rates. The same principles used for cooling a single resonator are applied to cooling four modes subjected to the same actions by the network, as presented in Fig. 3. Here, we use the setup configuration presented in Fig. 1b. While the four modes have different frequencies ω_j and coupling strengths g_j, they are subjected to the same time sequence of actions, delivered by intensity variations of an impinging laser beam in free space or via the field intensity |α(t)|² in the cavity.
The input to the network is given by a collective position coordinate, obtained as a linear combination of the individual mode displacements, together with its time derivative. These quantities can be obtained for example from an interferometer that is sensitive to the fluctuations of the membrane (see Fig. 1e) or via homodyne detection. The derivation of the initial conditions is described in the Methods section.
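In code, the construction of this two-component observation could look as follows; the weights c_j standing for the readout-dependent linear combination are placeholders.

import numpy as np

def collective_observation(q, q_dot, weights):
    # Multimode network input: one collective displacement coordinate, built as a
    # weighted sum over the individual mode displacements, and its time derivative.
    return np.array([np.dot(weights, q), np.dot(weights, q_dot)])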
Here, the agent needs to find a strategy that simultaneously cools the center-of-mass motion as well as all of the relative mode dynamics. As an example, the trajectories of the four modes are presented in Fig. 3a,b, where the cooling results from the corresponding actions presented in Fig. 3c, with a magnified view shown in Fig. 3d. In contrast to the action sequence imposed on the trajectory of a single resonator presented in Fig. 2c, which basically shows a periodic signal matching the frequency of the oscillator, here we find a more complex signal with a quasi-periodic pattern. The change of the average energy of the four resonators as a function of time is presented in Fig. 3f. In Fig. 3e the initial values, obtained from a Boltzmann distribution, and the final phase space values of a thousand trajectories for all oscillators are presented. A histogram of the sum of their individual energies is given in Fig. 3g. These results show that all four resonators can be simultaneously cooled down to temperatures that are orders of magnitude lower than their initial values, thereby exemplifying the strength of this adaptive approach.

Discussions and outlook
We have shown results of numerical simulations for the simultaneous cooling of a few degrees of freedom of a vibrating mechanical resonator. The feedback action has been realized via the reinforcement learning technique implemented on a neural network. There is a variety of other optimization methods that could obtain similar results, for example stochastic optimization methods such as hill climbing, random walks or genetic algorithms 36. It has been recently shown that evolution strategies (ES) offer a performance and efficiency similar to those of RL 48. We have selected RL, and especially the policy gradient method, to approach this problem due to its efficiency when a large continuous or quasi-continuous set of states is present 49.
Simultaneous cooling of a few modes indicates the possibility of partial or full refrigeration of mechanical resonators via optical control. It is remarkable that the network can perform efficient cooling of many modes while only being fed information on a time-evolving collective displacement quadrature. This is owed to the fact that the designed strategy optimizes single mode cooling at every instant in time while keeping the heating of all other modes small. The described procedure works both inside and outside optical cavities and in both the linear and nonlinear regimes, and is therefore easily adaptable to new systems. The technique could be easily extended to cool a number of oscillators or a number of particles trapped inside optical cavities or with tweezers. While the present treatment considers classical stochastic dynamics, a full quantum theory of neural network aided cooling will be tackled in the future, which might also lead towards feedback production of squeezed or squashed states. In this regard, for both single and many oscillation modes, we plan to analyze the efficiency of neural network cooling in comparison with standard cold-damping optical cooling. It is expected that a Fourier analysis of the action function indicated by the network could hint towards feedback implementations that surpass existing techniques, especially at the level of many degrees of freedom.

Methods
Initial conditions. We assume an initial Boltzmann distribution of energies from which the initial energy of each mode is drawn by inverse sampling as E_j(0) = −n̄_j ln(s),
where s is a value between 0 and 1. We then set q_j(0) = √(2E_j(0)) cos(2πφ_j) and p_j(0) = √(2E_j(0)) sin(2πφ_j), where φ_j is a random number between 0 and 1 for all j ∈ 1, …, n.

Thermalization dynamics. Let us describe the numerical procedure for simulating the action of a thermal environment onto the state of the mechanical resonator. We consider the equations of motion given above, which we rewrite as equations of differential forms where the noise dW(t) is included as a Wiener process. Numerically, this can be realized by ΔW(t) = √Δt N(0, 1), where N(0, 1) describes a normally distributed random variable of unit variance; consequently the Wiener increment is normally distributed with a variance equal to the numerical time increment Δt. As a numerical check we simulate the thermalization of an initially cold mechanical mode under the action of an environment with rate γ = 4 × 10⁻⁵ ω and at an effective occupancy n̄ = 100. The analytical Boltzmann distribution for this occupancy is shown in Fig. 4a as black dots, while the final state obtained from the numerical integration is represented by the red dots (the blue dots in the middle are the initial state). In Fig. 4b the agreement between the numerical simulation and the Boltzmann distribution is illustrated as a histogram of energy states. The time evolution is shown in Fig. 4c for the position and in Fig. 4d for the total energy, showing it approaching n̄ = 100 in the long time limit.
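A minimal sketch of this consistency check, using a simple Euler-Maruyama scheme for the stochastic part (the deterministic part can equally be advanced with RK4 as in the main text), is given below; the parameter values match those quoted above and the long-time mean energy should approach n̄.

import numpy as np

def thermalize(omega, gamma, nbar, dt, n_steps, seed=0):
    # Integrate dq = omega*p*dt and
    # dp = (-omega*q - gamma*p)*dt + sqrt(gamma*(2*nbar+1))*dW,
    # with dW a Wiener increment of variance dt, starting from a cold oscillator.
    rng = np.random.default_rng(seed)
    q, p = 0.0, 0.0
    energies = np.empty(n_steps)
    for k in range(n_steps):
        dW = np.sqrt(dt) * rng.normal()       # Wiener increment, variance = dt
        q += omega * p * dt
        p += (-omega * q - gamma * p) * dt + np.sqrt(gamma * (2 * nbar + 1)) * dW
        energies[k] = 0.5 * (q ** 2 + p ** 2)
    return energies

# Example with gamma/omega = 4e-5 and nbar = 100, as in Fig. 4:
# the running average of the energy approaches ~100 in the long-time limit.
energies = thermalize(omega=1.0, gamma=4e-5, nbar=100, dt=0.01, n_steps=500000)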
Additionally, in Fig. 4e we present the learning process for the results presented in Fig. 2 for a single mode and in Fig. 3 for four modes, where the increase in the average reward over 400 epochs, each with batches of 80 trajectories, is shown.

Markov decision process. The reinforcement learning procedure presented above can be fully described by a discrete-time stochastic control process. Since in this process the agent selects an action based on the stochastic policy π_θ(a|s), which depends only on the current observation of the given state s, the problem is described by a Markov decision process (MDP) 46. Formally, a Markov decision process is a 4-tuple (S, A, P, R), where S is the set of states, A forms the set of possible actions, P: S × A × S → [0, 1] is the transition function between states and R: S × A × S → [R_min, R_max] is the reward function as described above, with [R_min, R_max] being the continuous set of possible rewards.
In the case of a single resonator mode the state space is given by the observed pairs (q, q̇), while the set of actions is given by {0, F(a)}, with F(a) the force term. In the case where the random noise contribution ξ is zero, the transition function is deterministic, i.e. P(s′, a, s) equals one for the state s′ reached by integrating the equations of motion and zero otherwise. For more than one mode, we only obtain partial information of the state, which originally is described by the phase space vector (q_1, …, q_n, p_1, …, p_n). Here we need a generalization of an MDP, which is given by a partially observable Markov decision process (POMDP), where the agent cannot observe the full state. In this case we additionally have the set Ω, which describes the set of observations, and O, describing the set of conditional observation probabilities.
A pseudo code to implement the reinforcement learning (RL) procedure is presented in the following. We use the Keras package for Python and the Theano framework 50 to realize the neural network and the reinforcement learning procedure 51. The network and training parameters are presented in Table 1.
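The training loop can be sketched in Python as follows; it reuses the illustrative helpers introduced above (build_policy, choose_action, policy_gradient_update, run_episode, step_reward), and the hyperparameters shown are placeholders standing in for the values of Table 1 rather than the authors' original listing.

import numpy as np

# Placeholder hyperparameters standing in for the values listed in Table 1.
N_EPOCHS, BATCH_SIZE, N_STEPS, DT = 400, 80, 4000, 0.01
FORCES = np.array([-1.0, 0.0, 1.0])        # illustrative discrete action set
OMEGA, GAMMA, NBAR = 1.0, 4e-5, 100

policy = build_policy(n_inputs=2, n_actions=len(FORCES))
rng = np.random.default_rng(0)

for epoch in range(N_EPOCHS):
    batch_obs, batch_idx, batch_rew = [], [], []
    for _ in range(BATCH_SIZE):
        # Draw an initial thermal state by inverse sampling (see Initial conditions).
        E0 = -NBAR * np.log(rng.uniform())
        phi = 2 * np.pi * rng.uniform()
        q0, p0 = np.sqrt(2 * E0) * np.cos(phi), np.sqrt(2 * E0) * np.sin(phi)
        # Roll out one trajectory under the current policy.
        obs, idx, energies = run_episode(lambda o: choose_action(policy, o),
                                         q0, p0, OMEGA, GAMMA, NBAR,
                                         FORCES, DT, N_STEPS, rng)
        previous = np.concatenate(([E0], energies[:-1]))
        rewards = np.array([step_reward(E0, e_prev, e_curr)
                            for e_prev, e_curr in zip(previous, energies)])
        batch_obs.append(obs)
        batch_idx.append(idx)
        batch_rew.append(rewards)
    # Policy-gradient (REINFORCE) update with a batch-averaged baseline.
    policy_gradient_update(policy,
                           np.concatenate(batch_obs),
                           np.concatenate(batch_idx),
                           np.concatenate(batch_rew),
                           n_actions=len(FORCES))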