Gait switching and targeted navigation of microswimmers via deep reinforcement learning

Swimming microorganisms switch between locomotory gaits to enable complex navigation strategies such as run-and-tumble to explore their environments and search for specific targets. This ability of targeted navigation via adaptive gait-switching is particularly desirable for the development of smart artificial microswimmers that can perform complex biomedical tasks such as targeted drug delivery and microsurgery in an autonomous manner. Here we use a deep reinforcement learning approach to enable a model microswimmer to self-learn effective locomotory gaits for translation, rotation and combined motions. The Artificial Intelligence (AI) powered swimmer can switch between various locomotory gaits adaptively to navigate towards target locations. The multimodal navigation strategy is reminiscent of gait-switching behaviors adopted by swimming microorganisms. We show that the strategy advised by AI is robust to flow perturbations and versatile in enabling the swimmer to perform complex tasks such as path tracing without being explicitly programmed. Taken together, our results demonstrate the vast potential of these AI-powered swimmers for applications in unpredictable, complex fluid environments.

Biomedical applications of artificial microswimmers rely on efficient navigation strategies within complex and unpredictable fluid environments. Here, the authors use artificial intelligence to model and design microswimmers that are capable of self-learning efficient navigation strategies by adaptively switching between different locomotory gaits.


Swimming microorganisms have evolved versatile navigation strategies by switching their locomotory gaits in response to their surroundings 1. Their navigation strategies typically involve switching between translation and rotation modes such as run-and-tumble and reverse-and-flick in bacteria 2-5, as well as run-stop-shock and run-and-spin in eukaryotes 6,7. Such an adaptive, multimodal gait-switching ability is particularly desirable for biomedical applications of artificial microswimmers such as targeted drug delivery and microsurgery 8-12, which require navigation towards target locations in biological media with uncontrolled and/or unpredictable environmental factors 13-15.
Pioneering works by Purcell and subsequent studies demonstrated how simple reconfigurable systems with ingenious locomotory gaits can generate net translation and rotation, given the stringent constraints for locomotion at low Reynolds numbers 16. Yet, the design of locomotory gaits becomes increasingly intractable when more sophisticated maneuvers are required or environmental perturbations are present. Existing microswimmers are therefore typically designed with fixed locomotory gaits and rely on manual interventions for navigation 8,17-21. It remains an unresolved challenge to develop microswimmers with adaptive locomotory strategies, similar to those of biological cells, that can navigate complex environments autonomously. Modular microrobotics and the use of soft active materials 22,23 have been proposed to address this challenge.
More recently, the rapid development of artificial intelligence (AI) and its applications in locomotion problems 24-29 have opened different paths towards designing the next generation of smart microswimmers 30,31. Various machine learning approaches have enabled the navigation of active particles in the presence of background flows 32,33, thermal fluctuations 34,35, and obstacles 36. As minimal models, the microswimmers are often modeled as active particles with prescribed self-propelling velocities and certain degrees of freedom for speed variation and re-orientation. However, the complex adjustments in locomotory gaits required for such adaptations are typically not accounted for. Recent studies have begun to examine how different machine learning techniques enable reconfigurable microswimmers to evolve effective gaits for self-propulsion 37 and chemotactic response 38.
Here, we combine reinforcement learning (RL) with an artificial neural network to enable a simple reconfigurable system to perform complex maneuvers in a low-Reynolds-number environment. We show that the deep RL framework empowers a microswimmer to adapt its locomotory gaits in accomplishing sophisticated tasks, including targeted navigation and path tracing, without being explicitly programmed. The multimodal gait-switching strategies are reminiscent of those adopted by swimming microorganisms. Furthermore, we examine the performance of these locomotion strategies against perturbations by background flows. The results showcase the versatility of AI-powered swimmers and their robustness in media with uncontrolled environmental factors.

Results and discussion
Model reconfigurable system. We consider a simple reconfigurable system consisting of three spheres with radius R and centers ri (i = 1, 2, 3) connected by two arms with variable lengths and orientations, as shown in Fig. 1a. This setup generalizes previous swimmer models proposed by Najafi and Golestanian 39 and Ledesma-Aguilar et al. 40 by allowing more degrees of freedom. The interaction between the system and the surrounding viscous fluid is modeled by low Reynolds number hydrodynamics, imposing stringent constraints on the locomotive capability of the system. Unlike the traditional paradigm where the locomotory gaits are prescribed in advance 39-44, here we exploit a deep RL framework to enable the system to self-learn a set of locomotory gaits to swim along a target direction, θT. We employ a deep neural network based on the Actor-Critic structure and implement the Proximal Policy Optimization (PPO) algorithm 29,45 to train and update the agent (i.e., the AI) in charge of the decision-making process (Fig. 1b). The deep RL framework here extends previous studies from discrete action spaces to continuous action spaces 32,35,37,46, enhancing the swimmer's capability in developing more versatile locomotory gaits for complex navigation tasks (see the "Methods" section for implementation details of the Actor-Critic neural network and PPO algorithm).
Hydrodynamic interactions. The interaction between the spheres and their surrounding fluid is governed by the Stokes equation (∇p = μ∇²u, ∇ ⋅ u = 0). Here, p, μ and u represent, respectively, the pressure, dynamic viscosity, and velocity field. In this low Reynolds number regime, the velocities of the spheres Vi and the forces Fi acting on them can be related linearly as

Vi = ∑j Gij Fj,    (1)

where Gij is the Oseen tensor 47-49 given by

Gij = I/(6πμR) for i = j,  Gij = [1/(8πμ|ri − rj|)] (I + r̂ij r̂ij) for i ≠ j.    (2)

Here, I is the identity matrix and r̂ij = (ri − rj)/|ri − rj| denotes the unit vector between spheres i and j. The torque acting on sphere i is calculated by Ti = ri × Fi. The rates of actuation of the arm lengths L̇1, L̇2 and the intermediate angle θ̇31 can be expressed in terms of the velocities of the spheres Vi. The kinematics of the swimmer is fully determined upon applying the force-free (∑i Fi = 0) and torque-free (∑i Ti = 0) conditions. The Oseen tensor hydrodynamic description is valid when the spheres are not in close proximity (R ≪ L). We therefore constrain the arm and angle contractions such that 0.6L ≤ L1, L2 ≤ L and 2π/3 ≤ θ31 ≤ 4π/3.
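As an illustration of the velocity-force relation Vi = ∑j Gij Fj, the Oseen mobility matrix can be assembled numerically. The following is a minimal sketch for planar motion with unit viscosity; the function name and array layout are ours, not the authors' implementation:

```python
import numpy as np

def oseen_mobility(centers, R, mu=1.0):
    """Assemble the mobility matrix G relating sphere velocities to
    forces, V = G F, for planar motion (2 components per sphere).
    Diagonal blocks: Stokes self-mobility I/(6*pi*mu*R);
    off-diagonal blocks: Oseen tensor (I + rhat rhat^T)/(8*pi*mu*d).
    Valid only for well-separated spheres (R << d)."""
    n = len(centers)
    G = np.zeros((2 * n, 2 * n))
    for i in range(n):
        for j in range(n):
            if i == j:
                block = np.eye(2) / (6 * np.pi * mu * R)
            else:
                d = centers[i] - centers[j]
                dist = np.linalg.norm(d)
                rhat = d / dist
                block = (np.eye(2) + np.outer(rhat, rhat)) / (8 * np.pi * mu * dist)
            G[2 * i:2 * i + 2, 2 * j:2 * j + 2] = block
    return G

# Example: three collinear spheres; the force-free and torque-free
# conditions would then be imposed when solving for the kinematics.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
G = oseen_mobility(centers, R=0.1)
```

The symmetry of G reflects the reciprocity of Stokes flow; solving the swimmer kinematics amounts to inverting this linear relation subject to the force-free and torque-free constraints.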
The actuation rates of the arm lengths L̇1, L̇2 can be expressed in terms of the relative velocities of the spheres parallel to the arm orientations,

L̇1 = (V1 − V2) ⋅ t̂1,  L̇2 = (V3 − V2) ⋅ t̂2,

where t̂1 = (r1 − r2)/L1 and t̂2 = (r3 − r2)/L2 are the unit vectors along the two arms. The actuation rate of the intermediate angle θ̇31 can be expressed in terms of the relative velocities of the spheres perpendicular to the arm orientations,

θ̇31 = θ̇1 − θ̇2,  θ̇1 = (V1 − V2) ⋅ n̂1 / L1,  θ̇2 = (V3 − V2) ⋅ n̂2 / L2,

where θ̇1 and θ̇2 are the arm rotation speeds and n̂1, n̂2 are the in-plane unit vectors perpendicular to the arms. Together with the Oseen tensor description of the hydrodynamic interaction between the spheres, Eqs. (1) and (2) in the main text, and the overall force-free and torque-free conditions, the kinematics of the swimmer is fully determined.
In presenting our results, we scale lengths by the fully extended arm length L, velocities by a characteristic actuation rate of the arm V c , and hence time by L/V c and forces by μLV c (see Nondimensionalization under Supplementary methods).
Targeted navigation. We first use the deep RL framework to train the model system to swim along a target direction θT, given an arbitrary initial swimmer orientation θo. The swimmer's orientation is defined based on the relative position between the swimmer's centroid rc = ∑i ri/3 and r1 as θo = arg(rc − r1) (Fig. 1).
In the RL algorithm, the state s = (r1, L1, L2, θ1, θ2) of the system is specified by the sphere center r1, arm lengths L1, L2, and arm orientations θ1, θ2. The observation o = (L1, L2, θ31, cos θd, sin θd) is extracted from the state, where θ31 is the intermediate angle and θd = θT − θo is the difference between the target direction θT and the swimmer's orientation θo; note that the angle difference is expressed in terms of (cos θd, sin θd) to avoid discontinuity in the orientation space. The AI decides the swimmer's next action based on the observation using the Actor neural network: for each action step Δt, the swimmer performs an action a = (L̇1, L̇2, θ̇31) by actuating its two arms, leading to swimmer displacement. To quantify the success of a given action, the reward is measured by the displacement of the swimmer's centroid along the target direction, rt = (rc,t − rc,t−1) ⋅ êT, where êT = (cos θT, sin θT).

We divide the training process into a total of Ne episodes, with each episode consisting of Nl = 150 learning steps. To ensure a full exploration of the observation space o, both the initial swimmer state s and the target direction θT are randomized in each episode. Based on the training results after every 20 episodes, the Critic neural network updates the AI to maximize the expected long-term rewards E[Rt=0|πθ], where πθ is the stochastic control policy, Rt = ∑t′=t..∞ γ^(t′−t) rt′ is the infinite-horizon discounted future return, and γ is the discount factor measuring the greediness of the algorithm 45,50. A large discount factor γ = 0.99 is set here to ensure farsightedness of the algorithm. As the episodes proceed, the Actor-Critic structure progressively trains the AI and thereby enhances the performance of the swimmer.
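The observation and reward defined above can be sketched in a few lines. This is an illustrative paraphrase (the helper names and centroid bookkeeping are ours, not the authors' code):

```python
import numpy as np

def observation(L1, L2, theta31, theta_T, theta_o):
    """Observation o = (L1, L2, theta31, cos(theta_d), sin(theta_d)).
    The (cos, sin) encoding avoids the 2*pi discontinuity in theta_d."""
    theta_d = theta_T - theta_o
    return np.array([L1, L2, theta31, np.cos(theta_d), np.sin(theta_d)])

def step_reward(rc_prev, rc_next, theta_T):
    """Reward: centroid displacement projected onto the target direction."""
    e_T = np.array([np.cos(theta_T), np.sin(theta_T)])
    return float((np.asarray(rc_next) - np.asarray(rc_prev)) @ e_T)
```

A displacement along the target direction yields a positive reward, while a purely perpendicular displacement yields zero, so maximizing the discounted return drives both re-orientation and propulsion.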
In Fig. 2 (Supplementary Movie 1) we visualize the navigation of a trained swimmer along a target direction θT, given a substantially different initial orientation θo. The swimmer's targeted navigation is accomplished in three stages: (1) in the initial phase (blue curve and regime), the swimmer employs "steering" gaits primarily for re-orientation, followed by (2) the "transition" phase (red curve and regime), in which the swimmer continues to adjust its direction while self-propelling, before reaching (3) the "translation" phase (green curve and regime), in which the re-orientation is complete and the swimmer simply self-propels along the target direction. This example illustrates how an AI-powered reconfigurable system evolves a multimodal navigation strategy without being explicitly programmed or relying on any prior knowledge of low-Reynolds-number locomotion. We next analyze the locomotory gaits of each mode in the evolved strategy.
Multimodal locomotory gaits. Here we examine the details of the locomotory gaits acquired by the swimmer for targeted navigation in the steering, transition, and translation modes. We distinguish these gaits by visualizing their configurational changes in the three-dimensional (3D) configuration space of the swimmer (L1, L2, θ31) in Fig. 3. Here we utilize an example of a swimmer navigating towards a target direction with |θd| > π/2 to illustrate the switching between different locomotory gaits (Fig. 3a, Supplementary Movies 2 and 3). The swimmer needs to re-orient itself in the counter-clockwise direction in this example; an example for the case of clockwise rotation is included in Supplementary Note 1 (Supplementary Fig. 1, Supplementary Movies 7 and 8). The dots in Fig. 3a represent configurations at different action steps. The configurations for the steering (blue dots), transition (red dots), and translation (green dots) gaits are clustered in different regions of the configuration space. A representative sequence of configurational changes for each mode of gaits is shown as solid lines to aid visualization (Fig. 3a).
We further examine the evolution of L1, L2, and θ31 using the representative sequences of configurational changes identified in Fig. 3a for each mode of gaits. For the steering gaits (Fig. 3b, blue lines and Fig. 3d, blue box), the swimmer repeatedly extends and contracts L2 and θ31 but keeps L1 constant (the left arm rests in the fully contracted state). The steering gaits thus reside in the L2−θ31 plane in Fig. 3a (blue line). The large variation in θ31 generates net rotation, substantially re-orienting the swimmer with a relatively small net translation (Fig. 3c). For the transition gaits (Fig. 3b, red lines and Fig. 3d, red box), the swimmer repeatedly extends and contracts all of L1, L2 and θ31, leading to significant amounts of both net rotation and translation (Fig. 3c). In the configuration space (Fig. 3a), the transition gaits tilt into the L1−L2 plane with an average θ31 less than π (red line). Compared with the steering gaits, the variation of θ31 becomes more restricted (Fig. 3b), resulting in a smaller net rotation for fine-tuning of the swimmer's orientation in the transition phase. Finally, for the translation gaits (Fig. 3b, green lines and Fig. 3d, green box), the swimmer's orientation is aligned with the target direction (θd ≈ 0); the swimmer repeatedly extends and contracts L1 and L2 while keeping θ31 close to π (i.e., all three spheres of the swimmer are aligned), resembling the swimming gaits of Najafi-Golestanian swimmers 39,51. In the configuration space (Fig. 3a), the translation gaits reside largely in the L1−L2 plane with an average θ31 approximately equal to π, generating the maximum net translation with minimal net rotation (Fig. 3c). The details of gait categorization are summarized under Supplementary methods.
It is noteworthy that the multimodal navigation strategy emerges solely from the AI without relying on prior knowledge of locomotion. The switching between rotation, transition, and translation gaits is analogous to the switching between turning and running modes observed in bacterial locomotion 2,5. These results demonstrate how an AI-powered swimmer, without being explicitly programmed, self-learns complex locomotory gaits from rich action and configuration spaces and undergoes autonomous gait switching in accomplishing targeted navigation.
Performance evaluation. Here we investigate the improvement of the swimmer's performance with an increased number of training episodes Ne. At the initial stage of training with a small Ne, the swimmer may fail to identify the right sets of locomotory gaits to achieve targeted navigation due to insufficient training. Continued training over more episodes enables the swimmer to identify better locomotory gaits to complete navigation tasks. We measure the improvement of the swimmer's performance with increased Ne by three locomotion tests: (1) Random target test: the swimmer is assigned a target direction selected randomly from a uniform distribution in [0, 2π]; (2) Rotation test: the swimmer is assigned a target direction with a large angle of difference from the swimmer's orientation (i.e., θd = ±π/2); (3) Translation test: the swimmer is assigned a target direction equal to the swimmer's orientation (i.e., θd = 0). A test is considered successful if the swimmer travels along the target direction for a distance of 5 units within 10,000 action steps. These tests ensure that the trained swimmer acquires a set of effective locomotory gaits to swim along any specified direction with robust rotation and translation.
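The success criterion shared by the three tests can be sketched as a small helper (an illustrative reconstruction of the criterion stated above, not the authors' code):

```python
import numpy as np

def navigation_success(trajectory, theta_T, distance=5.0, max_steps=10000):
    """A trial succeeds if the centroid travels `distance` along the
    target direction theta_T within `max_steps` action steps."""
    traj = np.asarray(trajectory)[:max_steps + 1]
    e_T = np.array([np.cos(theta_T), np.sin(theta_T)])
    proj = (traj - traj[0]) @ e_T  # displacement along the target direction
    return bool(np.any(proj >= distance))

# Example: a straight centroid trajectory along x of total length 6
traj = [[0.01 * i, 0.0] for i in range(601)]
```

Averaging this Boolean outcome over 100 independent trials gives the success rates reported in Fig. 4.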
We consider the success rates of the three tests over 100 trials (Fig. 4). For Ne = 3 × 10⁴, success rates of around 90% are obtained for the three tests. When Ne is increased to 9 × 10⁴, the swimmer masters translation with a 100% success rate but still needs more training for rotation. When Ne is increased further to 15 × 10⁴, the swimmer attains 100% success rates for all tests. This result demonstrates the continuous improvement in the robustness of targeted navigation with increased Ne up to 15 × 10⁴. As we further increase Ne, we find the relationship between Ne and performance to be non-monotonic: for total numbers of training episodes much greater than 15 × 10⁴, the overall success rate begins to drop and eventually fluctuates around 95%. We therefore selected the trained result at Ne = 15 × 10⁴ for the best overall performance.
To better understand the swimmer's training process, we also varied the number of steps in each episode, Nl. For a range from 100 to 300 and a fixed total number of episodes Ne, we found that Nl = 150 provides the most efficient balance between translation and rotation, requiring the least number of action steps to complete both the rotation and translation tests. We remark that, when Nl = 100, the swimmer was only able to translate but not to rotate, indicating the significant role Nl plays in learning.
Lastly, we remark that the swimmer appears to require more training, in both Ne and Nl, to learn rotation compared to translation. This may be attributed to the inherent complexity of the rotation gaits, where the swimmer needs to actuate its intermediate angle in addition to the actuation of the two arms required in the translation gaits.
Path tracing-"SWIM". Next we showcase the swimmer's capability in tracing complex paths in an autonomous manner. To illustrate, the swimmer is tasked to trace out the English word "SWIM" (Fig. 5, Supplementary Movie 4). We note that the hydrodynamic calculations required to design the locomotory gaits to trace such complex paths quickly become intractable as the path complexity increases. Here, instead of explicitly programming the gaits of the swimmer, we only select target points (pi, i = 1, 2, ..., 17; red spots in Fig. 5) as landmarks and require the swimmer to navigate towards these landmarks with its own AI, with the target direction at action step t + 1 given by θT,t+1 = arg(pi − rc,t). The swimmer is assigned the next target point pi+1 when its centroid is within a certain threshold (0.1 of the fully extended arm length) of pi. Completing these multiple navigation tasks sequentially enables the swimmer to trace out the word "SWIM" with high accuracy (Fig. 5, Supplementary Movie 4). In accomplishing this task, the swimmer switches between the three modes of locomotory gaits autonomously to swim towards individual target points and turn around the corners of the path based on the AI-powered navigation strategy. It is noteworthy that the swimmer is able to navigate around some corners (e.g., at target points 4 and 6) without activating the steering gaits, which are employed for corners with more acute angles (e.g., at target points 8, 14, and 16). While past approaches based on detailed hydrodynamic calculations, manual interventions, or other control methods may also complete such tasks, here we present reinforcement learning as an alternative approach for accomplishing these complex maneuvers in a more autonomous manner.
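The landmark-switching rule described above can be sketched as follows (a hypothetical helper mirroring the text; the waypoint list and 0.1 threshold come from the paper, the function name is ours):

```python
import numpy as np

def next_target(waypoints, rc, idx, tol=0.1):
    """Advance to the next landmark when the centroid rc is within
    tol (0.1 of the fully extended arm length) of the current one;
    return the target direction theta_T and the updated index."""
    waypoints = np.asarray(waypoints, dtype=float)
    rc = np.asarray(rc, dtype=float)
    if np.linalg.norm(waypoints[idx] - rc) < tol and idx + 1 < len(waypoints):
        idx += 1
    d = waypoints[idx] - rc  # theta_T = arg(p_i - r_c)
    return float(np.arctan2(d[1], d[0])), idx

theta_T, idx = next_target([[0.0, 0.0], [1.0, 0.0]], [0.05, 0.0], 0)
```

Feeding the returned θT to the trained policy at every action step is all that is needed to chain the individual navigation tasks into a full path trace.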
Robustness against flows. Last, we examine the performance of targeted navigation under the influence of flows (Fig. 6a, b, Supplementary Movies 5 and 6). In particular, to determine to what extent the AI-powered swimmer is capable of maintaining its target direction against flow perturbations, we use the same AI-powered swimmer trained without any background flow and impose a rotational flow generated by a rotlet at the origin 47,48, u∞ = −γ × r/r³, where γ = γez prescribes the strength of the rotlet in the z-direction and r = |r| is the magnitude of the position vector r from the origin (see the section "Simulations of background flow" under Supplementary methods). Here the AI-powered swimmer is tasked to navigate towards the positive x-direction under flow perturbations due to the rotlet. We examine how the swimmer adapts to the background flow when performing this task. For comparison, we contrast the resulting motion of the AI-powered swimmer with that of an untrained swimmer (i.e., a Najafi-Golestanian (NG) swimmer that performs only fixed locomotory gaits without any adaptivity 39). Without the background flow, both swimmers self-propel with the same speed. Both swimmers are initially placed close to the rotlet with rc = −5ex and we sample their performance with three different initial orientations, θo,0 = −π/3, 0, and π/3, under different flow strengths. Under a relatively weak flow (γ = 0.15, Fig. 6a, Supplementary Movie 5), the AI-powered swimmer is capable of navigating towards the positive x-direction against flow perturbations regardless of its initial orientation. In contrast, the trajectories of the NG swimmer are largely influenced by the rotlet flow, passively depending on the initial orientation of the swimmer. For an increased flow strength (γ = 1.5, Fig. 6b, Supplementary Movie 6), the NG swimmer completely loses control of its direction and is scattered by the rotlet into different directions, again due to the absence of any adaptivity. Under such a strong flow, the AI-powered swimmer initially circulates around the rotlet but eventually manages to escape from it, navigating to the positive x-direction successfully with similar trajectories for all initial orientations. We note that the vorticity experienced by the swimmer in this case is comparable with typical re-orientation rates of the AI-powered swimmer. We also remark that when navigating under flow perturbations, the AI-powered swimmer adopts the transition gaits to constantly re-orient itself towards the positive x-direction and eventually self-propels along that direction. These results showcase the AI-powered swimmer's capability in adapting its locomotory gaits to navigate robustly against flows.

Fig. 6
Fig. 6 Analysis of the performance of targeted navigation under the influence of flows. a The Artificial Intelligence powered swimmer and the Najafi-Golestanian (NG) swimmer escape from a relatively weak rotlet flow, u∞ = −γ × r/r³, where γ = γez prescribes the strength of the rotlet in the z-direction and r = |r| is the magnitude of the position vector r from the origin (γ = 0.15). The leftmost sphere of the AI-powered swimmer is marked as red and the other spheres are marked as blue to indicate the swimmer's current orientation (blue dashed arrow). The NG swimmer is colored red with its orientation marked as red dashed arrows. Three sets of trajectories (dashed, dotted, and solid lines) are shown with different initial swimmer orientations θo,0. The AI-powered swimmer travels to the right regardless of its initial orientation, whereas the trajectory of the NG swimmer is highly affected by the rotlet flow. b We compare the trajectories of the AI-powered swimmer and the NG swimmer in a strong rotlet flow (γ = 1.5). The NG swimmer completely loses control in the flow, while the AI-powered swimmer maintains its orientation towards the positive x-direction, with similar trajectories for different initial orientations. The animations of the two simulations are shown in Supplementary Movies 5 and 6.
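The rotlet background flow u∞ = −γ × r/r³ can be evaluated with a short sketch (planar evaluation at z = 0; the function name is illustrative):

```python
import numpy as np

def rotlet_velocity(pos, gamma=0.15):
    """Velocity of a rotlet of strength gamma*e_z at the origin,
    u = -gamma_vec x r / |r|^3, evaluated in the z = 0 plane."""
    r = np.array([pos[0], pos[1], 0.0])
    gamma_vec = np.array([0.0, 0.0, gamma])
    u = -np.cross(gamma_vec, r) / np.linalg.norm(r) ** 3
    return u[:2]  # in-plane components
```

The induced swirl decays as 1/r², so the perturbation is strongest near the origin where the swimmers are initially placed, and weakens as the AI-powered swimmer escapes outward.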

Conclusions
In this work, we present a deep RL approach to enable navigation of an artificial microswimmer via gait switching advised by the AI. In contrast to previous works that considered active particles with prescribed self-propelling velocities as minimal models 32,34,35 or simple one-dimensional swimmers 37,38,46, here we demonstrate how a reconfigurable system can learn complex locomotory gaits from rich and continuous action spaces to perform sophisticated maneuvers. Through RL, the swimmer develops distinct locomotory gaits for a multimodal (i.e., steering, transition, and translation) navigation strategy. The AI-powered swimmer can adapt its locomotory gaits in an autonomous manner to navigate towards any arbitrary direction. Furthermore, we show that the swimmer can navigate robustly under the influence of flows and trace convoluted paths. Instead of explicitly programming the swimmer to perform these tasks in the traditional approach, the swimmer is advised by the AI to perform complex locomotory gaits and autonomous gait switching in accomplishing these navigation tasks. The multimodal strategy employed by the AI-powered swimmer is reminiscent of the run-and-tumble in bacteria 2,5. Taken together, our results showcase the vast potential of this deep RL approach in realizing adaptivity similar to that of biological organisms for robust locomotive capabilities. Such adaptive behaviors are crucial for future biomedical applications of artificial microswimmers in complex media with uncontrolled and/or unpredictable environmental factors.
We finally discuss several possibilities for subsequent investigations based on this deep RL approach. While we demonstrate only planar motion in this work, the approach can be readily extended to three-dimensional navigation by allowing out-of-plane rotation of the swimmer's arms, with expanded observation and action spaces for the additional degrees of freedom. Moreover, the deep RL framework is not tied to any specific swimmer; a simple multi-sphere system is used in this work for illustration, and the same framework applies to other reconfigurable systems. We also remark that the AI-powered swimmer is able to overcome some influences of flows even though such flows were absent in the training. Subsequent investigations that include flow perturbations in the training may lead to even more powerful AI that could exploit the flows to further enhance the navigation strategies. Another practical aspect to consider is the effect of Brownian noise 52-54. Specifically, the characterization of the effect of thermal fluctuations on both the training process of the swimmer and its resulting navigation performance is currently underway. In addition to flow and thermal fluctuations, other environmental factors, including the presence of physical boundaries and obstacles, may be addressed in similar manners in future studies. The deep RL approach here opens an alternative path towards designing adaptive microswimmers with robust locomotive and navigation capabilities in more complex, realistic environments.

Methods
Here we briefly explain the Proximal Policy Optimization (PPO) algorithm we used to train our AI-powered swimmer.
In the PPO algorithm, the agent's motion control is managed with a neural network with an Actor-Critic structure. The Actor network can be considered as a stochastic control policy πθ(at|ot): it generates an action at given an observation ot following a Gaussian distribution. Here θ represents all the parameters of the Actor neural network. The Critic network is used to compute the value function Vϕ by assuming the agent starts at an observation o and acts according to a particular policy πθ. The parameters of the Critic network are denoted by ϕ.
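The Gaussian policy of the Actor can be sketched as follows (diagonal covariance; in the real Actor the mean and log standard deviation come from the network outputs, whereas here they are supplied as plain arrays for illustration):

```python
import numpy as np

def sample_action(mean, log_std, rng):
    """Sample a continuous action a ~ N(mean, exp(log_std)^2) and
    return it together with its log-probability (diagonal Gaussian)."""
    std = np.exp(log_std)
    a = mean + std * rng.standard_normal(mean.shape)
    logp = -0.5 * np.sum(((a - mean) / std) ** 2 + 2 * log_std + np.log(2 * np.pi))
    return a, logp

rng = np.random.default_rng(0)
# A 3-component action (two arm rates and the angle rate), zero mean
a, logp = sample_action(np.zeros(3), np.full(3, -1.0), rng)
```

Storing the log-probability at sampling time is what later allows the probability ratio between the current and old policies to be evaluated without re-running the old network.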
To effectively train the swimmer, we divide the total training process into episodes. Each episode can be considered as one round, which terminates after a fixed number of training steps (Nl = 150). To ensure full exploration of the observation space, we randomly initialize the swimmer's geometric configuration (L1, L2, θ1, θ2) and the target direction (θT) at the beginning of each episode.
At time t, the agent receives its current observation ot and samples action at based on the policy πθ. Given at, the swimmer interacts with its surroundings and calculates the next state st+1 and reward rt. The next observation ot+1 extracted from st+1 is sent to the agent for the next iteration. All the observations, actions, rewards and sampling probabilities are stored for the agent's update. The update process begins after running a fixed number of episodes NE = 20 (the total number of training steps per update is therefore N = NE × Nl = 3000). The goal of the update is to optimize θ so as to maximize the expected long-term rewards J(πθ) = Eτ[Rt=0], where the expectation is taken with respect to each running episode τ. Here, we use the infinite-horizon discounted returns Rt = ∑t′=t..∞ γ^(t′−t) rt′, where γ is the discount factor measuring the greediness of the algorithm. We set γ = 0.99 to ensure farsightedness. To solve this optimization problem, we use the typical policy gradient estimate ∇θJ(πθ). More specifically, we implemented the clipped-advantage PPO algorithm to avoid large changes in each gradient update. We estimated the surrogate objective J(πθ) by clipping the probability ratio r(θ) times the advantage function Ât. The probability ratio measures the probability of selecting an action for the current policy over the old policy, r(θ) = πθ(a|o)N×1 / πθold(a|o)N×1. The advantage function Ât describes the relative advantage of taking an action a based on an observation o over a randomly selected action and is calculated by subtracting the value function VN×1 from the discounted return RN×1 (Ât = RN×1 − VN×1).
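The clipped surrogate objective described above can be written compactly as a minimal numpy sketch (eps is the standard PPO clipping parameter; this is an illustration of the objective, not the authors' implementation):

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped objective (to be maximized): the elementwise
    minimum of r*A and clip(r, 1-eps, 1+eps)*A, batch-averaged.
    Clipping removes the incentive to move r far outside [1-eps, 1+eps]."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return float(np.mean(np.minimum(unclipped, clipped)))
```

For example, a ratio of 2 with a positive advantage of 1 contributes only clip(2, 0.8, 1.2) × 1 = 1.2 to the objective, which is what keeps each gradient update small.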
We then update the parameters θ, ϕ via a typical gradient descent algorithm, the Adam optimizer. The full details of our implementation are included in Algorithms 1 and 2 below. Here, K is the total number of epochs, Nl is the number of steps in one episode, and N is the total number of steps for each update. The PPO algorithm uses fixed-length trajectory segments τ. During each iteration, each of NA parallel actors collects T time steps of data; we then construct the surrogate loss on these NA T time steps of data and optimize it with Adam for K epochs.
In the following we present the algorithm tables for the PPO algorithm employed in this work. We refer the readers to classical monographs for more details 45.

Algorithm (PPO update, excerpt):
4: Evaluate expected returns VN×1 using observations oN×5 and the value function Vϕ
5: Compute the advantage function: ÂN×1 = RN×1 − VN×1
6: Evaluate the probabilities for policy πθ using observations oN×5 and actions aN×3, and store them as πθ(a|o)N×1
7: Compute the probability ratio: r(θ) = πθ(a|o)N×1 / πθold(a|o)N×1

Fig. 1
Fig. 1 Schematics of the model microswimmer and the deep neural network with Actor-Critic structure. a Schematic of the model microswimmer consisting of three spheres with radius R and centers ri (i = 1, 2, 3). We mark the leftmost sphere r1 as red and the other two spheres r2, r3 as blue to indicate the current orientation of the swimmer. The spheres are connected by two arms with variable lengths L1, L2 and orientations θ1, θ2, where θ31 is the intermediate angle between the two arms. The swimmer's orientation θo is defined based on the relative position between the swimmer's centroid rc = ∑i ri/3 and r1 as θo = arg(rc − r1). The swimmer is trained to swim along a target direction θT. b Schematic of the Actor-Critic neural networks. Both networks consist of three sets of layers (input layer, hidden layers, and output layer). Each layer is composed of neurons (marked as nodes). The weights of the neural network are illustrated as links between the nodes. The input layer has the same dimension as the observation. The three linear hidden layers have dimensions of 64, 32, and 32, respectively. The output layer dimension of the Actor network is the same as the action space dimension, whereas the output layer of the Critic network has only 1 neuron. We discuss the general idea as follows: based on the current observation, a reinforcement learning agent decides the next action using the Actor neural network. The next action is then evaluated by the Critic neural network to guide the training process. The swimmer performs the action advised by the agent and interacts with the hydrodynamic environment, leading to movements that constitute the next observation and reward. Both the Actor and Critic neural networks are updated periodically to improve the overall performance. See more details in the "Methods" section.

Fig. 2 Example of target navigation utilizing three distinct locomotory gaits. The Artificial Intelligence powered swimmer switches between distinct locomotory gaits (steering, transition, translation) advised by the reinforcement learning algorithm to steer itself towards a specified target direction θ_T (black arrow) and swim along the target direction afterwards. Different parts of the swimmer's trajectory are colored to represent the locomotion due to different locomotory gaits, where the steering, transition, and translation gaits are marked as blue, red, and green, respectively. Schematics of the swimmer configurations (not to scale) are shown for illustrative purposes, where the leftmost sphere is marked as red and the other two spheres are marked as blue to indicate the swimmer's current orientation (gray arrows). The inset shows the change in the swimmer's orientation θ_o over action steps. An animation of this simulation is shown in Supplemental Movie 1.

Fig. 3 Analysis of configurational changes revealing three distinct modes of locomotory gaits. The steering, transition, and translation gaits are marked as blue, red, and green, respectively. a A 3D configuration plot for a typical simulation in which the swimmer aligns with the target direction via a counterclockwise rotation, where L_1, L_2 are the arm lengths and θ_31 is the intermediate angle. Each dot represents one specific configuration of a locomotory gait. The solid lines mark an example cycle of each locomotory gait. b The changes in the arm lengths L_1 and L_2 and the intermediate angle θ_31 with respect to the configuration number for each locomotory gait. c The average translational velocity ⟨ẋ⟩ and rotational velocity ⟨θ̇⟩ are calculated by averaging the centroid translation along the target direction θ_T and the change of the swimmer's orientation θ_o over the total number of action steps for each locomotory gait. d Representative configurations labeled with the configuration number are displayed to illustrate the configurational changes for each selected sequence of locomotory gaits for the steering (blue box), transition (red box), and translation (green box) modes. The leftmost sphere of the swimmer is marked as red and the other two spheres are marked as blue to indicate the swimmer's current orientation. The gray arrows indicate the contraction/extension of the arms and the intermediate angle. For illustration, the reference frame of the configurations is rotated consistently such that the left arm of the first configuration is aligned horizontally in each sequence. Animations of the counterclockwise and clockwise simulations are shown in Supplementary Movies 2, 3 and 7, 8.
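The averaged quantities in panel c follow directly from their definitions: net centroid displacement projected onto the target direction, and net orientation change, each divided by the number of action steps. The sketch below illustrates this computation with hypothetical array names and a toy trajectory; it is not code from the paper.

```python
import numpy as np

def average_velocities(centroids, orientations, theta_T):
    """Average translational velocity along the target direction theta_T
    and average rotational velocity, per action step.
    centroids: (N+1, 2) array of centroid positions r_c over N action steps
    orientations: (N+1,) array of swimmer orientations theta_o."""
    n_steps = len(centroids) - 1
    e_T = np.array([np.cos(theta_T), np.sin(theta_T)])  # unit target vector
    # net centroid displacement projected onto the target direction
    x_dot = float((centroids[-1] - centroids[0]) @ e_T) / n_steps
    # net change in the swimmer's orientation
    theta_dot = float(orientations[-1] - orientations[0]) / n_steps
    return x_dot, theta_dot

# toy trajectory: straight motion along theta_T = 0 with no net rotation
c = np.column_stack([np.linspace(0.0, 1.0, 11), np.zeros(11)])
vx, om = average_velocities(c, np.zeros(11), 0.0)
print(vx, om)  # -> 0.1 0.0
```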

Fig. 4 Analysis of the swimmer's performance with increasing number of episodes. The number of episodes N_e indicates the total training time of the swimmer. Each episode during training contains a fixed number of action steps N_l = 150. a We use three tests (random target test, rotation test, and translation test) to measure the swimmer's performance within a fixed number of steps N_l = 150. For all tests, the swimmer starts with a random initial configuration to ensure a full exploration of the observation space. A total of 100 trials are considered for each test with swimmers trained at different N_e. A swimmer with insufficient training (3 × 10^4 episodes) may occasionally fail in the three tests (success rate ≈ 90%). At N_e = 9 × 10^4, the swimmer masters translation and improves its rotation ability. When N_e increases to 1.5 × 10^5, the swimmer obtains a 100% success rate in all tests. b Schematics of the random target test, rotation test, and translation test. The leftmost sphere is marked as red and the other spheres are marked as blue to indicate the swimmer's orientation θ_o (red dashed arrows). Given a random initial configuration, we test the swimmer's ability to translate along or rotate towards a target direction θ_T (solid red arrows). The black arrows indicate the swimmer's intended moving direction.

Fig. 5 Demonstration of the complex navigation capability of the Artificial Intelligence powered swimmer. The Artificial Intelligence powered swimmer switches between various locomotory gaits autonomously in tracing a complex trajectory, "SWIM". The trajectory of the central sphere of the swimmer is colored based on the locomotory gait modes: steering (blue), transition (red), and translation (green). The swimmer is given a list of target points (1-17), with one target point at a time. The black arrows at each point indicate the intended direction of the swimmer. From the current target point, the swimmer determines the target direction for the next action step t + 1, θ_T^{t+1}, and adapts its locomotory gaits based on its AI in navigating towards that direction. Schematics of the swimmer configurations (not to scale) are shown for illustrative purposes, where the leftmost sphere is marked as red and the other two spheres are marked as blue to indicate the swimmer's current orientation. An animation of this simulation is shown in Supplemental Movie 4.
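The waypoint-following setup described above can be sketched as follows. The target direction toward the current target point uses the same arg(·) convention as the orientation definition in Fig. 1; the function names and the switching tolerance `tol` are hypothetical choices for illustration, not parameters reported in the paper.

```python
import math

def target_direction(r_c, target):
    """Target direction theta_T from the swimmer's centroid r_c to the
    current target point, via the arg(.) convention of Fig. 1."""
    return math.atan2(target[1] - r_c[1], target[0] - r_c[0])

def next_waypoint(r_c, waypoints, idx, tol=0.5):
    """Advance to the next target point once the centroid is within tol
    (a hypothetical threshold) of the current one."""
    if math.dist(r_c, waypoints[idx]) < tol and idx + 1 < len(waypoints):
        idx += 1
    return idx

theta_T = target_direction((0.0, 0.0), (1.0, 1.0))
print(round(theta_T, 4))  # -> 0.7854 (i.e., pi/4)
idx = next_waypoint((0.1, 0.0), [(0.1, 0.0), (1.0, 1.0)], 0)
```

At each action step the swimmer would recompute θ_T toward the active waypoint and feed it to the trained policy, switching waypoints as they are reached.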

Algorithm 1. Environment
1: for time step t = 0, 1, ... do
2:   if mod(t, N_l) = 0 then
     ⋮
6:   Sample action a_t from policy π_θ
7:   Evaluate the next state s_{t+1} and reward r_t following the swimmer's hydrodynamics
8:   Compute the next observation o_{t+1} from state s_{t+1}
9:   if t = 0 or mod(t, N) ≠ 0 then
10:    Append observation o_{t+1}, action a_t, reward r_t, and action sampling probability π_θ(a_t|o_t) to the observation list o_{N×5}, the action list a_{N×3}, the reward list R_{N×1}, and the action sampling probability list π_θ_old(a|o)_{N×1}

Algorithm 2. Proximal Policy Optimization, Actor-Critic: Update the Agent
1: Input: initial policy parameters θ and initial value function parameters ϕ
2: for k = 0, 1, 2, ..., K do
3:   Compute the infinite-horizon discounted returns R_{N×1}
4:   Evaluate the expected returns V_{N×1} using the observations o_{N×5} and the value function V_ϕ
5:   Compute the advantage function: A_{N×1} = R_{N×1} − V_{N×1}
6:   Evaluate the probability of the actions a_{N×3} under the policy π_θ using the observations o_{N×5}; store the probabilities as π_θ(a|o)_{N×1}
7:   Compute the probability ratio: r(θ) = π_θ(a|o)_{N×1} / π_θ_old(a|o)_{N×1}
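The quantities assembled in the update steps above (advantage A = R − V and probability ratio r(θ) = π_θ(a|o)/π_θ_old(a|o)) feed the standard PPO clipped surrogate objective. A minimal NumPy sketch is given below; the clipping constant eps = 0.2 and the sign convention (minimizing the negative surrogate) are common PPO defaults assumed here, not details taken from the paper.

```python
import numpy as np

def ppo_clip_loss(returns, values, pi_new, pi_old, eps=0.2):
    """PPO clipped surrogate loss from the per-sample quantities in the
    algorithm table: advantage A = R - V and probability ratio
    r(theta) = pi_theta(a|o) / pi_theta_old(a|o).
    eps = 0.2 is an assumed, commonly used clipping constant."""
    advantage = returns - values                       # A = R - V
    ratio = pi_new / pi_old                            # r(theta)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Negative sign: minimizing this loss maximizes the clipped surrogate.
    return float(-np.mean(np.minimum(ratio * advantage,
                                     clipped * advantage)))

R = np.array([1.0, 2.0])            # discounted returns
V = np.array([0.5, 2.5])            # value function estimates
pi_new = np.array([0.6, 0.4])       # pi_theta(a|o)
pi_old = np.array([0.5, 0.5])       # pi_theta_old(a|o)
loss = ppo_clip_loss(R, V, pi_new, pi_old)
print(round(loss, 3))  # -> -0.1
```

Gradient descent on this loss with respect to θ, together with a regression loss fitting V_ϕ to the returns, completes one agent update.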