Abstract
Swimming microorganisms switch between locomotory gaits to enable complex navigation strategies, such as run-and-tumble, to explore their environments and search for specific targets. This ability of targeted navigation via adaptive gait switching is particularly desirable for the development of smart artificial microswimmers that can perform complex biomedical tasks, such as targeted drug delivery and microsurgery, in an autonomous manner. Here we use a deep reinforcement learning approach to enable a model microswimmer to self-learn effective locomotory gaits for translation, rotation, and combined motions. The artificial intelligence (AI) powered swimmer can switch between various locomotory gaits adaptively to navigate towards target locations. The multimodal navigation strategy is reminiscent of gait-switching behaviors adopted by swimming microorganisms. We show that the strategy advised by the AI is robust to flow perturbations and versatile in enabling the swimmer to perform complex tasks such as path tracing without being explicitly programmed. Taken together, our results demonstrate the vast potential of these AI-powered swimmers for applications in unpredictable, complex fluid environments.
Introduction
Swimming microorganisms have evolved versatile navigation strategies by switching their locomotory gaits in response to their surroundings^{1}. Their navigation strategies typically involve switching between translation and rotation modes, such as run-and-tumble and reverse-and-flick in bacteria^{2,3,4,5}, as well as run-stop-shock and run-and-spin in eukaryotes^{6,7}. Such an adaptive, multimodal gait-switching ability is particularly desirable for biomedical applications of artificial microswimmers, such as targeted drug delivery and microsurgery^{8,9,10,11,12}, which require navigation towards target locations in biological media with uncontrolled and/or unpredictable environmental factors^{13,14,15}.
Pioneering works by Purcell and subsequent studies demonstrated how simple reconfigurable systems with ingenious locomotory gaits can generate net translation and rotation, given the stringent constraints on locomotion at low Reynolds numbers^{16}. Yet, the design of locomotory gaits becomes increasingly intractable when more sophisticated maneuvers are required or environmental perturbations are present. Existing microswimmers are therefore typically designed with fixed locomotory gaits and rely on manual interventions for navigation^{8,17,18,19,20,21}. It remains an unresolved challenge to develop microswimmers with adaptive locomotory strategies, similar to those of biological cells, that can navigate complex environments autonomously. Modular microrobotics and the use of soft active materials^{22,23} have been proposed to address this challenge.
More recently, the rapid development of artificial intelligence (AI) and its applications in locomotion problems^{24,25,26,27,28,29} have opened different paths towards designing the next generation of smart microswimmers^{30,31}. Various machine learning approaches have enabled the navigation of active particles in the presence of background flows^{32,33}, thermal fluctuations^{34,35}, and obstacles^{36}. As minimal models, the microswimmers are often modeled as active particles with prescribed self-propelling velocities and certain degrees of freedom for speed variation and reorientation. However, the complex adjustments in locomotory gaits required for such adaptations are typically not accounted for. Recent studies have begun to examine how different machine learning techniques enable reconfigurable microswimmers to evolve effective gaits for self-propulsion^{37} and chemotactic response^{38}.
Here, we combine reinforcement learning (RL) with an artificial neural network to enable a simple reconfigurable system to perform complex maneuvers in a low-Reynolds-number environment. We show that the deep RL framework empowers a microswimmer to adapt its locomotory gaits in accomplishing sophisticated tasks, including targeted navigation and path tracing, without being explicitly programmed. The multimodal gait-switching strategies are reminiscent of those adopted by swimming microorganisms. Furthermore, we examine the performance of these locomotion strategies against perturbations by background flows. The results showcase the versatility of AI-powered swimmers and their robustness in media with uncontrolled environmental factors.
Results and discussion
Model reconfigurable system
We consider a simple reconfigurable system consisting of three spheres with radius R and centers r_{i} (i = 1, 2, 3), connected by two arms with variable lengths and orientations, as shown in Fig. 1a. This setup generalizes previous swimmer models proposed by Najafi and Golestanian^{39} and Ledesma-Aguilar et al.^{40} by allowing more degrees of freedom. The interaction between the system and the surrounding viscous fluid is modeled by low-Reynolds-number hydrodynamics, which imposes stringent constraints on the locomotive capability of the system. Unlike the traditional paradigm, where the locomotory gaits are prescribed in advance^{39,40,41,42,43,44}, here we exploit a deep RL framework to enable the system to self-learn a set of locomotory gaits to swim along a target direction, θ_{T}. We employ a deep neural network based on the Actor-Critic structure and implement the Proximal Policy Optimization (PPO) algorithm^{29,45} to train and update the agent (i.e., the AI) in charge of the decision-making process (Fig. 1b). The deep RL framework here extends previous studies from discrete action spaces to continuous action spaces^{32,35,37,46}, enhancing the swimmer's capability to develop more versatile locomotory gaits for complex navigation tasks (see the "Methods" section for implementation details of the Actor-Critic neural network and PPO algorithm).
Hydrodynamic interactions
The interaction between the spheres and their surrounding fluid is governed by the Stokes equations (∇p = μ∇^{2}u, ∇ ⋅ u = 0). Here, p, μ, and u represent, respectively, the pressure, dynamic viscosity, and velocity field. In this low-Reynolds-number regime, the velocities of the spheres V_{i} and the forces F_{i} acting on them are related linearly as
$${{\bf{V}}}_{i}=\mathop{\sum}\limits_{j}{{\bf{G}}}_{ij}\cdot {{\bf{F}}}_{j},\qquad (1)$$
where G_{ij} is the Oseen tensor^{47,48,49} given by
$${{\bf{G}}}_{ij}=\begin{cases}\dfrac{{\bf{I}}}{6\pi \mu R}, & i=j,\\[4pt] \dfrac{1}{8\pi \mu |{{\bf{r}}}_{i}-{{\bf{r}}}_{j}|}\left({\bf{I}}+{\hat{{\bf{r}}}}_{ij}{\hat{{\bf{r}}}}_{ij}\right), & i\ne j.\end{cases}\qquad (2)$$
Here, I is the identity matrix and \({\hat{{\bf{r}}}}_{ij}=({{\bf{r}}}_{i}-{{\bf{r}}}_{j})/|{{\bf{r}}}_{i}-{{\bf{r}}}_{j}|\) denotes the unit vector between spheres i and j. The torque acting on sphere i is calculated as T_{i} = r_{i} × F_{i}. The actuation rates of the arm lengths, \({\dot{L}}_{1}\) and \({\dot{L}}_{2}\), and of the intermediate angle, \({\dot{\theta }}_{31}\), can be expressed in terms of the velocities of the spheres V_{i}. The kinematics of the swimmer is fully determined upon applying the force-free (∑_{i}F_{i} = 0) and torque-free (∑_{i}T_{i} = 0) conditions. The Oseen tensor hydrodynamic description is valid when the spheres are not in close proximity (R ≪ L). We therefore constrain the arm and angle contractions such that 0.6L ≤ L_{1}, L_{2} ≤ L and 2π/3 ≤ θ_{31} ≤ 4π/3.
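The far-field coupling in Eqs. (1) and (2) can be sketched numerically. Below is a minimal illustration (not the authors' code; the function name and parameter values are assumptions) that assembles the 9 × 9 Oseen mobility matrix for three spheres and evaluates the velocities induced by a given set of forces:

```python
import numpy as np

def oseen_tensor(centers, R=0.1, mu=1.0):
    """Assemble the 3N x 3N mobility matrix G so that V = G @ F.

    Diagonal blocks are the Stokes drag mobility I/(6*pi*mu*R);
    off-diagonal blocks are (I + r_hat r_hat)/(8*pi*mu*|r_ij|).
    Valid only in the far-field limit R << |r_ij|.
    """
    n = len(centers)
    G = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(n):
            if i == j:
                block = np.eye(3) / (6 * np.pi * mu * R)
            else:
                r = centers[i] - centers[j]
                d = np.linalg.norm(r)
                r_hat = r / d
                block = (np.eye(3) + np.outer(r_hat, r_hat)) / (8 * np.pi * mu * d)
            G[3 * i:3 * i + 3, 3 * j:3 * j + 3] = block
    return G

# Example: three collinear spheres; a force on sphere 1 also drags spheres 2 and 3.
centers = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
G = oseen_tensor(centers)
F = np.zeros(9)
F[0] = 1.0          # unit force on sphere 1 along x
V = G @ F           # resulting sphere velocities
```

The hydrodynamic coupling is visible in the result: although only sphere 1 is forced, all three spheres acquire positive x-velocities.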
The actuation rates of the arm lengths, \({\dot{L}}_{1}\) and \({\dot{L}}_{2}\), can be expressed in terms of the relative velocities of the spheres parallel to the arm orientations:
The actuation rate of the intermediate angle \({\dot{\theta }}_{31}\) can be expressed in terms of the relative velocities of the spheres perpendicular to the arm orientations:
where \({\dot{\theta }}_{1}\) and \({\dot{\theta }}_{2}\) are the arm rotation speeds. Together with the Oseen tensor description of the hydrodynamic interactions between the spheres (Eqs. (1) and (2) in the main text) and the overall force-free and torque-free conditions, the kinematics of the swimmer is fully determined.
In presenting our results, we scale lengths by the fully extended arm length L, velocities by a characteristic actuation rate of the arm V_{c}, and hence time by L/V_{c} and forces by μLV_{c} (see Nondimensionalization under Supplementary methods).
Targeted navigation
We first use the deep RL framework to train the model system to swim along a target direction θ_{T}, given an arbitrary initial swimmer orientation θ_{o}. The swimmer's orientation is defined based on the relative position between the swimmer's centroid r_{c} = ∑_{i}r_{i}/3 and r_{1} as \({\theta }_{{\rm {o}}}=\arg ({{\bf{r}}}_{{\rm {c}}}-{{\bf{r}}}_{1})\) (Fig. 1).
In the RL algorithm, the state s ∈ (r_{1}, L_{1}, L_{2}, θ_{1}, θ_{2}) of the system is specified by the sphere center r_{1}, the arm lengths L_{1}, L_{2}, and the arm orientations θ_{1}, θ_{2}. The observation \(o\in ({L}_{1},{L}_{2},{\theta }_{31},\cos {\theta }_{{\rm {d}}},\sin {\theta }_{{\rm {d}}})\) is extracted from the state, where θ_{31} is the intermediate angle and θ_{d} = θ_{T} − θ_{o} is the difference between the target direction θ_{T} and the swimmer's orientation θ_{o}; note that the angle difference is expressed in terms of \((\cos {\theta }_{{\rm {d}}},\sin {\theta }_{{\rm {d}}})\) to avoid the discontinuity in the orientation space. The AI decides the swimmer's next action based on the observation using the Actor neural network: in each action step Δt, the swimmer performs an action \(a\in ({\dot{L}}_{1},{\dot{L}}_{2},{\dot{\theta }}_{31})\) by actuating its two arms, leading to a swimmer displacement. To quantify the success of a given action, the reward is measured by the displacement of the swimmer's centroid along the target direction, \({r}_{t}=({{\bf{r}}}_{{{\rm {c}}}_{t+1}}-{{\bf{r}}}_{{{\rm {c}}}_{t}})\cdot (\cos {\theta }_{{\rm {T}}},\,\sin {\theta }_{{\rm {T}}})\).
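As a minimal sketch (with hypothetical function names, not the authors' implementation), the observation encoding and reward described above can be written as:

```python
import math

def observation(L1, L2, theta31, theta_T, theta_o):
    """Observation o = (L1, L2, theta31, cos(theta_d), sin(theta_d)),
    with theta_d = theta_T - theta_o encoded as (cos, sin) to avoid
    the 2*pi discontinuity in the orientation space."""
    theta_d = theta_T - theta_o
    return (L1, L2, theta31, math.cos(theta_d), math.sin(theta_d))

def reward(rc_old, rc_new, theta_T):
    """Displacement of the centroid projected onto the target direction."""
    dx, dy = rc_new[0] - rc_old[0], rc_new[1] - rc_old[1]
    return dx * math.cos(theta_T) + dy * math.sin(theta_T)

# A step toward theta_T = 0 (positive x) earns a positive reward.
r = reward((0.0, 0.0), (0.5, 0.1), theta_T=0.0)   # -> 0.5
```

The (cos, sin) encoding guarantees that orientations differing by 2π map to identical observations, which a raw angle difference would not.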
We divide the training process into a total of N_{e} episodes, with each episode consisting of N_{l} = 150 learning steps. To ensure a full exploration of the observation space o, both the initial swimmer state s and the target direction θ_{T} are randomized in each episode. Based on the training results after every 20 episodes, the Critic neural network updates the AI to maximize the expected long-term reward E[R_{t=0}∣π_{θ}], where π_{θ} is the stochastic control policy, \({R}_{t}=\mathop{\sum }\nolimits_{t^{\prime} =t}^{\infty }{\gamma }^{t^{\prime} -t}{r}_{t^{\prime} }\) is the infinite-horizon discounted future return, and γ is the discount factor measuring the greediness of the algorithm^{45,50}. A large discount factor γ = 0.99 is set here to ensure far-sightedness of the algorithm. As the episodes proceed, the Actor-Critic structure progressively trains the AI and thereby enhances the performance of the swimmer.
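In practice, the discounted return above is computed over a finite episode by a single backward pass over the stored rewards. A sketch (assuming the sum is simply truncated at the episode end):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_{t' >= t} gamma^(t' - t) * r_t' by iterating
    backward over one episode's reward list."""
    R, out = 0.0, []
    for r in reversed(rewards):
        R = r + gamma * R      # R_t = r_t + gamma * R_{t+1}
        out.append(R)
    return out[::-1]

# With gamma = 0.5, only the first reward contributes to R_0 here.
returns = discounted_returns([1.0, 0.0, 0.0], gamma=0.5)   # -> [1.0, 0.0, 0.0]
```

The backward recursion R_t = r_t + γR_{t+1} avoids the O(N²) cost of evaluating each sum independently.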
In Fig. 2 (Supplementary Movie 1) we visualize the navigation of a trained swimmer along a target direction θ_{T}, given a substantially different initial orientation θ_{o}. The swimmer's targeted navigation is accomplished in three stages: (1) in the initial "steering" phase (blue curve and regime), the swimmer employs steering gaits primarily for reorientation; (2) in the "transition" phase (red curve and regime), the swimmer continues to adjust its direction while self-propelling; and (3) in the "translation" phase (green curve and regime), the reorientation is complete and the swimmer simply self-propels along the target direction. This example illustrates how an AI-powered reconfigurable system evolves a multimodal navigation strategy without being explicitly programmed or relying on any prior knowledge of low-Reynolds-number locomotion. We next analyze the locomotory gaits of each mode in the evolved strategy.
Multimodal locomotory gaits
Here we examine the details of the locomotory gaits acquired by the swimmer for targeted navigation in the steering, transition, and translation modes. We distinguish these gaits by visualizing their configurational changes in the three-dimensional (3D) configuration space of the swimmer (L_{1}, L_{2}, θ_{31}) in Fig. 3. We use an example of a swimmer navigating towards a target direction with ∣θ_{d}∣ > π/2 to illustrate the switching between different locomotory gaits (Fig. 3a, Supplementary Movies 2 and 3). The swimmer needs to reorient itself in the counterclockwise direction in this example; an example of clockwise rotation is included in Supplementary Note 1 (Supplementary Fig. 1, Movies 7 and 8). The dots in Fig. 3a represent configurations at different action steps. The configurations for the steering (blue dots), transition (red dots), and translation (green dots) gaits are clustered in different regions of the configuration space. A representative sequence of configurational changes for each mode of gaits is shown as a solid line to aid visualization (Fig. 3a).
We further examine the evolution of L_{1}, L_{2}, and θ_{31} using the representative sequences of configurational changes identified in Fig. 3a for each mode of gaits. For the steering gaits (Fig. 3b, blue lines and Fig. 3d, blue box), the swimmer repeatedly extends and contracts L_{2} and θ_{31} but keeps L_{1} constant (the left arm rests in the fully contracted state). The steering gaits thus reside in the L_{2}−θ_{31} plane in Fig. 3a (blue line). The large variation in θ_{31} generates net rotation, substantially reorienting the swimmer with a relatively small net translation (Fig. 3c). For the transition gaits (Fig. 3b, red lines and Fig. 3d, red box), the swimmer repeatedly extends and contracts all of L_{1}, L_{2}, and θ_{31}, leading to significant amounts of both net rotation and net translation (Fig. 3c). In the configuration space (Fig. 3a), the transition gaits tilt into the L_{1}−L_{2} plane with an average θ_{31} less than π (red line). Compared with the steering gaits, the variation of θ_{31} becomes more restricted (Fig. 3b), resulting in a smaller net rotation for fine-tuning of the swimmer's orientation in the transition phase. Finally, for the translation gaits (Fig. 3b, green lines and Fig. 3d, green box), the swimmer's orientation is aligned with the target direction (θ_{d} ≈ 0); the swimmer repeatedly extends and contracts L_{1} and L_{2} while keeping θ_{31} close to π (i.e., all three spheres of the swimmer are aligned), resembling the swimming gaits of Najafi–Golestanian swimmers^{39,51}. In the configuration space (Fig. 3a), the translation gaits reside largely in the L_{1}−L_{2} plane with an average θ_{31} of approximately π, generating the maximum net translation with minimal net rotation (Fig. 3c). The details of gait categorization are summarized under Supplementary methods.
It is noteworthy that the multimodal navigation strategy emerges solely from the AI, without relying on prior knowledge of locomotion. The switching between steering, transition, and translation gaits is analogous to the switching between turning and running modes observed in bacterial locomotion^{2,5}. These results demonstrate how an AI-powered swimmer, without being explicitly programmed, self-learns complex locomotory gaits from rich action and configuration spaces and undergoes autonomous gait switching in accomplishing targeted navigation.
Performance evaluation
Here we investigate the improvement of the swimmer's performance with an increased number of training episodes N_{e}. At the initial stage of training, with a small N_{e}, the swimmer may fail to identify the right sets of locomotory gaits to achieve targeted navigation due to insufficient training. Continuous training with an increased number of episodes enables the swimmer to identify better locomotory gaits to complete navigation tasks. We measure the improvement of the swimmer's performance with increased N_{e} by three locomotion tests: (1) random target test: the swimmer is assigned a target direction selected randomly from a uniform distribution in [0, 2π]; (2) rotation test: the swimmer is assigned a target direction with a large angle difference from the swimmer's orientation (i.e., θ_{d} = ±π/2); (3) translation test: the swimmer is assigned a target direction equal to the swimmer's orientation (i.e., θ_{d} = 0). A test is considered successful if the swimmer travels along the target direction for a distance of 5 units within 10,000 action steps. These tests ensure that the trained swimmer acquires a set of effective locomotory gaits to swim along any specified direction with robust rotation and translation.
We consider the success rates of the three tests over 100 trials (Fig. 4). For N_{e} = 3 × 10^{4}, success rates of around 90% are obtained for the three tests. When N_{e} is increased to 9 × 10^{4}, the swimmer masters translation with a 100% success rate but still needs more training for rotation. When N_{e} is increased further to 15 × 10^{4}, the swimmer attains 100% success rates in all tests. This result demonstrates the continuous improvement in the robustness of targeted navigation with increased N_{e} up to 15 × 10^{4}. As we further increase N_{e}, we find the relationship between N_{e} and performance to be non-monotonic: for a total number of training episodes much greater than 15 × 10^{4}, the overall success rate begins to drop and eventually fluctuates around 95%. We therefore selected the trained result at N_{e} = 15 × 10^{4} for the best overall performance.
To better understand the swimmer's training process, we also varied the number of steps in each episode, N_{l}. For a range from 100 to 300, at a fixed total number of episodes N_{e}, we found that N_{l} = 150 provides the most efficient balance between translation and rotation, requiring the least number of action steps to complete both the rotation and translation tests. We remark that, when N_{l} = 100, the swimmer was only able to translate but not to rotate, indicating the significant role N_{l} plays in learning.
Lastly, we remark that the swimmer appears to require more training, in both N_{e} and N_{l}, to learn rotation compared to translation. This may be attributed to the inherent complexity of the rotation gaits, in which the swimmer needs to actuate its intermediate angle in addition to the two arms required in the translation gaits.
Path tracing: "SWIM"
Next we showcase the swimmer's capability to trace complex paths in an autonomous manner. To illustrate, the swimmer is tasked to trace out the English word "SWIM" (Fig. 5, Supplementary Movie 4). We note that the hydrodynamic calculations required to design locomotory gaits that trace such paths quickly become intractable as the path complexity increases. Here, instead of explicitly programming the gaits of the swimmer, we only select target points (p_{i}, i = 1, 2, . . . , 17; red spots in Fig. 5) as landmarks and require the swimmer to navigate towards these landmarks with its own AI, with the target direction at action step t + 1 given by \({\theta }_{{T}_{t+1}}=\arg ({{\bf{p}}}_{i}-{{\bf{r}}}_{{c}_{t}})\). The swimmer is assigned the next target point p_{i+1} when its centroid is within a certain threshold (0.1 of the fully extended arm length) of p_{i}. Completing these navigation tasks sequentially enables the swimmer to trace out the word "SWIM" with high accuracy (Fig. 5, Supplementary Movie 4). In accomplishing this task, the swimmer switches between the three modes of locomotory gaits autonomously to swim towards individual target points and to turn around the corners of the path based on the AI-powered navigation strategy. It is noteworthy that the swimmer is able to navigate around some corners (e.g., at target points 4 and 6) without activating the steering gaits, which are employed for corners with more acute angles (e.g., at target points 8, 14, and 16). While past approaches based on detailed hydrodynamic calculations, manual interventions, or other control methods may also complete such tasks, here we present reinforcement learning as an alternative approach that accomplishes these complex maneuvers in a more autonomous manner.
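The waypoint-switching logic described above can be sketched as follows (a minimal illustration with a hypothetical function name; the threshold value matches the 0.1 arm-length criterion in the text):

```python
import math

def next_target_direction(waypoints, rc, idx, threshold=0.1):
    """Advance to waypoint idx+1 once the centroid rc is within
    `threshold` of the current waypoint, then return the target
    direction theta_T = arg(p_i - rc) and the (possibly updated) index."""
    px, py = waypoints[idx]
    if math.hypot(px - rc[0], py - rc[1]) < threshold and idx + 1 < len(waypoints):
        idx += 1
        px, py = waypoints[idx]
    theta_T = math.atan2(py - rc[1], px - rc[0])
    return theta_T, idx

# Reaching waypoint 0 switches the target to waypoint 1 (nearly straight up here).
theta, idx = next_target_direction([(1.0, 0.0), (1.0, 2.0)], rc=(0.95, 0.0), idx=0)
```

Chaining such waypoint targets is all that is needed to trace a path: the gait selection along each leg is left entirely to the trained policy.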
Robustness against flows
Last, we examine the performance of targeted navigation under the influence of flows (Fig. 6a, b, Supplementary Movies 5 and 6). In particular, to determine to what extent the AI-powered swimmer is capable of maintaining its target direction against flow perturbations, we use the same AI-powered swimmer trained without any background flow and impose a rotational flow generated by a rotlet at the origin^{47,48}, u_{∞} = −γ × r/r^{3}, where γ = γe_{z} prescribes the strength of the rotlet in the z-direction and r = ∣r∣ is the magnitude of the position vector r from the origin (see the section "Simulations of background flow" under Supplementary methods). Here the AI-powered swimmer is tasked to navigate towards the positive x-direction under the flow perturbations due to the rotlet, and we examine how the swimmer adapts to the background flow when performing this task. For comparison, we contrast the resulting motion of the AI-powered swimmer with that of an untrained swimmer (i.e., a Najafi–Golestanian (NG) swimmer that performs only fixed locomotory gaits without any adaptivity^{39}). Without the background flow, both swimmers self-propel at the same speed. Both swimmers are initially placed close to the rotlet, with r_{c} = −5e_{x}, and we sample their performance for three different initial orientations, \({\theta }_{{o}_{0}}=-\pi /3\), 0, and π/3, under different flow strengths. Under a relatively weak flow (γ = 0.15, Fig. 6a, Supplementary Movie 5), the AI-powered swimmer is capable of navigating towards the positive x-direction against the flow perturbations, regardless of its initial orientation. In contrast, the trajectories of the NG swimmer are largely influenced by the rotlet flow, passively depending on the initial orientation of the swimmer. For an increased flow strength (γ = 1.5, Fig. 6b, Supplementary Movie 6), the NG swimmer completely loses control of its direction and is scattered by the rotlet into different directions, again due to the absence of any adaptivity.
Under such a strong flow, the AI-powered swimmer initially circulates around the rotlet but eventually manages to escape from it, navigating towards the positive x-direction successfully, with similar trajectories for all initial orientations. We note that the vorticity experienced by the swimmer in this case is comparable to the typical reorientation rates of the AI-powered swimmer. We also remark that, when navigating under flow perturbations, the AI-powered swimmer adopts the transition gaits to constantly reorient itself towards the positive x-direction and eventually self-propels along that direction. These results showcase the AI-powered swimmer's capability of adapting its locomotory gaits to navigate robustly against flows.
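For reference, the rotlet field u_{∞} = −γ × r/r^{3} used in these tests can be evaluated as below (a sketch, not the authors' code; the function name and default strength are assumptions):

```python
import numpy as np

def rotlet_flow(r, gamma=0.15):
    """Background velocity u = -gamma_vec x r / |r|^3 of a rotlet at the
    origin, with gamma_vec = gamma * e_z as in the main text."""
    gamma_vec = np.array([0.0, 0.0, gamma])
    d = np.linalg.norm(r)
    return -np.cross(gamma_vec, r) / d**3

u1 = rotlet_flow(np.array([1.0, 0.0, 0.0]))   # swirl at unit distance
u2 = rotlet_flow(np.array([2.0, 0.0, 0.0]))   # four times weaker
```

The field is purely azimuthal (perpendicular to r) and decays as 1/r², so the perturbation is strongest near the origin where the swimmers are initially placed.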
Conclusions
In this work, we present a deep RL approach that enables the navigation of an artificial microswimmer via gait switching advised by the AI. In contrast to previous works that considered active particles with prescribed self-propelling velocities as minimal models^{32,34,35} or simple one-dimensional swimmers^{37,38,46}, here we demonstrate how a reconfigurable system can learn complex locomotory gaits from rich and continuous action spaces to perform sophisticated maneuvers. Through RL, the swimmer develops distinct locomotory gaits for a multimodal (i.e., steering, transition, and translation) navigation strategy. The AI-powered swimmer can adapt its locomotory gaits in an autonomous manner to navigate towards arbitrary directions. Furthermore, we show that the swimmer can navigate robustly under the influence of flows and trace convoluted paths. Instead of explicitly programming the swimmer to perform these tasks, as in the traditional approach, the swimmer is advised by the AI to perform complex locomotory gaits and autonomous gait switching in accomplishing these navigation tasks. The multimodal strategy employed by the AI-powered swimmer is reminiscent of run-and-tumble in bacteria^{2,5}. Taken together, our results showcase the vast potential of this deep RL approach in realizing adaptivity, similar to that of biological organisms, for robust locomotive capabilities. Such adaptive behaviors are crucial for future biomedical applications of artificial microswimmers in complex media with uncontrolled and/or unpredictable environmental factors.
We finally discuss several possibilities for subsequent investigations based on this deep RL approach. While we demonstrate only planar motion in this work, the approach can be readily extended to three-dimensional navigation by allowing out-of-plane rotation of the swimmer's arms, with expanded observation and action spaces for the additional degrees of freedom. Moreover, the deep RL framework is not tied to any specific swimmer; a simple multi-sphere system is used in this work for illustration, and the same framework applies to other reconfigurable systems. We also remark that the AI-powered swimmer is able to overcome some influences of flows even though such flows were absent in the training. Subsequent investigations that include flow perturbations in the training may lead to an even more powerful AI that could exploit the flows to further enhance the navigation strategies. Another practical aspect to consider is the effect of Brownian noise^{52,53,54}. Specifically, the characterization of the effect of thermal fluctuations on both the training process of the swimmer and its resulting navigation performance is currently underway. In addition to flows and thermal fluctuations, other environmental factors, including the presence of physical boundaries and obstacles, may be addressed in a similar manner in future studies. The deep RL approach here opens an alternative path towards designing adaptive microswimmers with robust locomotive and navigation capabilities in more complex, realistic environments.
Methods
Here we briefly explain the Proximal Policy Optimization (PPO) algorithm used to train our AI-powered swimmer.
In the PPO algorithm, the agent's motion control is managed by a neural network with an Actor-Critic structure. The Actor network can be considered a stochastic control policy π_{θ}(a_{t}∣o_{t}): it generates an action a_{t}, given an observation o_{t}, following a Gaussian distribution. Here θ represents all the parameters of the Actor neural network. The Critic network is used to compute the value function V_{ϕ}, assuming the agent starts at an observation o and acts according to a particular policy π_{θ}. The parameters of the Critic network are represented by ϕ.
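As a minimal sketch of the Actor's stochastic output (a hypothetical one-dimensional helper, not the authors' network code), sampling an action from a Gaussian policy and recording its log-probability, which is later needed for the PPO probability ratio, could look like:

```python
import math
import random

def gaussian_policy_sample(mean, std):
    """Sample an action from the Gaussian policy N(mean, std^2) and
    return it together with its log-probability log pi_theta(a|o)."""
    a = random.gauss(mean, std)
    logp = -0.5 * ((a - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))
    return a, logp
```

In the full algorithm, the mean (and possibly the standard deviation) of this distribution is the output of the Actor network evaluated at the observation o_{t}, one Gaussian per action component.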
To effectively train the swimmer, we divide the total training process into episodes. Each episode can be considered one round, which terminates after a fixed number of training steps (N_{l} = 150). To ensure full exploration of the observation space, we randomly initialize the swimmer's geometric configuration (L_{1}, L_{2}, θ_{1}, θ_{2}) and the target direction θ_{T} at the beginning of each episode.
At time t, the agent receives its current observation o_{t} and samples an action a_{t} based on the policy π_{θ}. Given a_{t}, the swimmer interacts with its surroundings, and the next state s_{t+1} and reward r_{t} are calculated. The next observation o_{t+1}, extracted from s_{t+1}, is sent to the agent for the next iteration. All the observations, actions, rewards, and sampling probabilities are stored for the agent's update. The update process begins after running a fixed number of episodes, N_{E} = 20 (the total number of training steps per update is therefore N = N_{E} × N_{l} = 3000). The goal of the update is to optimize θ so that the expected long-term reward J(π_{θ}) = E[R_{t=0}∣π_{θ}] is maximized.
The expectation is taken with respect to each running episode τ. Here, we use the infinite-horizon discounted returns \({R}_{t}=\mathop{\sum }\nolimits_{t^{\prime} =t}^{\infty }{\gamma }^{t^{\prime} -t}{r}_{t^{\prime} }\), where γ is the discount factor measuring the greediness of the algorithm; we set γ = 0.99 to ensure far-sightedness. To solve this optimization problem, we use the typical policy gradient estimate ∇_{θ}J(π_{θ}). More specifically, we implemented the clipped-advantage PPO algorithm to avoid large changes in each gradient update. We estimate the surrogate objective J(π_{θ}) by clipping the probability ratio r(θ) times the advantage function \({\hat{A}}_{t}\). The probability ratio measures the probability of selecting an action under the current policy relative to the old policy, \(r(\theta )={\pi }_{\theta }{(a|o)}_{N\times 1}/{\pi }_{{\theta }_{{{\rm{old}}}}}{(a|o)}_{N\times 1}\). The advantage function \({\hat{A}}_{t}\) describes the relative advantage of taking an action a based on an observation o over a randomly selected action and is calculated by subtracting the value function V_{N×1} from the discounted return R_{N×1}: \({\hat{A}}_{t}={R}_{N\times 1}-{V}_{N\times 1}\).
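The clipping operation described above can be sketched per sample as follows (illustrative only; in practice it is applied elementwise to the N stored samples and averaged):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A).
    Clipping removes the incentive to push the ratio r(theta) far
    outside [1-eps, 1+eps] in a single update."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# A large ratio with positive advantage is capped near (1+eps)*A ...
capped = ppo_clip_objective(ratio=2.0, advantage=1.0)
# ... while a modest ratio with positive advantage is left unclipped.
uncapped = ppo_clip_objective(ratio=0.5, advantage=1.0)
```

Taking the minimum makes the objective a pessimistic bound: the clipped term only ever removes reward for moving the policy too far, never adds it.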
We then update the parameters θ and ϕ via a typical gradient descent algorithm, the Adam optimizer. The full details of our implementation are included in Algorithms 1 and 2 below. Here, K is the total number of epochs, N_{l} is the number of steps in one episode, and N is the total number of steps for each update. The PPO algorithm uses fixed-length trajectory segments τ. During each iteration, each of the N_{A} parallel actors collects T time steps of data; we then construct the surrogate loss on these N_{A}T time steps of data and optimize it with Adam for K epochs.
In the following we present the algorithm tables for the PPO algorithm employed in this work. We refer the readers to classical monographs for more details^{45}.
Algorithm 1
Environment
1:  for time step t = 0, 1, . . . do 
2:  if mod(t, N_{l}) = 0 then 
3:  Reset state s_{t} 
4:  Compute observation o_{t} 
5:  end if 
6:  Sample action a_{t} from policy π_{θ} 
7:  Evaluate the next state s_{t+1} and reward r_{t} following the swimmer’s hydrodynamics 
8:  Compute the next observation o_{t+1} from state s_{t+1} 
9:  if t = 0 or mod(t, N) ≠ 0 then 
10:  Append observation o_{t+1}, action a_{t}, reward r_{t}, and action sampling probability π_{θ}(a_{t}∣o_{t}) to the observation list o_{N×5}, action list a_{N×3}, reward list R_{N×1}, and action sampling probability list \({\pi }_{{\theta }_{{{\rm{old}}}}}{(a|o)}_{N\times 1}\) 
11:  else 
12:  Update the Agent using Algorithm 2 
13:  end if 
14:  end for 
Algorithm 2
Proximal Policy Optimization, ActorCritic, Update the Agent
1:  Input: Initial policy parameter θ, initial value function parameter ϕ 
2:  for k = 0, 1, 2,…K do 
3:  Compute infinitehorizon discounted returns R_{N×1} 
4:  Evaluate expected returns V_{N×1} using observations o_{N×5} and value function V_{ϕ} 
5:  Compute the advantage function: \({\hat{A}}_{t}={R}_{N\times 1}-{V}_{N\times 1}\) 
6:  Evaluate the probability for policy π_{θ} using observations o_{N×5} and actions a_{N×3}, store the probability to π_{θ}(a∣o)_{N×1} 
7:  Compute the probability ratio: \(r(\theta )={\pi }_{\theta }{(a|o)}_{N\times 1}/{\pi }_{{\theta }_{{{\rm{old}}}}}{(a|o)}_{N\times 1}\) 
8:  Compute the clipped surrogate loss function: \({L}^{{{\rm{CLIP}}}}(\theta )={\mathbb{E}}[\min (r(\theta ){\hat{A}}_{t},\,{{\rm{clip}}}(r(\theta ),1-\epsilon ,1+\epsilon ){\hat{A}}_{t})]\), with constant ϵ 
9:  Compute the value-function loss: \({L}^{{{\rm{VF}}}}(\phi )=\frac{1}{2}{\mathbb{E}}[{({R}_{N\times 1}-{V}_{N\times 1})}^{2}]\) 
10:  Compute the entropy loss: L^{S} = αS[π_{θ}], with constant α 
11:  Compute the total loss: L(θ, ϕ) = −L^{CLIP}(θ) + L^{VF}(ϕ)−L^{S} 
12:  Optimize the total loss L with respect to (θ, ϕ) for K epochs with minibatch size M ≤ N_{A}T, where N_{A} is the number of parallel actors and T is the number of time steps 
13:  θ_{old} ← θ, ϕ_{old} ← ϕ 
14:  end for 
Code availability
The codes that support the findings of this study are available from the corresponding author upon reasonable request.
References
Lauga, E. & Powers, T. R. The hydrodynamics of swimming microorganisms. Rep. Prog. Phys. 72, 096601 (2009).
Berg, H. C. & Brown, D. A. Chemotaxis in Escherichia coli analysed by three-dimensional tracking. Nature 239, 500–504 (1972).
Stocker, R., Seymour, J. R., Samadani, A., Hunt, D. E. & Polz, M. F. Rapid chemotactic response enables marine bacteria to exploit ephemeral microscale nutrient patches. Proc. Natl Acad. Sci. USA 105, 4209–4214 (2008).
Xie, L., Altindal, T., Chattopadhyay, S. & Wu, X.-L. Bacterial flagellum as a propeller and as a rudder for efficient chemotaxis. Proc. Natl Acad. Sci. USA 108, 2246–2251 (2011).
Ipiña, E. P., Otte, S., Pontier-Bres, R., Czerucka, D. & Peruani, F. Bacteria display optimal transport near surfaces. Nat. Phys. 15, 610–615 (2019).
Wan, K. Y. & Goldstein, R. E. Time irreversibility and criticality in the motility of a flagellate microorganism. Phys. Rev. Lett. 121, 058103 (2018).
Tsang, A. C. H., Lam, A. T. & Riedel-Kruse, I. H. Polygonal motion and adaptable phototaxis via flagellar beat switching in the microswimmer Euglena gracilis. Nat. Phys. 14, 1216–1222 (2018).
Gao, W. et al. Cargo-towing fuel-free magnetic nanoswimmers for targeted drug delivery. Small 8, 460–467 (2012).
Zhang, L. et al. Characterizing the swimming properties of artificial bacterial flagella. Nano Lett. 9, 3663–3667 (2009).
Ghosh, A. & Fischer, P. Controlled propulsion of artificial magnetic nanostructured propellers. Nano Lett. 9, 2243–2245 (2009).
Ceylan, H. et al. 3D-printed biodegradable microswimmer for theranostic cargo delivery and release. ACS Nano 13, 3353–3362 (2019).
Huang, T.-Y. et al. 3D-printed microtransporters: compound micromachines for spatiotemporally controlled delivery of therapeutic agents. Adv. Mater. 27, 6644–6650 (2015).
Nassif, X., Bourdoulous, S., Eugène, E. & Couraud, P.-O. How do extracellular pathogens cross the blood–brain barrier? Trends Microbiol. 10, 227–232 (2002).
Celli, J. P. et al. Helicobacter pylori moves through mucus by reducing mucin viscoelasticity. Proc. Natl Acad. Sci. USA 106, 14321–14326 (2009).
Mirbagheri, S. A. & Fu, H. C. Helicobacter pylori couples motility and diffusion to actively create a heterogeneous complex medium in gastric mucus. Phys. Rev. Lett. 116, 198101 (2016).
Purcell, E. M. Life at low Reynolds number. Am. J. Phys. 45, 3–11 (1977).
Hu, W., Lum, G. Z., Mastrangeli, M. & Sitti, M. Small-scale soft-bodied robot with multimodal locomotion. Nature 554, 81–85 (2018).
Ohm, C., Brehmer, M. & Zentel, R. Liquid crystalline elastomers as actuators and sensors. Adv. Mater. 22, 3366–3387 (2010).
Dai, B. et al. Programmable artificial phototactic microswimmer. Nat. Nanotechnol. 11, 1087–1092 (2016).
Palagi, S. et al. Structured light enables biomimetic swimming and versatile locomotion of photoresponsive soft microrobots. Nat. Mater. 15, 647 (2016).
von Rohr, A., Trimpe, S., Marco, A., Fischer, P. & Palagi, S. Gait learning for soft microrobots controlled by light fields. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 6199–6206 (IEEE, 2018).
Huang, H.-W., Sakar, M. S., Petruska, A. J., Pané, S. & Nelson, B. J. Soft micromachines with programmable motility and morphology. Nat. Commun. 7, 12263 (2016).
Huang, H.-W. et al. Adaptive locomotion of artificial microswimmers. Sci. Adv. 5, eaau1532 (2019).
Reddy, G., Celani, A., Sejnowski, T. & Vergassola, M. Learning to soar in turbulent environments. Proc. Natl Acad. Sci. USA 113, E4877–E4884 (2016).
Reddy, G., WongNg, J., Celani, A., Sejnowski, T. J. & Vergassola, M. Glider soaring via reinforcement learning in the field. Nature 562, 236–239 (2018).
Gazzola, M., Tchieu, A. A., Alexeev, D., de Brauer, A. & Koumoutsakos, P. Learning to school in the presence of hydrodynamic interactions. J. Fluid Mech. 789, 726–749 (2016).
Biferale, L., Bonaccorso, F., Buzzicotti, M., Clark Di Leoni, P. & Gustavsson, K. Zermelo’s problem: optimal point-to-point navigation in 2D turbulent flows using reinforcement learning. Chaos 29, 103138 (2019).
Verma, S., Novati, G. & Koumoutsakos, P. Efficient collective swimming by harnessing vortices through deep reinforcement learning. Proc. Natl Acad. Sci. USA 115, 5849–5854 (2018).
Jiao, Y. et al. Learning to swim in potential flow. Phys. Rev. Fluids 6, 050505 (2021).
Cichos, F., Gustavsson, K., Mehlig, B. & Volpe, G. Machine learning for active matter. Nat. Mach. Intell. 2, 94–103 (2020).
Tsang, A. C. H., Demir, E., Ding, Y. & Pak, O. S. Roads to smart artificial microswimmers. Adv. Intell. Syst. 2, 1900137 (2020).
Colabrese, S., Gustavsson, K., Celani, A. & Biferale, L. Flow navigation by smart microswimmers via reinforcement learning. Phys. Rev. Lett. 118, 158004 (2017).
Alageshan, J. K., Verma, A. K., Bec, J. & Pandit, R. Machine learning strategies for path-planning microswimmers in turbulent flows. Phys. Rev. E 101, 043110 (2020).
Schneider, E. & Stark, H. Optimal steering of a smart active particle. Europhys. Lett. 127, 64003 (2019).
Muiños-Landin, S., Fischer, A., Holubec, V. & Cichos, F. Reinforcement learning with artificial microswimmers. Sci. Robot. 6, eabd9285 (2021).
Yang, Y., Bevan, M. A. & Li, B. Micro/nano motor navigation and localization via deep reinforcement learning. Adv. Theory Simul. 3, 2000034 (2020).
Tsang, A. C. H., Tong, P. W., Nallan, S. & Pak, O. S. Self-learning how to swim at low Reynolds number. Phys. Rev. Fluids 5, 074101 (2020).
Hartl, B., Hübl, M., Kahl, G. & Zöttl, A. Microswimmers learning chemotaxis with genetic algorithms. Proc. Natl Acad. Sci. USA 118, e2019683118 (2021).
Najafi, A. & Golestanian, R. Simple swimmer at low Reynolds number: three linked spheres. Phys. Rev. E 69, 062901 (2004).
Ledesma-Aguilar, R., Löwen, H. & Yeomans, J. A circle swimmer at low Reynolds number. Eur. Phys. J. E 35, 1–9 (2012).
Avron, J. E., Kenneth, O. & Oaknin, D. H. Pushmepullyou: an efficient microswimmer. New J. Phys. 7, 234 (2005).
Golestanian, R. & Ajdari, A. Stochastic low Reynolds number swimmers. J. Phys. Condens. Matter 21, 204104 (2009).
Alouges, F., DeSimone, A., Giraldi, L. & Zoppello, M. Self-propulsion of slender microswimmers by curvature control: N-link swimmers. Int. J. Non-Linear Mech. 56, 132–141 (2013).
Wang, Q. Optimal strokes of low Reynolds number linked-sphere swimmers. Appl. Sci. 9, 4023 (2019).
Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. Proximal policy optimization algorithms. Preprint at arXiv:1707.06347 (2017).
Liu, Y., Zou, Z., Tsang, A. C. H., Pak, O. S. & Young, Y.N. Mechanical rotation at low Reynolds number via reinforcement learning. Phys. Fluids 33, 062007 (2021).
Happel, J. & Brenner, H. Low Reynolds Number Hydrodynamics: with Special Applications to Particulate Media (Noordhoff International Publishing, 1973).
Kim, S. & Karrila, S. J. Microhydrodynamics: Principles and Selected Applications (Dover, New York, 2005).
Dhont, J. An Introduction to Dynamics of Colloids (Elsevier, 1996).
Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, Cambridge, 1998).
Golestanian, R. & Ajdari, A. Analytic results for the three-sphere swimmer at low Reynolds number. Phys. Rev. E 77, 036308 (2008).
Howse, J. R. et al. Self-motile colloidal particles: from directed propulsion to random walk. Phys. Rev. Lett. 99, 048102 (2007).
Lobaskin, V., Lobaskin, D. & Kulić, I. M. Brownian dynamics of a microswimmer. Eur. Phys. J. Spec. Top. 157, 149–156 (2008).
Dunkel, J. & Zaid, I. M. Noisy swimming at low Reynolds numbers. Phys. Rev. E 80, 021903 (2009).
Acknowledgements
Funding support by the National Science Foundation (Grant Nos. 1830958 and 1931292 to O.S.P. and Grant Nos. 1614863 and 1951600 to Y.N.Y.) is gratefully acknowledged. Y.N.Y. acknowledges support from Flatiron Institute, part of Simons Foundation. A.C.H.T. acknowledges funding support from the Croucher Foundation. Z.Z. and O.S.P. acknowledge the use of computational resources at the WAVE computing facility (enabled by the E.L. Wiegand Foundation) at Santa Clara University. We also thank Yi Fang for useful discussion.
Author information
Authors and Affiliations
Contributions
Z.Z., Y.L., Y.N.Y., O.S.P., and A.C.H.T. designed research; Z.Z., Y.L., Y.N.Y., O.S.P., and A.C.H.T. performed research; Z.Z., Y.L., Y.N.Y., O.S.P., and A.C.H.T. analyzed data; and Z.Z., Y.L., Y.N.Y., O.S.P., and A.C.H.T. wrote the paper.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Physics thanks Giovanni Volpe and the other, anonymous, reviewers for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zou, Z., Liu, Y., Young, Y.-N. et al. Gait switching and targeted navigation of microswimmers via deep reinforcement learning. Commun Phys 5, 158 (2022). https://doi.org/10.1038/s42005-022-00935-x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s42005-022-00935-x