Learning efficient navigation in vortical flow fields

Efficient point-to-point navigation in the presence of a background flow field is important for robotic applications such as ocean surveying. In such applications, robots may only have knowledge of their immediate surroundings or be faced with time-varying currents, which limits the use of optimal control techniques. Here, we apply a recently introduced Reinforcement Learning algorithm to discover time-efficient navigation policies that steer a fixed-speed swimmer through unsteady two-dimensional flow fields. The algorithm entails inputting environmental cues into a deep neural network that determines the swimmer's actions, and deploying Remember and Forget Experience Replay. We find that the resulting swimmers successfully exploit the background flow to reach the target, but that this success depends on the sensed environmental cue. Surprisingly, a velocity-sensing approach significantly outperformed a bio-mimetic vorticity-sensing approach, achieving a near-100% success rate in reaching the target locations while approaching the time efficiency of optimal navigation trajectories.

For navigating through an unsteady von Kármán vortex street, flow sensing appeared critical for learning effective swimming strategies with Reinforcement Learning. For navigating steady flows with a fixed target, however, an RL swimmer may be able to navigate simply by forming a one-to-one correspondence between its position and the fixed background flow field.
Consider the steady 2D flow past a cylinder at a Reynolds number of 40. Here, the target position is fixed just downstream of the cylinder, and the swimmers are started randomly throughout the entire domain with a swimming speed of U_swim = 0.5 U_∞ (see Supplementary Figure 2). Successful swimmers generally start close enough to the cylinder to use the wake to navigate to the target, while unsuccessful swimmers generally start too far away from the cylinder to reach the target before being swept downstream.
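To make the swimmer kinematics concrete, the sketch below steps a fixed-speed swimmer whose velocity is the local background flow plus U_swim in the chosen heading. This is a minimal illustration, assuming a hypothetical flow_fn and time step; it is not the paper's simulation code.

```python
import numpy as np

def step(pos, theta, flow_fn, u_swim=0.5, dt=0.01):
    """Advance the fixed-speed swimmer one time step: its velocity is
    the local background flow plus U_swim in the chosen direction theta.
    flow_fn and dt are hypothetical stand-ins; u_swim = 0.5 corresponds
    to U_swim = 0.5 U_inf with the freestream speed normalized to 1."""
    u, v = flow_fn(pos)  # background flow sampled at the swimmer's position
    vel = np.array([u + u_swim * np.cos(theta),
                    v + u_swim * np.sin(theta)])
    return pos + dt * vel
```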
We trained a flow-blind swimmer (s = {∆x, ∆y}) and a vorticity swimmer (s = {∆x, ∆y, ω_n, ω_{n−1}}). Even though the flow-blind swimmer had no knowledge of the background flow, it trained as quickly as the vorticity swimmer, as seen in the evolution of the cumulative reward over training (Supplementary Figure 1). Additionally, both swimmers were equally successful at reaching the target after training, which is visualized by plotting the region of starting points from which the swimmer can reach the target (Supplementary Figure 2). This region is computed by extracting the ridges of the finite-time Lyapunov exponent (FTLE) field [1] of the flow field formed by the background flow plus the swimmer's learned policy.
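As a minimal sketch of this computation, the snippet below evaluates an FTLE field on a grid of starting points by advecting tracers under a combined velocity field and measuring the stretching of the flow map; ridges of the resulting field then delineate the boundary of the region of attraction. The callable velocity_fn (background flow plus policy velocity), the integration time, and the forward-Euler integrator are illustrative assumptions.

```python
import numpy as np

def ftle_field(velocity_fn, xs, ys, T, dt=0.01):
    """FTLE on a grid of initial conditions. velocity_fn(x, y) -> (u, v)
    is a hypothetical stand-in for the background flow plus the
    swimmer's learned policy velocity."""
    X, Y = np.meshgrid(xs, ys)
    x, y = X.copy(), Y.copy()
    for _ in range(int(T / dt)):          # advect the tracer grid (Euler)
        u, v = velocity_fn(x, y)
        x = x + dt * u
        y = y + dt * v
    # Deformation gradient of the flow map via finite differences.
    dxdx = np.gradient(x, xs, axis=1); dxdy = np.gradient(x, ys, axis=0)
    dydx = np.gradient(y, xs, axis=1); dydy = np.gradient(y, ys, axis=0)
    ftle = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            F = np.array([[dxdx[i, j], dxdy[i, j]],
                          [dydx[i, j], dydy[i, j]]])
            C = F.T @ F                   # Cauchy-Green strain tensor
            lam_max = np.linalg.eigvalsh(C)[-1]
            ftle[i, j] = np.log(lam_max) / (2.0 * abs(T))
    return ftle
```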

Supplementary Figure 2. Success of a naïve swimmer, flow-blind RL swimmer, and vorticity RL swimmer at navigating the steady flow past a cylinder. The plotted curves display the right-hand (i.e., downstream) boundary of the region in which the swimmer can reach the target. This region of attraction to the target is identical for both RL swimmers, and larger than the region for the naïve swimmer. This implies that in this steady flow field, both RL swimmers are equally successful at reaching the target despite one swimmer lacking flow-sensing abilities.
Because an RL swimmer can navigate steady flow with position alone, the unsteady cylinder wake was chosen for testing navigation based on flow sensing. Additionally, the starting time was randomized, as a swimmer presented with a repeated, deterministic snapshot of a flow field could likewise navigate by memorizing the flow as a function of position alone.

SUPPLEMENTARY NOTE 2 - REINFORCEMENT LEARNING ALGORITHM
We employed the V-RACER algorithm for training the deep RL swimmers using the smarties framework. Some details of the algorithm are presented here, but a complete description can be found in [2].
The goal of V-RACER is to train the weights w of a neural network using experiences with the environment. At each time step t, the neural network takes in the agent's state s_t as an input and outputs the mean action µ^w, standard deviation σ^w, and value estimate v^w. During training, the swimmer takes an action a_t sampled from the normal distribution N(µ^w, (σ^w)²). The architecture of this deep neural network is shown below in Supplementary Figure 3.
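The sketch below illustrates this head structure in PyTorch: a shared body feeds three outputs (µ^w, σ^w, v^w), and the action is sampled from N(µ^w, (σ^w)²) during training. The two hidden layers of 128 units match the network size stated in this supplement, while the tanh activation and the softplus used to keep σ^w positive are illustrative assumptions, not details taken from [2].

```python
import torch
import torch.nn as nn

class VRacerNet(nn.Module):
    """Sketch of the V-RACER head structure: state s_t in; mean action
    mu^w, standard deviation sigma^w, and value estimate v^w out."""

    def __init__(self, state_dim: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, 128), nn.Tanh(),
            nn.Linear(128, 128), nn.Tanh(),
        )
        self.mu_head = nn.Linear(128, 1)      # mean swimming angle mu^w
        self.sigma_head = nn.Linear(128, 1)   # pre-activation for sigma^w
        self.value_head = nn.Linear(128, 1)   # value estimate v^w

    def forward(self, s: torch.Tensor):
        h = self.body(s)
        mu = self.mu_head(h)
        sigma = nn.functional.softplus(self.sigma_head(h)) + 1e-4
        return mu, sigma, self.value_head(h)

# During training, sample the swimming angle; after training, act with the mean.
net = VRacerNet(state_dim=4)           # e.g. s = {dx, dy, u, v}
mu, sigma, v = net(torch.zeros(1, 4))  # placeholder state
theta_train = torch.normal(mu, sigma)  # theta ~ N(mu^w, (sigma^w)^2)
theta_eval = mu                        # theta = mu^w
```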
At every time step in an episode, the agent takes the action a_t and stores information such as the state (s_t), reward information (r_t, V̂^tbc_t), the current policy parameters (µ_t, σ_t), and the action it took (a_t) in the Replay Memory (RM). The number of recorded experiences is kept to a fixed size: as new experiences are added, the oldest ones are forgotten.

Supplementary Figure 3. The agent's state s_t is inputted into a deep neural network, and the output is a mean action µ^w, standard deviation σ^w, and value v^w. During training, the swimmer chooses a swimming angle by random sampling from a normal distribution: θ ∼ N(µ^w, (σ^w)²). After training, the mean action is selected (θ = µ^w).
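A minimal sketch of such a fixed-size, first-in-first-out Replay Memory follows; the capacity and the tuple layout are illustrative assumptions.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size FIFO replay memory: once capacity is reached, adding a
    new experience silently discards the oldest one (deque semantics)."""
    def __init__(self, capacity=100_000):   # capacity is illustrative
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, v_tbc, mu, sigma):
        # One experience: state, action, reward info, and behavior policy.
        self.buffer.append((s, a, r, v_tbc, mu, sigma))

    def sample(self, batch_size):
        idx = [random.randrange(len(self.buffer)) for _ in range(batch_size)]
        return [self.buffer[i] for i in idx]
```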
To update the network's weights and biases, experiences are sampled from the RM, and a gradient is computed for each past experience according to two loss functions. The value output v^w of the neural network is trained using the following loss function, which seeks to improve the neural network's value estimate v^w(s_t) to better match the estimated value from experiences, V̂^tbc_t:

$$L^{\text{val}}(w) = \frac{1}{2}\left(v^w(s_t) - \hat{V}^{tbc}_t\right)^2 \qquad (1)$$

The mean and standard deviation outputs (µ^w and σ^w) are trained using the following loss function, which seeks to change the policy to improve on-policy returns:

$$L^{\text{pol}}(w) = -\,\rho^w_t\left(\hat{V}^{tbc}_t - v^w(s_t)\right) \qquad (2)$$

where the importance weight ρ^w_t is the ratio of the probability of selecting a_t given s_t under the current policy π^w(· | s_t) = N(µ^w, (σ^w)²) and under the old policy β_t(· | s_t) = N(µ_t, σ_t²):

$$\rho^w_t = \frac{\pi^w(a_t \mid s_t)}{\beta_t(a_t \mid s_t)} \qquad (3)$$

The estimate of the return V̂^tbc_t is calculated using a recursive formula, starting at the terminal time step N of an episode in the RM and stepping backwards:

$$\hat{V}^{tbc}_t = v^w(s_t) + \min\{1, \rho^w_t\}\left[r_{t+1} + \gamma\,\hat{V}^{tbc}_{t+1} - v^w(s_t)\right] \qquad (4)$$

where γ is the discount factor. The gradient estimate is then:

$$\hat{g}_{t_i} = \rho^w_{t_i}\left(\hat{V}^{tbc}_{t_i} - v^w(s_{t_i})\right)\nabla_w \log \pi^w(a_{t_i} \mid s_{t_i}) \qquad (5)$$

If the old policy is too dissimilar from the current policy, the experience is deemed far-policy and its gradient is set to zero:

$$\hat{g}_{t_i} = 0 \quad \text{if} \quad \rho^w_{t_i} \notin \left[1/c_{\max},\, c_{\max}\right] \qquad (6)$$

Additionally, this gradient estimate is mixed with a gradient that pulls the current policy toward the recorded behaviors, to prevent the policy from changing too greatly in an unstable manner:

$$\hat{g}^{\text{ReF-ER}}_{t_i} = \hat{g}_{t_i} - \lambda\,\nabla_w D_{\mathrm{KL}}\!\left(\beta(\cdot \mid s_{t_i})\,\big\|\,\pi^w(\cdot \mid s_{t_i})\right) \qquad (7)$$

where D_KL is the Kullback–Leibler divergence and λ is a chosen parameter. For the RL swimmers, the neural network was updated using ĝ^ReF-ER_{t_i} at each time step. Additional details, such as pseudocode, hyperparameters, and the scaling of the neural network inputs, can be found in [2].
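As a sketch of how Eqs. (4) and (6) can be evaluated, the snippet below runs the backward recursion for V̂^tbc over one stored episode and tests whether an experience is near-policy. The min(1, ρ) truncation in the recursion and the example value of c_max are assumptions consistent with the V-RACER reference [2], not values taken from this supplement.

```python
import numpy as np

def vtbc_targets(rewards, values, rhos, gamma=0.99, v_terminal=0.0):
    """Backward recursion for the return estimates V^tbc, Eq. (4).
    rewards[t] is r_{t+1}, values[t] is v^w(s_t), rhos[t] is the
    importance weight rho^w_t; gamma and v_terminal are assumptions."""
    N = len(rewards)
    v_tbc = np.zeros(N + 1)
    v_tbc[N] = v_terminal                 # start at the terminal step
    for t in range(N - 1, -1, -1):        # step backwards through episode
        rho_bar = min(1.0, rhos[t])       # truncated importance weight
        v_tbc[t] = values[t] + rho_bar * (
            rewards[t] + gamma * v_tbc[t + 1] - values[t])
    return v_tbc[:N]

def near_policy(rho, c_max=4.0):
    """Eq. (6): keep the gradient only for near-policy experiences,
    1/c_max < rho < c_max; otherwise it is set to zero."""
    return (1.0 / c_max) < rho < c_max
```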
A network size of 128×128 was selected to ensure the network would be sufficiently expressive to learn effective navigation strategies. To confirm this, we found that a smaller 64×64 network was able to match the performance of the 128×128 network, as can be seen below in Supplementary Figure 4.

Supplementary Figure 4. Evolution of the cumulative reward for a velocity swimmer with a 128×128 neural network and a 64×64 neural network, which shows that a 128×128 neural network is sufficiently large to express an effective swimming policy.

SUPPLEMENTARY NOTE 3 - ADDITIONAL RL SWIMMERS
In addition to the flow-blind, vorticity, and velocity RL swimmers, several other swimmers with different states were investigated. The u-velocity swimmer has access to only the x-component of the flow velocity (s = {∆x, ∆y, u}). The transverse-velocity swimmer has access to the components of the fluid velocity measured relative to the swimmer's previous direction of travel (s = {∆x, ∆y, u⊥, v⊥}). Finally, the vorticity-velocity swimmer has access to both the vorticity and the velocity of the fluid (s = {∆x, ∆y, ω_n, ω_{n−1}, u, v}).
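The sketch below assembles each observation vector; in particular, the rotation used for the transverse-velocity swimmer reflects our reading of (u⊥, v⊥) as the velocity components in a frame aligned with the previous heading, which is an assumption rather than a detail stated here.

```python
import numpy as np

def swimmer_state(kind, dx, dy, u, v, w_n=0.0, w_nm1=0.0, heading=0.0):
    """Assemble the observation vector for each swimmer variant (a
    sketch; argument names and the rotated frame are assumptions)."""
    if kind == "flow_blind":
        return np.array([dx, dy])
    if kind == "vorticity":
        return np.array([dx, dy, w_n, w_nm1])
    if kind == "velocity":
        return np.array([dx, dy, u, v])
    if kind == "u_velocity":
        return np.array([dx, dy, u])
    if kind == "transverse":
        c, s = np.cos(heading), np.sin(heading)
        u_par = c * u + s * v    # component along the previous heading
        u_perp = -s * u + c * v  # component perpendicular to it
        return np.array([dx, dy, u_par, u_perp])
    if kind == "vorticity_velocity":
        return np.array([dx, dy, w_n, w_nm1, u, v])
    raise ValueError(f"unknown swimmer kind: {kind}")
```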
The success rate of all swimmers is plotted in Supplementary Figure 5. All swimmers with access to partial flow information (e.g. vorticity, or one velocity component) have a slightly higher success rate than the flow-blind swimmer. The two swimmers with access to both components of the fluid velocity reach a nearly 100 percent success rate, and the inclusion of vorticity in addition to the flow velocity did not appear to impact training time.

Supplementary Figure 5. Success rate of each RL swimmer.

SUPPLEMENTARY NOTE 4 -SENSOR NOISE
All swimmers have thus far been presented with noiseless measurements of the background flow and the swimmer's position. In a biological context, such low-noise measurements may not be unreasonable: seals have been reported to detect flow velocities as low as 245 microns per second using specially adapted whiskers [3]. In robotic systems, however, measurement noise could arise from a variety of sources, such as the measurement device itself, or from small-scale turbulence when navigating in a background flow. Given that turbulence is ubiquitous in real-world flows and the RL swimmers in the present study rely on background flow measurements to navigate, we investigated how noise in the flow measurements affects the success rate of RL swimmers. We found that the velocity swimmer can be robust to realistic amounts of flow-sensing noise.
To simulate flow measurement noise, an already-trained velocity RL swimmer was tasked with navigating across the unsteady cylinder wake with zero-mean Gaussian noise added to both components of its velocity measurement. The standard deviation of the velocity sensor noise, σ_sensor, was varied between 0 and 0.5 U_∞. The success rates of the velocity swimmer for various amounts of noise are shown in Supplementary Figure 6.
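A minimal sketch of this perturbation is shown below; the noise is drawn independently for each velocity component at every measurement, and the seeded generator is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility (illustrative)

def noisy_velocity(u, v, sigma_sensor, u_inf=1.0):
    """Add zero-mean Gaussian noise to both components of the velocity
    measurement; sigma_sensor is a fraction of the freestream U_inf."""
    noise = rng.normal(0.0, sigma_sensor * u_inf, size=2)
    return u + noise[0], v + noise[1]

# Sweep the noise levels tested in Supplementary Figure 6.
for sigma in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    u_meas, v_meas = noisy_velocity(0.8, -0.1, sigma)  # example reading
```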
The velocity RL swimmer demonstrated robustness to noise in its local velocity measurement, showing little decrease in the success rate with a σ_sensor of up to 10 percent of the freestream flow velocity. With higher amounts of noise, the success rate decreased, although the velocity swimmer can still navigate more successfully than the flow-blind swimmer even when σ_sensor reaches 40 percent of U_∞. It is not surprising that velocity measurements are less useful at this noise level, because the noise is comparable to the measured signal.

Supplementary Figure 6. Success rates of a velocity RL swimmer with various amounts of zero-mean Gaussian noise added to its local velocity measurement (position measurements were left noiseless). The stated success rates are averaged over 12,500 episodes and are shown with one standard deviation arising from the five times each swimmer was trained.

A convenient, albeit crude, metric for comparison with real fluid flows is the turbulence intensity, defined as the ratio of the root-mean-square velocity fluctuations to the mean velocity (u_rms/Ū). While turbulence does not exactly follow a Gaussian distribution [4] and not all scales of turbulence would be small enough to appear as random fluctuations to a swimmer [5], we can compare the turbulence intensity of real flows to the random noise added to the swimmer's sensor readings.
Typical turbulence intensities used for the design of autonomous underwater vehicles range from 0.2 to 9 percent [6, 7]. For tidal flows in the ocean, the turbulence intensity has been measured to be approximately 12 to 13 percent [8], but could be higher near surface waves or lower in the mid-ocean. For aerial vehicles, typical turbulence intensities range from 1.2 to 12.6 percent [5]. By comparison, the RL swimmer shows a minimal decrease in its success rate with a similar magnitude of random noise in its velocity measurement. To be sure, the turbulence intensity can range from 0 to infinity depending on the presence of a mean background current, and can vary in magnitude for different velocity components [5], so this comparison is only a rough approximation of in situ noise levels. A physical implementation of RL navigation with real-world noise would provide a more conclusive result, but it is nevertheless promising that perfectly noiseless flow measurements are not required for high navigation success in the simulated case.
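For reference, the turbulence intensity used in this comparison can be computed directly from a velocity time series, as in the short sketch below (the synthetic series is illustrative).

```python
import numpy as np

def turbulence_intensity(u):
    """Turbulence intensity u_rms / U_bar: the RMS of the velocity
    fluctuations about the mean, divided by the mean speed."""
    u = np.asarray(u, dtype=float)
    u_bar = u.mean()
    u_rms = np.sqrt(np.mean((u - u_bar) ** 2))
    return u_rms / abs(u_bar)

# Example: 10% Gaussian fluctuations about a unit mean current -> TI ~ 0.1.
rng = np.random.default_rng(1)
print(turbulence_intensity(1.0 + rng.normal(0.0, 0.1, 10_000)))
```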