A numerical study of fish adaption behaviors in complex environments with a deep reinforcement learning and immersed boundary–lattice Boltzmann method

Fish adaption behaviors in complex environments are of great importance in improving the performance of underwater vehicles. This work presents a numerical study of the adaption behaviors of self-propelled fish in complex environments by developing a numerical framework of deep learning and immersed boundary–lattice Boltzmann method (IB–LBM). In this framework, the fish swimming in a viscous incompressible flow is simulated with an IB–LBM which is validated by conducting two benchmark problems including a uniform flow over a stationary cylinder and a self-propelled anguilliform swimming in a quiescent flow. Furthermore, a deep recurrent Q-network (DRQN) is incorporated with the IB–LBM to train the fish model to adapt its motion to optimally achieve a specific task, such as prey capture, rheotaxis and Kármán gaiting. Compared to existing learning models for fish, this work incorporates the fish position, velocity and acceleration into the state space in the DRQN; and it considers the amplitude and frequency action spaces as well as the historical effects. This framework makes use of the high computational efficiency of the IB–LBM which is of crucial importance for the effective coupling with learning algorithms. Applications of the proposed numerical framework in point-to-point swimming in quiescent flow and position holding both in a uniform stream and a Kármán vortex street demonstrate the strategies used to adapt to different situations.

It has long been observed that fish can adapt to different environments and achieve their goals optimally. These adaption behaviors are essential for survival since they allow a fish to obtain and save energy as well as avoid risks. A typical example of adaption behavior is prey capture, in which the fish is trying to reach a target with given time (generalized as point-to-point swimming). Another important behavior that has been observed in many fishes is rheotaxis 1,2 , which is a tendency of the fish to directly face into an oncoming current to capture food carried by the flow. Furthermore, a unique energy-saving behavior termed Kármán gaiting is observed in rainbow trout and other fishes when swimming behind a bluff body in the flow, which is characterized by large-amplitude lateral motion of the body occurring at a low frequency [3][4][5] . In addition, fish may exploit the vortices shedding from its leading one or its fins to improve its swimming performance [6][7][8] , of which the propulsion mechanism can be further revealed by separating the drag and thrust 9 . In nature, fish are able to achieve the above mentioned behaviors in a very quick and efficient way, with which current man-made vehicles cannot compete. Therefore, it is important to achieve them in numerical simulation, with which researchers are able to understand the design concepts of fish, and to put these concepts into man-made vehicles.
The mechanisms underlying these adaption behaviors are complex and have not been fully understood. The cooperation between the sensory system, the neural system and the muscles, which forms a precise and robust feedback control system, is of primary importance. The sensory system is responsible for continuously collecting information about the environment which is input into the neural system so that the fish can update its knowledge of its surroundings in real time. Based on this information, the fish may change its swimming kinematics via the muscles to achieve its goals. During over 400 millions years of evolution, a variety of sensory systems

Numerical model and methodology
The shape and motion of a trout model. The shape of the 2D swimmer model here is reconstructed from the cross-section of a trout 49 as shown in Fig. 1. The half thickness of the body is mathematically approximated by www.nature.com/scientificreports/ where L is the body length, and s l is the arc length along the mid-line of the body. The motion of the body includes three parts as shown in Fig. 2: the undulation motion of the body ( h l in x l -y l system), the translation of the mass center ( r c ), and the body rotation around the mass center ( θ ). The undulatory motion can be taken as the superposition of different waves propagating from head to tail. In order to implement the DRQN in an easy way as explained later, every wave lasts only half cycle. In the n-th half cycle, the mid-line lateral displacement is determined by where θ l is the deflection angle of the mid-line with respect to axis x l as shown in Fig. 2, θ lmax is the maximum deflection angle at the trailing edge, n is the wavelength, T n is the period, t is the time, t 0n = 0 for n = 1 and n−1 1 T n for n > 1 , and h is the waveform function described by where c 0−5 can be determined by h(0) = h( n /2) = 1 , h ′ (0) = h ′ ( n /2) = 0 , h ′′ (0) = −(2π/ n−1 ) 2 , and h ′′ ( n /2) = −(2π/ n ) 2 . This undulatory motion is constructed based on extensive videos of rainbow trout free swimming, rheotaxis and Kármán gaiting [49][50][51] . It allows the swimmer to change its periods, amplitudes and wavelengths smoothly and arbitrarily every half period. Therefore, the swimmer model is able to choose appropriate combinations of different kinematics to achieve different maneuvering movements such as accelerating, decelerating and yawing, which enables the fish to handle complex and fast-changing environments.
The translational and rotational motions of the swimmer are determined by FSI in the global coordinate system (x, y) according to: where m is the mass of the fish, F t is the total hydrodynamic force on fish body, I z is the inertia moment of the center of the mass, and M z is the total hydrodynamic torque on the center of the mass. IB-LBM for the fluid-structure-interaction system. An IB-LBM is adopted to solve the FSI system [52][53][54] . In this method, the fluid dynamics is obtained by solving the multiple-relaxation-time lattice Boltzmann equation, where f is the distribution function, x = (x, y) is the space coordinate, c i is the discrete lattice velocity, t is time step, and i and G i are respectively the collision operator and the source term. Here i and G i are obtained by www.nature.com/scientificreports/ where M is a 9 × 9 transformation matrix, S is the relaxation matrix, I is the identity matrix, and f eq i and F i are respectively the equilibrium distribution function and the effect of the fluid body force. f eq i and F i are determined by where w i is a weighting coefficient, ρ is the density of the fluid, u is the velocity of the fluid, c s = �x/( √ 3�t) is the lattice speed of sound, x is the lattice spacing, and g is the body force. In this work, D2Q9 is used. The M and S matrices of this model can be found in Lallemand and Luo 52 and Krüger et al. 55 . c 0 -c 8 are (0, 0), (±1, 0), (0, ±1), (±1, ±1) . w 0 = 4/9 , w 1 -w 4 = 1/9 , and w 5 -w 8 = 1/36.
Once the distribution function f is obtained, the macro fluid density ρ , velocity u , pressure p, viscous stress tensor σ αβ and fluid force density on the boundary F f in the new time step are calculated with where n Bβ is the outer normal vector of the boundary S B , δ αβ is the Kronecker delta, and α and β are dummy indices. The forces and moment of the fluid exerting on the swimmer model can be calculated with where F D is the drag, F L is the lift, X is the Lagrangian coordinate on the fish surface, s 0 is the arc length along the surface of the swimmer, and e x , e y and e z are the unit vectors along x-axis, y-axis and yaw axis, respectively.
In addition, the IBM is utilized to handle the boundary condition at the fluid-structure interface according to where F IB is the Lagrangian force on the immersed boundary, g IB is the fluid body force due to the boundary, S B is the boundary surface of the rigid body, η is the feedback coefficient, u B is the prescribed moving speed of the boundary surface, and V f is the fluid domain. The value of η is determined by the geometry of the body, which can be found in Refs. 53,54 . δ is approximated with a kernel function, Furthermore, a multi-block geometry-adaptive Cartesian grid is coupled with the IB-LBM to improve the computational efficiency. A detailed description of this grid structure and method can be found in Refs. 53,54 . The fluid-structure system is coupled by an explicit FSI coupling according to, www.nature.com/scientificreports/ where u c is the velocity of the mass center, and ω is the angular velocity. Since no iteration is required at each time step, this method is much more efficient than strong coupling methods 39,56 . Yoshino et al. 43 compared the computational efficiency between the LBM and the finite difference method (FDM) in modeling lid-driven cavity flows, and found that the CPU time of each step for the LBM is about 1/3 of that for the FDM, indicating the LBM is more efficient than the FDM in modeling fluid dynamics, which is of crucial importance for the coupling with the reinforcement learning method as each learning application normally requires thousands of simulation cases. The FSI process implemented by the IB-LBM is shown in Algorithm 1.
Deep reinforcement learning. Deep reinforcement learning combines reinforcement learning with an artificial neural network to approach human-level control in complex real-world problems 57 . One of the most successful methods in reinforcement learning is one-step Q-learning. In this study, the DRQN 58 is used where a one-step Q-learning is coupled with a three-layer long-short-term-memory recurrent neural network (LSTM-RNN). Q-learning describes a general process of an agent learning how to achieve a goal during prolonged and continued interaction with its environment by trial and error 18 . During this process, the agent must be able to sense a defined set of parameters representing the state of the environment (denoted by s) and take actions (denoted by a) to affect it. Each action is assessed with a scalar number called the reward (denoted by r) whose value indicates whether the agent moves towards or away from the goal by taking the action. In order to achieve the goal, the agent must seek actions that maximize its expected cumulative rewards in the long run (also known as the action-value function) which is defined as, where s n and a n are respectively the n-th state and action, r n+1 is the reward of n-th action, r n+2 , r n+3 and r n+4 are the subsequent rewards, and γ is the discount rate ranging from 0 to 1. If γ = 0 , the agent is termed "myopic" because it only maximizes the immediate rewards. Larger γ means that the agent is more "far-sighted". For all cases in this paper, γ is chosen to be 0.99 as in Ref. 57 . The principle in Q-learning is that the agent explores different actions in different states and evaluates the actions with Q(s, a), so that when the state reoccurs, the agent will choose the optimal action to achieve its goal.
(25) Q(s n , a n ) = E[r n+1 + γ r n+2 + γ 2 r n+3 + γ 3 r n+4 + · · · | s n , a n ], www.nature.com/scientificreports/ Q-learning suffers from the classical "curse of dimensionality" problem, where the data and computational resource required grow exponentially with the dimensionality of the state and action spaces. Deep reinforcement learning has partly resolved this problem by approximating the action-value function with a neural network Q, which can generalize past experience to new situations 57 . In this work, an LSTM-RNN composed of three layers of 64 LSTM cells and a linear output layer is adopted. In order to find the optimal action-value function, the neural network is iteratively updated by minimizing the temporal difference error where Q * (s, a) is the optimal (maximized) action-value function, i.e. Q * (s n+1 , a * n+1 ) = max a Q(s n+1 , a) for all actions in state s n+1 , and a * is the optimal action maximizing Q. This can be achieved by updating the network weights via gradient descent methods, where ws i is weight of the network, α is the learning rate. For efficient updating, the gradient descent is performed with the Adam optimization algorithm 59 .
(26) TD err = r n+1 + γ Q * (s n+1 , a * n+1 ) − Q(s n , a n ), www.nature.com/scientificreports/ The state, action, reward, and next state quadruplet (s n , a n , r n+1 , s n+1 ) generated with agent-environment interaction are required to update the neural network. A replay memory D and a target neural network Q target are introduced in the iteration process 57 . The replay memory is used to store large numbers of quadruplets (s n , a n , r n+1 , s n+1 ) which are sampled randomly in a mini-batch [. . . , (s i n , a i n , r i n+1 , s i n+1 ), . . .] to update Q. This technique breaks the correlation between the samples to avoid local optimization 57 . The sizes of the replay memory ( N D ) and the mini-batch ( N b ) are respectively set as N D = 5000 and N b = 100 . The target neural network is used to generate the optimal action value Q * (s n+1 , a * n+1 ) in Eq. (27). It is updated with Q for every N tgt action steps to avoid the instability caused by frequent update of the optimal action-value function 57 . N tgt is set to be 100. The learning parameters (i.e. N D , N b and N tgt ) have been tested to ensure the stability of the learning process.
The detailed interaction procedure is summarized in Algorithm 2 where the agent-environment interaction is broken into N e episodes. Each episode is divided into a sequence of discrete action steps n = 0, 1, 2, 3, · · · . At step n of each action, the agent detects a state s n , and selects an action a n based on a policy π(s, a) which describes the probability of selecting each possible action in each state. At action step n + 1 , in response to the action a n , the agent receives a reward r n+1 , and finds itself in a new state s n+1 . The ǫ-greedy policy 60 is used to select actions, with which the agent chooses the optimal action (also known as exploiting) with probability 1 − ǫ and other actions (also known as exploring) with probability ǫ . ǫ gradually decays from 1 to 0.05 so that the agent explores more at the initial stage of the simulation but exploits more in the long term afterwards.
It should be noted that compared to existing models of learning for fish [21][22][23] , this work incorporates the fish position, velocity and acceleration into the state space in the DRQN; and it considers the amplitude and frequency action spaces as well as the historical effects.

Validation of the fluid solver
The current flow solver has been validated in previous publications 53,54 . Here we further provide applicationspecific validations by focusing on the frequency of vortex shedding from a cylinder in a uniform flow and the swimming speed of an anguilliform swimmer in a quiescent flow. The cases are conducted with 20 computational cores on a workstation with Intel Xeon CPU E5-2650 and OpenMP.  Fig. 3. The mean drag coefficient C D and the peak-to-peak lift coefficient C L at Re = 100 are compared with other studies in Table 1. The simulation requires about 1.44s of CPU time per nondimensional time unit tU/D = 1.0 . Here the drag and lift coefficients are respectively defined by C D = F D /(0.5ρU 2 D) and C L = F L /(0.5ρU 2 D) . Figure 3 and Table 1 show that St, C D and C L predicted by our solver agree well with those in previous publications.

Self-propelled anguilliform swimmer swimming in a quiescent flow.
Here an anguilliform swimmer swimming in a quiescent flow is conducted to validate the capability of the current fluid solver for modelling a self-propelled swimmer. The half thickness of the swimmer is described as 68 The forward velocity u 0 /U ( U = L/T is one body per waving period) predicted by the current solver is shown in Fig. 5 and compared with the results reported by Kern and Koumoutsakos 68 . It is noted that the balanced swimming velocity of the present study is smaller than that of Kern and Koumoutsakos 68 and Case b (with divergence-free correction of body motion) of Gazzola et al. 69 , but agrees well with Case a (without divergencefree correction of body motion) of Gazzola et al. 69 . As the divergence of body motion does not affect the learning process considered in this work, and thus is not corrected in order to save computational costs.

Point-to-point swimming.
Here we apply the coupled approach to the point-to-point swimming of a sub-carangiform swimmer in a quiescent flow. The swimmer of length L is placed in a circular area with radius R = 5L , as shown in Fig. 6. Its goal is to reach the center O from any position within the circular area and arbitrarily given orientation. This goal is reflected by defining a reward as where r tip is the distance between the head of the swimmer and the center O. The swimmer propels itself by periodically generating a travelling wave propagating from head to tail, as defined by Eqs. (2) and (3). In order to achieve high maneuverability, the swimmer can change the wave amplitude every half swimming cycle. Each selected set of parameters is considered as an action. In this case, U = 1L/s is chosen as the characteristic velocity. The period is fixed at TU/L = 1.0 ; the amplitude action base is defined as θ lmax = 0 • , 20 • , 40 • , 60 • , 80 • , 100 • , 120 • , 140 • and 160 • ; and the wavelength is fixed at = L . This parameter set forms an action base of 9 components.
The state is an important component in the DRQN. Theoretically, it should include the information of the swimmer and the ambient flow. The information of the swimmer includes the body waveform, position, pitch angle, velocity, angular velocity, acceleration and angular acceleration of the body. The flow information includes the flow velocity and pressure in the whole flow field. The historical evolution of the flow should also be considered. Therefore, it is impossible to consider the flow information as a simple definition of the state. One way to resolve this problem is to consider the information of the swimmer only as that in the work of Gazzola et al. 21 , Novati et al. 22 and Verma et al. 23 . However, ignoring the flow information will make the learned policy inaccurate as shown in Fig. 7a, where only the body waveform, position and pitch angle are considered in all the states. The fish is able to reach its destination in different stages of the learning process, but the path is highly diverse and complicated, and not improving with learning. Figure 7b shows the total number of periods ( N p ) the fish takes to reach its destination for all learning episodes. In the first 500 episodes, the fish dramatically decreases its time needed to reach the goal, indicating the fish is continuously learning and improving its swimming policy. However, after 500 episodes, the required time grows gradually, indicating the policy is getting worse as the learning progresses. This is because the defined states without considering the flow information is not able to capture the variability of the environment.
Here, we propose a method to consider the influence of the flow information in the states without having to deal with the complexity of the flow. Considering the flow is developed from the historical actions and fish dynamics, it is partially reflected by the dynamics and actions of the swimmer in the past time. If the whole historical dynamics and actions are considered in the states, the flow information is naturally included. However, tracking the whole historical dynamics and actions is memory and time consuming and not necessary since the far history only has minor influence on the flow dynamics at current instant. Our simulations show that only considering the historical dynamics and actions of the fish in the last 4 periods is enough to capture the flow dynamics. In order to further reduce the complexity, accelerations are not considered in the state. The state is thus defined by a tuple www.nature.com/scientificreports/ where θ tip is the orientation angle of the swimmer relative to r tip (as defined in Fig. 6), ū cxl and ū cyl are the mean swimming velocities over half a period in the x l and y l directions, and ω is the mean angular velocity over half a period. For a real fish, r tip and θ tip can be directly sensed by the eyes, while ū cxl , ū cyl and ω can be sensed by the lateral line system 11,70,71 . Therefore, it is reasonable to use these quantities to define the state. The simulation is performed for a Reynolds number of Re = ρUL/µ = 1000 . It should be noted that this is not a typical Reynolds number for an adult fish. Instead it is for a juvenile fish less than 5cm swimming in this scope. This Reynolds number is used to reduce the computational cost, while such setup is sufficient to demonstrate the effectiveness of the coupled DRQN and IB-LBM. The computational domain of 50L × 50L is divided into 7 blocks with about 41.3 × 10 3 initial points. The minimum nondimensional grid spacing is �x/L = �y/L = 0.01 near the inner boundaries and the nondimensional time step size is �tU/L = 0.01 . The simulation requires about 2.52s of CPU time per nondimensional time unit tU/L = 1.0 . The learning parameters are set to α = 0.001 and γ = 0.99 , while ǫ decays from 1 to 0.05 gradually. These parameters are chosen to ensure the stability of the learning process.
The learning process is divided into a series of episodes. In each episode, the initial position (r tip ) 0 is randomly chosen between L and 5L and the initial orientation (θ tip ) 0 randomly varies between −90 • and 90 • . The position and orientation of the swimmer are then determined by the FSI with the actions. Once the swimmer exceeds the circular area or reaches the center or reaches 200 periods in the area, the episode ends and another starts. Figure 8 shows the traces of the head during different learning stages and the total number of swimming periods the fish maintains in the swimming area for all episodes considered. As shown in Fig. 8a, the swimmer swims randomly in episode 11. Nevertheless, after a trial and error exploration period, it learns to adjust its orientation and swims around the center O (episode 338). After learning for 545 episodes, it successfully finds a tortuous path to reach the center O. However, at episode 3890, it has learned how to directly swim towards its destination. This is further demonstrated by Fig. 8b, from which it is found that in the first 2000 learning episodes, the total number of swimming periods decreases rapidly. After around 2000 episodes, the total number of swimming periods remains at a low value, indicating the swimmer has found an efficient way to reach its goal. Figure 9 presents the traces when the fish swims to its destination with different (r tip ) 0 and (θ tip ) 0 after learning for 10,000 episodes. 8 cases are studied. In the first 4 cases, (θ tip ) 0 is fixed at 75 • while (r tip ) 0 takes on the values 1L, 2L, 3L and 4L. In the other 4 cases, (r tip ) 0 is fixed at 3L while (θ tip ) 0 takes on the values 0 • , 25 • , 50 • and 75 • . In all cases, the fish directly swims to its destination with a very short path. Figure 10 shows the vorticity contours at different instants while the fish swims to its destination with an initial distance of (r tip ) 0 = 3L and an initial orientation of (θ tip ) 0 = 75 • . Initially the fish is at rest with the destination to its right (see Fig. 10a). Then it undulates with large right amplitude (see Fig. 10b) and small left amplitude (see Fig. 10c) to perform a fast right turn. After directly facing the destination, it swims with nearly equal left (see Fig. 10e) and right (see Fig. 10d)amplitudes. At around 12 periods, the fish successfully reaches the destination (see Fig. 10f).

Rheotaxis.
Here we apply the coupled approach to the rheotaxis swimming of a sub-carangiform swimmer in a uniform flow. Its goal is to hold position in a circular area of radius R = 5L as shown in Fig. 11 for more than 200 periods. The situation is highly unstable since a small displacement in orientation away from the flow direction could lead to high lateral forces making the agent swim away from its original position. This goal is reflected by defining a reward as  Note that the information of the position r c and orientation θ is implied in the translational and rotational velocities, and thus is not necessary for the fish to sense. Therefore, the state is simplified to be where ū cx and ū cy are the mean translational velocities of the center of the mass in each half a period parallel and perpendicular to the flow orientation.
The simulation is performed for a Reynolds number of Re = ρUL/µ = 1000 . The computational domain of 50L × 50L is divided into 7 blocks with about 45.6 × 10 3 initial points. The minimum nondimensional grid spacing is �x/L = �y/L = 0.01 near the inner boundaries and the nondimensional time step size is �tU/L = 0.01 . The simulation requires about 2.99s of CPU time per nondimensional time unit tU/L = 1.0 . The learning parameters are set to α = 0.001 and γ = 0.99 , while ǫ decays from 1 to 0.05 gradually.
The swimmer is initially placed in the center O of the swimming area with its initial orientation angle θ 0 randomly varying between −45 • ≤ θ 0 ≤ 45 • . Figure 12 shows the traces of the center of the mass during different (32) r = −|ū|, (ū cy ) n ,ω n , (ū cx ) n−1 , (ū cy ) n−1 ,ω n−1 , a n−1 . . . , . . . , . . . , . . . , (ū cx ) n−8 , (ū cy ) n−8 ,ω n−8 , a n−8   Fig. 12b), the total number of swimming periods increases rapidly, indicating the fish is learning to hold position. After approximately episode 200, the fish is able to maintain in the swimming area for more than 200 periods, indicating it has found a policy to hold position. Figure 13 compares the traces of the center of the mass and the change of the orientation angle θ after learning for 1000 episodes. 4 cases are studied with initial oritentation θ 0 values of 0 • , 10 • , 20 • and 30 • . The swimmer holds position for more than 200 periods in all cases as shown in Fig. 13a. As shown in Fig. 13b, it rapidly adjusts its orientation during the first 10 periods to align its body against the flow, and thereafter tries to hold its position.
The lateral movement of the tail when the swimmer is holding position is presented in Fig. 14. A repetitive undulating pattern is apparent that lasts for 4 flapping periods. In this pattern, we can identify two types of tail movement: the first is continuously increasing the left amplitude (Pattern 1), the second is continuously increasing the right amplitude (Pattern 2). These flapping patterns trigger two types of wake vortices as shown in Fig. 15. In the first type of wake vortices (Fig. 15a), the vortices form a jet wake deflected slightly to the right side of the swimmer, causing it to move to its left and to rotate clockwise slightly. In the second type of wake vortices (Fig. 15b), a leftward deflected jet is formed causing the swimmer to move to the right and to rotate anticlockwise. These patterns of vortices happen in turn, realizing a dynamical balance in the hydrodynamic forces to hold position in the flow.

Position holding in a Kármán vortex street.
Here we apply the coupled approach to the position holding behavior in a Kármán vortex street. The Kármán vortex street is an example of a drag wake, characterized by a repeating pattern of swirling vortices. It is a complex but mostly predictable flow environment. The abundant vortices make the fluid dynamics in different areas highly diverse and unsteady and there is always a certain amount of unpredictable variation in the vortex behaviors 72 . Furthermore, a fish in the Kármán vortex street selectively explores the flow and swims back and forth when slaloming around the incoming vortices 73 , which makes the encountered flow field more variable and unpredictable.
Liao and Akanyeti 50,72,73 conducted a series of experiments to observe the kinematics of live rainbow trouts in the Kármán vortex street, in which the fish were placed in the wake behind a D-shaped cylinder. They found that the midline kinematics of the fish could be represented as a superimposition of four midlines generated by four motion components: lateral translation, body bending, body rotation and head motion, whose contributions were respectively 67.8% , 19.9% , 9.0% and 3.3% in terms of the swept area. The frequencies of the tail beats matched the vortex shedding frequency. The body wavelength was approximately 25% larger than the wake wavelength. In addition, the peak-to-peak tail beat amplitude was nearly the same as the diameter of the cylinder.
A D-shaped cylinder of diameter D = 0.3L is chosen in our simulation to produce the Kármán vortex street for comparison with the experiment of Liao 50 . The Strouhal number of the vortex street is St = fD/U = 0.1875 resulting in a non-dimensional vortex frequency fL/U = 0.625 and period TU/L = 1.6 . The wavelength of the vortex street is around 1L. The fish is trained in a rectangular area of 8L × 4L . Its goal is to hold its horizontal position in the vortex street for more than 200 periods. The goal is reflected by defining a reward as In this case, the period action base is defined as TU/L = 1.2 , 1.4, 1.6, 1.8 and 2.0; the amplitude action base is defined as θ lmax = 16 • , 34 • , 51 • , 72 • and 97 • ; and the wavelength is fixed at = 1.5L . These parameter sets are within the range of the observation of Liao and Akanyeti 50,72,73 .
The hydrodynamic forces exerted on the fish are also included in the state to better capture the dynamic nature of the flow field. In order to reduce the complexity, body rotation and head motion are not considered, which is based on the observation by Akanyeti and Liao 72 who found that nearly 90% of the body motion of a live rainbow trout will be captured by the present model. Therefore, the state is defined by   The fish is initially placed in the mid-line of the swimming area with its initial distance between the head and the cylinder randomly varying from 1.5L to 2.5L. Figure 16 shows the traces of the head during different learning stages and the total number of periods the fish maintains in the swimming area for all episodes considered. Figure 16a shows the traces of the head at different learning stages. At episode 100, the fish is not able to hold position and swims out of the area instantly. At episode 600, the fish hold position for several periods but swims out of the area finally. At episode 1980, the fish is able to hold position in a small area for more than 200 periods. As shown in Fig. 16b, in the first approximately 500 episodes, the total number of swimming periods increases rapidly, indicating the fish is learning to hold position. After approximately 500 episodes, the fish is able to hold position in the Kármán vortex street for more than 200 periods, indicating the fish has found an effective swimming policy. Once an efficient swimming policy is achieved, 100 simulations of the swimmer swimming in the wake are conducted. The head location of the swimmer is recorded and shown in Fig. 17a Fig. 17b. It is found that the fish tends to hold position in a small area within the vortex street. Compared with the Kármán gaiting area observed by Liao 50 in live rainbow trout, the simulation-predicted area where the swimmer tends to stay overlaps the majority of part of that observed in experiment. The averaged undulation kinematics for 50 successful cases are shown in Table 2 with comparison to the experimental observation of Liao 50 . The Reynolds number of the cylinder in our simulation is 300 compared with 18,000 in the experiment. The resultant tail-beat frequency agrees quite well with that of the experiment but the tail tip amplitude is slightly lower. The undulation wave speed is also slightly slower than that in the experiment due to the smaller wavelength. It should be noted that the typical wake of the D-shaped cylinder in the high Reynolds numbers as observed in experiment is the well-organized turbulence vortex street (see Ref. 17 ), which is the foundation of the successful comparison between experiment and simulation as shown in Fig. 17.
The vorticity contours and the pressure distributions on the fish surface at different instants when the fish is holding its position in the vortex wake are shown in Fig. 18. It is found that there are at least three vortices that are interacting with the body at any instant. At tU/L = 54.4 ( Fig. 18a and b), vortex 1 is at the left side of the tail and vortex 3 is at the left side of the head, generating high leftwards suction force. Meanwhile, vortex 2 is at the right side of the middle body, generating high rightwards suction force to balance the leftwards suction force at the head and tail. At tU/L = 55.04 (Fig. 18c and d), the tail is moving from left to right, which leads to a suction force at the left side of the tail, generating leftwards and head-wards thrust. The fish has to balance this leftwards force with its muscles. Meanwhile, vortex 2 has moved to the right side of the tail, inducing a high suction force at the right side which balances the suction force at the left side. In addition, vortex 3 has moved to the left side of the middle body, inducing a backwards drag and a leftwards force which balance the head-wards and rightwards force induced by vortex 4. This facilitates the motion of the tail. At tU/L = 55.52 ( Fig. 18e and f), vortex 3 has moved to the posterior body, inducing a leftwards and a backwards force which balance the head-wards force induced by vortex 5 and the rightwards force induced by vortex 4. At tU/L = 56 ( Fig. 18e and  g), the tail is moving from right to left. The leftwards force induced by vortex 3 facilitates this movement. The vortex position is similar to the situation at instants tU/L = 55.04 while the leftwards force induced by vortex 3 and vortex 5 is balanced by the rightwards force generated by vortex 4 and the tail movement. The backwards force induced by vortex 3 is balanced by the head-wards force induced by vortex 5 and the tail movement. To summarize, the fish can use the vortices to achieve balance and save energy so as to efficiently hold position in the Kármán vortex street.
However, the fish could occasionally get trapped in the low pressure area of the vortex center. If the swimmer is not able to properly synchronize with the vortices, it would move downstream with the vortex and lose its stability. In this case, the fish must find a way to escape from the vortex in order to hold position for a long period. Figure 19 shows the strategy of how the fish escapes from the vortices. At tU/L = 86.88 ( Fig. 19a and b), 87.2 ( Fig. 19c and d) and 87.52 ( Fig. 19e and f), the fish is in close proximity to a left side vortex (vortex 1) which induces a high suction force on the left side body. In order to escape from the vortex, the fish performs a fast high-amplitude leftwards flapping to generate high rightwards forces on the tail. At tU/L = 87.84 ( Fig. 19g and h) and 88.15 ( Fig. 19i and j), the fish sweeps its tail back to the central area of the vortex street and the leftwards suction force of vortex 1 is partly balanced by the rightwards suction force of vortex 2. Afterwards, the tail is slowly moving from the left side to the right side to avoid generating high leftwards force on the tail.

Conclusions
The fish adaption behaviors in complex environments have been numerically studied. A recurrent Q-network is first coupled with an immersed boundary-lattice Boltzmann method to simulate the adaption behaviors of a fully self-propelled smart swimmer. Three different behaviors are studied with this swimmer: point-to-point swimming in a quiescent flow, rheotaxis swimming in uniform flow and position holding swimming in a Kármán vortex street. The swimmer utilizes only the position, velocity or acceleration information extracted from the environment to learn to achieve specific tasks. By considering the historical information, the swimmer learns suitable policies to achieve different tasks, demonstrating that deep reinforcement learning is able to extract useful characteristics from flow structures with various complexities. During the point-to-point swimming, the fish performs rapid turning to face the target and then swims directly to it with different initial distances and orientation angles. During rheotaxis swimming, the fish rapidly aligns its body with the uniform flow and holds position for more than 200 periods. Two types of wake vortex patterns are identified for rheotaxis swimming. The vortex patterns produce jet flow in different directions in the wake to facilitate a dynamic balance of the hydrodynamic forces. During position holding in a Kármán vortex street, the fish utilizes the ambient vortices to achieve balance and save energy. The robust position holding in the Kármán vortex street only happens in a specific flow area which is in reasonable agreement with the experimental observation of Liao 50 . Highly asymmetrical corrective undulation is performed when fish is trapped in the vortices, which enables the fish to escape from the vortex center and hold its position or maintain its stability.