Leader–follower UAVs formation control based on a deep Q-network collaborative framework

This study examines a collaborative framework that uses an intelligent deep Q-network (DQN) to regulate the formation of leader–follower Unmanned Aerial Vehicles (UAVs). The aim is to tackle the challenges posed by the highly dynamic and uncertain flight environment of UAVs. We first develop a dynamic model that captures the joint state of the UAV system, including the relative positions, heading angles, rolling angles, and velocities of the nodes in the formation. We then describe the collaborative operation of the UAVs within the conceptual framework of a Markov Decision Process (MDP) and apply Reinforcement Learning (RL) to it. On this basis, a fundamental framework is presented for addressing the control problem of UAVs with the DQN scheme. This framework includes an action selection technique known as ε-imitation, as well as the algorithmic specifics. Finally, the efficacy and portability of the DQN-based approach are validated by numerical simulation. The average reward curve converges satisfactorily, and the kinematic link between the nodes inside the formation satisfies the essential requirements of the controller design.

Unmanned aerial vehicles (UAVs) have been widely applied in various fields. UAVs can be combined with existing technologies to form more intelligent and efficient solutions that meet the needs of modern society. Aerial photography, unmanned warehousing, express logistics, rescue, and other fields have begun to explore the application of UAVs, and more fields are expected to exploit their potential in the future 1.
The integration of UAVs and artificial intelligence (AI) algorithms through the advancement of unmanned cluster technology has enabled missions to be accomplished with enhanced efficiency and intelligence. AI algorithms can provide intelligent decision-making and control capabilities for UAVs 2,3. For example, deep learning enables autonomous perception and target recognition, so that UAVs can avoid obstacles and identify targets on their own. With Reinforcement Learning (RL) algorithms, UAVs can explore and learn autonomously in an unknown environment, achieving more flexible and intelligent task execution. Combining UAVs with AI algorithms makes collective intelligence and collaborative operation possible, and improves the task execution efficiency, the ability to deal with complex environments, and the autonomous decision-making capabilities of UAVs. This has important application value for scenarios that require large-scale and complex task execution, such as disaster relief, agricultural plant protection, and logistics distribution.
To achieve autonomous planning of UAVs in response to dynamic environmental conditions and to facilitate collaboration towards mission objectives, Xu et al. implemented a novel multi-agent reinforcement learning (MARL) framework. The framework adopted centralized training coupled with decentralized execution, and an Actor-Critic network was used to determine the action to execute and then evaluate it. The new algorithm implemented three significant enhancements derived from the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) method. Simulation results demonstrated a clear improvement in learning efficiency and in the operational safety factor in comparison to the preceding algorithm 4. Hossein et al. proposed a system that employs Deep Reinforcement Learning (DRL) to tackle the difficulties associated with autonomous waypoint planning, trajectory tracking, and trajectory creation for multi-rotor UAVs. The framework incorporated a DRL agent implemented with an actor-critic architecture, enabling dual-channel continuous control of the UAV, specifically of the roll and speed. The training experiments demonstrated significant enhancements in terms of convergence speed, convergence effect, and stability 21. Xu et al. advocated the DDPG algorithm for autonomous morphing control and the MADDPG algorithm for cluster cooperative operations, thereby enabling autonomous and cooperative fighting capabilities in morphing UAV clusters. The objective was to achieve intelligent cluster combat and gain control over the future air combat initiative 22. Tožička et al. devised a control system for a fleet of several UAVs by leveraging insights from the latest advancements in deep reinforcement learning. The control policy was selected to be a deep Convolutional Neural Network (CNN) with a linear output layer, a choice based on the extensive range of applications that CNNs had demonstrated. The control policy was trained in a simulated environment including five UAVs. It was then implemented and executed on a fleet of five DJI Mavic Pro drones, yielding satisfactory results 23.
The UAVs discussed in this paper use a leader-follower framework, in which a single UAV assumes the role of the leader while the other UAVs act as followers, collectively forming a formation. Given the control design requirements, the follower must consistently maintain its relative distance to the leader and adapt to any associated variations in parameters, even when the leader maneuvers. Considering the modeling characteristics of collaborative control problems in leader-follower UAVs, a DRL-based algorithm is well suited to the challenges posed by the dynamic nature of UAVs and the uncertainties present in their flight environment. The DQN-based approach, which is widely employed in the field of DRL, can acquire intricate decision-making policies without any prior knowledge. Moreover, it can efficiently improve performance in large state spaces. This study focuses on the development of a cooperative control algorithm for UAVs based on the features of their flight environment. In the Method section, we first describe the environment in which the UAVs operate. The dynamic model represents the joint state of an intelligent UAV system and encompasses the velocity, relative locations, heading angles, and rolling angles of both the leader and the followers. Furthermore, the collaborative control process of leader-follower UAVs is characterized as a Markov decision process (MDP) model, including its action space, state space, and reward function. Then, we describe the fundamental structure of the control problem for leader-follower UAVs using the DQN approach. The proposed approach includes the ε-imitation strategy for action selection, the construction of the Q-network, and the specification of algorithm details. The suggested DQN-based method is then subjected to numerical simulation tests in order to validate its convergence and portability. The main contribution of this study is a collaborative framework that uses an intelligent deep Q-network to regulate the formation of leader-follower UAVs. This framework addresses the challenges posed by the highly dynamic and uncertain flight environment of UAVs. The study proposes a novel intelligent control strategy for the cooperative control of UAVs based on a DQN algorithm. The joint system state encompasses the relative position, heading angle, and rolling angle in the formation of UAVs, as defined in the environmental context. Additionally, the study adapts an MDP model to depict the collaborative control process of UAVs, using the fundamental scheme of RL. The simulation results show reasonable convergence in the collaborative control process. This work provides a new perspective on the UAV collaborative control problem, and the control strategy trained in the numerical simulation environment can be directly transferred to a hardware-in-the-loop simulation system without much parameter adjustment.

Motion equation of the UAV
Given the assumption that the UAV maintains a constant altitude, the mathematical representation of the system can be reduced to four independent variables, simplifying the model to four degrees of freedom. To compensate for the loss caused by this simplification and to account for environmental disturbances, randomness is introduced into each sub-state, such as roll and airspeed. The resulting stochastic UAV kinematics model is given as Eq. (1), where x, y, ψ, φ and V are the position, the heading angle, the rolling angle, and the velocity of the UAV, respectively. a_g is the gravitational acceleration. η_x, η_y and η_ψ represent the disturbances of the state variables. The disturbances follow normal distributions, whose mean values and variances are provided in Table 1. f(φ, φ_d) represents the relationship between the rolling angle φ and its desired value φ_d. A second-order system response is used to simulate the dynamic response of the UAV rolling channel.
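Eq. (1) is not reproduced in this text. As a minimal sketch, the Python function below assumes the constant-altitude coordinated-turn form that matches the symbols defined above: dx/dt = V cos ψ + η_x, dy/dt = V sin ψ + η_y, dψ/dt = (a_g / V) tan φ + η_ψ, with the roll channel f(φ, φ_d) modeled as a second-order response. The natural frequency `wn`, damping ratio `zeta`, step size `dt`, and the noise standard deviations are illustrative assumptions, not values from the paper.

```python
import math
import random

A_G = 9.81  # gravitational acceleration a_g in m/s^2


def step_kinematics(state, phi_d, v, dt=0.1, wn=2.0, zeta=0.7,
                    noise_std=(0.0, 0.0, 0.0), rng=None):
    """Advance one UAV by one Euler step of the assumed stochastic model.

    state: (x, y, psi, phi, phi_rate); angles in radians.
    phi_d: desired rolling angle; v: airspeed in m/s.
    noise_std: assumed standard deviations of eta_x, eta_y, eta_psi.
    """
    rng = rng or random.Random()
    x, y, psi, phi, phi_rate = state
    # Zero-mean Gaussian disturbances, held constant over the step (simplification).
    eta_x, eta_y, eta_psi = (rng.gauss(0.0, s) for s in noise_std)

    x_next = x + (v * math.cos(psi) + eta_x) * dt
    y_next = y + (v * math.sin(psi) + eta_y) * dt
    psi_next = psi + (A_G / v * math.tan(phi) + eta_psi) * dt

    # Second-order roll response toward phi_d (hypothetical wn and zeta).
    phi_acc = wn ** 2 * (phi_d - phi) - 2.0 * zeta * wn * phi_rate
    phi_rate_next = phi_rate + phi_acc * dt
    phi_next = phi + phi_rate_next * dt
    return (x_next, y_next, psi_next, phi_next, phi_rate_next)
```

With zero noise and zero roll the UAV flies straight; a positive roll command makes the rolling angle, and then the heading rate, grow over successive steps.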

UAVs system model
The leader-follower control method is a significant technical strategy for ensuring the coherence and effectiveness of collaborative control systems 24,25. The leader is a special member of the formation. As the member assuming leadership, it has the authority to determine and direct the trajectory of the formation, so that the trajectory is not affected by external influences. The follower, in contrast, is not required to perceive the target data of the formation and relies solely on the information provided by the leader 26. Hence, a drawback of the leader-follower control approach is the inherent independence between the leader and the follower, which makes it challenging to obtain feedback on tracking errors from the follower.
In order to depict the relative positioning of the leader-follower arrangement of UAVs, a coordinate system is developed with the follower UAV serving as the reference point, as seen in Fig. 1 27.
In Fig. 1, the coordinates X_L and Y_L represent the inertial system of the leader. X_F and Y_F represent the inertial system of the follower. In the follower's velocity coordinate system, x_F and y_F represent the relative distances between the leader and the follower. V_L and V_F represent the velocities of the leader and the follower. ψ_VL and ψ_VF denote the heading angles of the leader and the follower, respectively.
The UAV is controlled by changing the roll angle setpoint. The control strategy updates the roll command once per second, while the autopilot executes the low-level closed-loop control within that interval 28. Based on Fig. 1, the state S of the UAVs, which depicts the relative link between the leader and the follower, can be represented as S = {S_1, S_2, S_3, S_4, S_5, S_6} 29, where S_1 is the difference of the heading angles between the leader and the follower. S_3 and S_4 are the leader's rolling angle and its desired value, and S_2 represents the follower's rolling angle. S_5 and S_6 are the differences of the relative position in the x and y directions between the leader and the follower. During actual flight operations, the control commands issued by the leader will be modified in response to the prevailing conditions on the battlefield. To enhance the adaptability of the model to dynamic input uncertainty, the control instructions are either constant or randomly produced through user functions in the Results section.
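The joint state can be assembled from the individual UAV states as sketched below. This is an illustration under stated assumptions: the ordering of S_2 to S_4, the sign convention (leader minus follower), and the rotation of the position difference into the follower's frame are choices made here for concreteness, and the dictionary keys are hypothetical names.

```python
import math


def wrap_angle(angle):
    """Wrap an angle in radians to the interval (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def joint_state(leader, follower):
    """Build (S1..S6) from leader and follower states given as dicts.

    leader keys: x, y, psi, phi, phi_d; follower keys: x, y, psi, phi.
    """
    dx = leader["x"] - follower["x"]
    dy = leader["y"] - follower["y"]
    c, s = math.cos(follower["psi"]), math.sin(follower["psi"])
    s1 = wrap_angle(leader["psi"] - follower["psi"])  # heading difference
    s2 = follower["phi"]                              # follower rolling angle
    s3 = leader["phi"]                                # leader rolling angle
    s4 = leader["phi_d"]                              # leader desired rolling angle
    s5 = c * dx + s * dy                              # relative x in follower frame
    s6 = -s * dx + c * dy                             # relative y in follower frame
    return (s1, s2, s3, s4, s5, s6)
```

For example, a leader 10 m directly ahead of a follower yields S_5 = 10 and S_6 = 0 regardless of the follower's absolute heading.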

MDP model for UAVs collaborative control
Based on the above, the control of UAVs can be characterized as a multi-step decision-making problem. In essence, this problem entails selecting a suitable control command for the roll angle and determining the optimal timing for executing and releasing the command decisions. This work presents an approach to the collaborative control of UAVs that uses an intelligent and efficient control mechanism. The activities of the UAVs are reinterpreted within the context of an MDP. The fundamental MDP paradigm is depicted in Fig. 2.
A discrete MDP can be represented by a five-tuple {S, A, R, P, J}. The state space S is partitioned based on the attitude and relative position of the leader and followers. The action space A consists of the control instructions for the follower's rolling angle. R represents the return values associated with actions and states. P represents the transition probability between states, and J represents the optimization objective function of the control decision. The properties of a discrete MDP are as follows,
where p_ij(a_k) represents the probability of transitioning from state s_i to state s_j when the action a_k is executed in the given state.
The learning effect of the formation controller in the discrete MDP model is directly influenced by the range and precision of the discrete parameters in the state space S. The state space parameters of the formation MDP model in the UAV combat process include the relative position and attitude between the leader and the follower. The other four elements A, R, P, J of the MDP model are primarily developed based on the intended mission target. The action space A contains the rolling angle commands. The reward function R is built from the distances between the real-time positions of the members of the formation. The transition probability P depends on the location of the UAV after the action is executed. The objective function J represents the total return value. Under the action selection strategy, J* represents the optimal return value, where γ ∈ (0, 1) is the return discount factor and r_t is the return value at time t.
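The expression for J is not reproduced in this text. Assuming the usual discounted form J = Σ_t γ^t r_t, which is consistent with the discount factor γ and the per-step return r_t defined above, the total return of one episode can be computed as in this short sketch (the function name is hypothetical).

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t], accumulated backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

For instance, three rewards of 1 with γ = 0.5 give 1 + 0.5 + 0.25 = 1.75; a smaller γ weights near-term distance errors more heavily than later ones.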

State space
The state of the UAVs can be represented by a multidimensional array. Formulating the collaborative control problem in the leader-follower topology requires careful consideration of the relative link between the leader and the follower. Factors such as heading difference and distance play a pivotal role in shaping the control strategy. The system state is used to represent the state space, which describes the pose relationship and relative spatial location between the leader and the follower.
In real engineering applications, the control command of the leader is determined by the flight control system based on the relative position relationship. The primary focus of this study is the development of a collaborative control architecture. To enhance the model's ability to handle diverse inputs, a random function is used to generate the control instruction of the leader during the training of the DQN. This approach imitates the inherent uncertainty of the system input. The state space of the MDP scheme for UAVs can be denoted as S = {S_1, S_2, S_3, S_4, S_5, S_6}, as stated in Eq. (2).

Action space
The UAV is controlled by changing its rolling angle. The control strategy updates the control command once per second, and the low-level closed-loop control is executed by the autopilot within this interval. The action space consists of the rolling angle commands of the follower UAV, taking into account the UAV's maximum acceleration and the need to prevent abrupt changes in control commands that could disrupt the flight. On the one hand, it is advantageous for the followers to closely track the current movement of the leader. On the other hand, it is imperative to limit the inherent instability of the UAV structure.
The set of possible actions of the followers, denoted as A, is defined in terms of φ_max, the upper limit of potential actions for the rolling angle of the followers. The expected action for the subsequent time step is determined from a_φ, which is selected based on the control demand of the followers, and φ_bd, which represents the thresholds associated with the follower's rolling angle.
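The two action equations are not reproduced in this text, so the following is only a plausible reading: the action a_φ is a discrete roll increment drawn from {−φ_max, 0, +φ_max}, and the next roll command is the current one plus a_φ, clipped to ±φ_bd. The numeric values of `PHI_MAX` and `PHI_BD` are assumptions chosen for illustration (the 20° bound echoes the roll range reported in the results).

```python
PHI_MAX = 10.0  # assumed largest roll increment per decision step, degrees
PHI_BD = 20.0   # assumed bound on the follower's roll command, degrees

# Assumed discrete action set A: decrease, hold, or increase the roll command.
ACTIONS = (-PHI_MAX, 0.0, PHI_MAX)


def next_roll_command(current_cmd, a_phi, bound=PHI_BD):
    """Apply increment a_phi to the current command and clip to [-bound, bound]."""
    return max(-bound, min(bound, current_cmd + a_phi))
```

Clipping keeps the command inside the threshold, and the bounded increment prevents abrupt changes between consecutive one-second decisions.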

Reward function
To maintain the configuration, each node inside the formation must keep a safe distance from its neighboring nodes. Insufficient space between nodes may result in collisions between neighbors, whereas a large distance increases the communication delay and gives rise to additional malfunctions 30. Figure 3 illustrates a collision avoidance and reward evaluation scheme based on the intended high-reward region and the range (d_O, d_I) between the leader and the following UAVs. Each node receives a reward value from the leader based on the proximity to its neighboring nodes, and the members of the group adjust their states according to the reward values provided. The construction of an appropriate reward function is crucial in reinforcement learning. The cost function for collaborative control of UAVs is formulated and the reward function is defined accordingly 31. The reward function primarily takes into account the distance between UAVs, as depicted in Fig. 3. The reward limits ensure that the followers remain within the prescribed distance of the other UAVs once the action has been executed. In the reward function, r represents the immediate reward. The inner radius and outer radius in Fig. 3 are denoted as d_I and d_O, respectively. D denotes the spatial separation between the follower and the circular object. The adjust factor,

Fundamental framework
In the UAV formation, the followers obtain the relevant system state information from the leader. The action is chosen by the action selection strategy. The value of the reward function is then calculated from the updated system state information obtained after the selected action is executed. The quality of each action is assessed using the real-time rewards obtained by the UAVs, and the cumulative return is optimized. Within this framework, the Q-learning algorithm stores and estimates the action value function of the follower in the various states of the MDP model. It also uses the real-time system state information provided by the leader to iteratively update the action value function. This iterative process solves the optimal sequential decision-making problem of the follower actuator. The value function estimate Q(s_t, a_t) of the action a_t executed by the follower in state s_t is defined with s_0 as the initial state of the UAVs and a_0 as the first action of the follower.
Based on the pertinent theory of operations research, Q(s_t, a_t) satisfies the Bellman equation, where p(s_t, a_t, s_{t+1}) is the probability of the transition from state s_t to state s_{t+1} under action a_t, and r(s_t, a_t, s_{t+1}) represents the return value of that transition.
The optimal strategy of Q-learning maximizes the cumulative return value. In the field of RL, agents engage in ongoing interactions with their environment through a process of trial and error. The objective of this iterative process is to acquire an optimal strategy that maximizes the cumulative reward obtained from the environment 32. In the Q-learning method, determining the Q-value function enables the identification of an optimal strategy. This is achieved with the greedy approach, where the agent selects the action with the maximum Q-value at each time step. The Q-learning technique is commonly employed and very straightforward to implement. Nonetheless, it still faces the curse of dimensionality. The approach commonly uses a tabular representation for storing Q values, which makes it unsuitable for reinforcement learning problems with high-dimensional or continuous state spaces 33.
Using deep neural networks (DNNs) as function approximators for estimating Q values has emerged as a viable approach for addressing this challenge 34. Mnih et al. combined a CNN with experience replay to implement a Q-learning algorithm, which became known as the DQN method 35. To mitigate the instability of the neural network approximation, a distinct target network was employed to generate the target Q values. This approach aimed to minimize the correlation between the predicted Q value, which is the output of the main network, and the target Q value, which is the output of the target network.
Based on Eq. (8), the optimal policy is constructed once the maximum Q-function has been obtained. By employing the recursive framework, the Q-function can be iteratively updated 36, where α represents the learning rate.
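The update equation itself is not reproduced in this text. The sketch below assumes the standard Q-learning rule, Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)], applied to a tabular Q function; the function name and the dictionary-based table are illustrative choices rather than the paper's implementation.

```python
from collections import defaultdict


def q_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update; q maps (state, action) to a value."""
    best_next = max(q[(s_next, b)] for b in actions)  # greedy bootstrap value
    td_error = r + gamma * best_next - q[(s, a)]      # temporal-difference error
    q[(s, a)] += alpha * td_error
    return q[(s, a)]


# Usage: unseen (state, action) pairs default to a Q value of 0.
q_table = defaultdict(float)
q_update(q_table, s=0, a=1, r=-1.0, s_next=1, actions=[0, 1], alpha=0.5)
```

The table grows with every distinct discretized state, which is the scaling problem that motivates replacing it with a neural network.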
The target Q value is computed by the target network, where θ⁻ denotes the parameters of the target network.
The loss function to be minimized is built from this target. In the DQN, the difference between the Q value estimated by the main network and the Q value output by the target network is used to adjust the parameters of the main network. In contrast to the parameters of the main network, which are updated in real time, the parameters of the target network are updated by copying the main network parameters at regular intervals of K time steps.
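The target and loss equations are missing from this text. Assuming the standard DQN definitions, with target y = r + γ max_a' Q(s', a'; θ⁻) and a mean squared error between y and Q(s, a; θ), the computation on one sampled batch can be sketched as follows. The function names are hypothetical, and the Q values are passed in as plain lists rather than produced by a network.

```python
import copy


def td_targets(rewards, next_q_target, gamma):
    """y_i = r_i + gamma * max_a' Q_target(s'_i, a') for each sample."""
    return [r + gamma * max(q_row) for r, q_row in zip(rewards, next_q_target)]


def mse_loss(q_main, actions, targets):
    """Mean squared error between targets and the main network's Q(s_i, a_i)."""
    errors = [(y - q_row[a]) ** 2 for q_row, a, y in zip(q_main, actions, targets)]
    return sum(errors) / len(errors)


def sync_target(step, k, main_params, target_params):
    """Copy the main parameters into the target network every k steps."""
    if step % k == 0:
        return copy.deepcopy(main_params)
    return target_params
```

Between synchronizations the targets are computed from frozen parameters, which is what decorrelates the target from the prediction being trained.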
In this part, we propose a control strategy for collaborative control of UAVs based on the DQN algorithm. The DQN algorithm is an adaptation of Q-learning that integrates reinforcement learning with artificial neural networks 35. To mitigate the instabilities that arise from approximating the action value function (Q function) with neural networks, the DQN approach uses a separate, periodically updated target Q-network and an experience replay mechanism. The DQN algorithm has been applied successfully in several sectors, including agriculture, communication, healthcare, and aerospace engineering 37. The structure of the control algorithm based on DQN is depicted in Fig. 4.
As depicted in Fig. 4, the followers are the agents within the RL framework. The agents learn the control policy and modify the network parameters through ongoing interaction with the environment. The followers receive both the state message of the leader and their own state information. These are combined to generate a joint system state S, which is fed as input to the DQN. The action selection policy, referred to as ε-imitation, where ε indicates the exploration ratio, determines the follower's rolling angle based on the DQN's output. The action instructions issued by the leader and the followers are used as inputs to the kinematics model of the UAVs to determine the states of both the leader and the followers at the subsequent time step. The value of the reward function R and the system state S′ at the next time step can also be obtained. The tuples (S, A, R, S′) collected throughout the interaction process are stored in the experience pool. During each iteration, the experience pool is sampled at random, and the network parameters of the DQN are then updated. Once the predetermined number of time steps is reached in each round, the ongoing episode concludes and the subsequent episode commences.
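The experience pool described above can be sketched as a fixed-capacity buffer with uniform random sampling. The class name, the capacity handling (oldest entries are discarded first), and the seed argument are assumptions for illustration.

```python
import random
from collections import deque


class ExperiencePool:
    """Fixed-size store of (S, A, R, S') transitions with uniform sampling."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out first
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        """Draw up to batch_size transitions without replacement."""
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)
```

Sampling at random rather than in time order breaks the correlation between consecutive transitions of the same flight segment.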

Action strategy
In order to enhance the learning efficiency of the DQN during the training phase, a novel action selection approach called ε-imitation is proposed. This method combines the ε-greedy strategy and the imitation strategy, aiming to strike a balance between exploration and exploitation in the learning process 31. In the imitation method, the follower selects its control instruction based on the control demands related to relative distance. The essential concept of this method is that when followers make a selection from the action space with a chance of 1 − ε, the chosen action is determined by the expected relative distance between the leader and the follower, as outlined in Eq. (2). When the separation between the leader and the follower falls outside the designated safety limits (d_O, d_I), a derivative action A_max is selected from the available set of actions in order to mitigate the danger of collision. When the relative distance is within the safe range, the follower should maintain its present condition, which corresponds to a zero action. Using the ε-imitation action selection strategy for topology maintenance in UAV flight has several advantages, including the reduction of follower blindness during the initial exploration period. Additionally, this approach reduces the occurrence of invalid explorations, increases the number of positive samples within the experience pool, and improves training efficiency.
The ε-imitation action selection strategy is given in Algorithm 1 37.
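Algorithm 1 is not reproduced in this text, so the following is a hedged sketch rather than the authors' procedure. It assumes that the imitation rule replaces the random branch of ε-greedy and is taken with probability ε, with the greedy Q-network action used otherwise; the description above words the probabilities differently, so the branch assignment should be treated as an assumption. The sign of the corrective action is also left out: the rule simply returns `a_max` outside the safe range and zero inside it.

```python
import random


def imitation_action(distance, d_inner, d_outer, a_max):
    """Distance-based rule: corrective action outside the safe range, else hold."""
    if distance < d_inner or distance > d_outer:
        return a_max
    return 0.0


def epsilon_imitation(q_values, actions, distance, d_inner, d_outer,
                      a_max, epsilon, rng=random):
    """Pick the imitation action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return imitation_action(distance, d_inner, d_outer, a_max)
    best = max(range(len(actions)), key=lambda i: q_values[i])
    return actions[best]
```

Under this reading, early episodes (ε near 1) are dominated by the distance rule, which fills the experience pool with useful samples before the Q-network takes over.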

DQN method
The Q function in the DQN framework is estimated by a neural network, known as a Q-network, which is characterized by its weight parameters θ. In order to evaluate the Q value, a fully connected Q-network is constructed, as depicted in Fig. 5.

Results and discussion
The training is carried out in MATLAB and comprises 50,000 episodes in total. In each episode, the simulation time is 60 s. Before the formal training, a pre-training of 200 episodes is conducted to collect experience data for batch training. During the training process, the exploration probability ε decreases linearly from the initial value of 1 to the minimum value of 0.1 over 10,000 episodes, and the update period K of the target network gradually increases from 1000 to 10,000 during the initial 1000 episodes.
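The two schedules described above can be written down directly. The sketch uses Python for illustration only (the training itself is reported to run in MATLAB), and the linear shape of the ramp for K is an assumption, since the text only says that K increases gradually.

```python
def epsilon_schedule(episode, eps_start=1.0, eps_min=0.1, decay_episodes=10_000):
    """Exploration probability: linear decay, then constant at eps_min."""
    frac = min(episode / decay_episodes, 1.0)
    return eps_start + frac * (eps_min - eps_start)


def target_update_period(episode, k_start=1_000, k_end=10_000, ramp_episodes=1_000):
    """Target-network update period K: assumed linear ramp, then constant."""
    frac = min(episode / ramp_episodes, 1.0)
    return round(k_start + frac * (k_end - k_start))
```

Halfway through the decay, at episode 5000, ε is 0.55, while K has already reached its final value of 10,000.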
The simulation scenario consists of a UAV formation with one leader and two followers that monitor the configuration 38. The parameters pertaining to the dynamics of the UAVs are presented here.
Table 1 provides an overview of the specific parameter configurations for the DQN algorithm. Table 2 displays the physical characteristics of both the leader and the follower.
To assess the efficacy of the DQN-based method used in this study, an average reward R_Avg was established as the evaluation criterion. R_Avg is formally defined following 31, where r represents the immediate reward in Eq. (7). As shown in Fig. 6, after the curves converge, the average rewards hover around a value of −7.
The results for the episode with the highest reward are shown as follows. Figure 7 shows the trajectory curves and the distance changes in different directions for the leader and a pair of followers. The curves indicate that the followers consistently converge towards the trajectory of the leader in both the X and Y directions.
Figure 8 depicts the distance-time profiles of the leader and the followers. The reward function establishes a safe distance range of 40-60 m between the leader and the followers. Moreover, the followers also maintain a safe relative distance from each other, and there is no risk of collision during the simulation.
Figure 9 shows that the heading angles of the followers change with that of the leader. As the heading angle of the leader changes, the followers gradually adjust their heading angles and fly with the leader along a similar trend. After 45 s of simulation time, the heading angle changes of the leader and the followers tend to be consistent.
Figure 10 illustrates the changes in the rolling angle of the followers and its desired value. In order to fly with the leader within a safe distance, the rolling angle commands of the followers are adjusted continuously according to the action space and the ε-imitation action selection strategy. As shown in this figure, the followers' rolling angles and commands change with the leader's roll angle and stay within a range from −20° to 20°.

Conclusion
The objective of this study was to devise an innovative approach for addressing the challenges posed by nondeterminacy, non-linearity, systematic error, and disturbances in the modeling of UAVs. A novel intelligent control strategy was proposed for the cooperative control task of UAVs, based on a DQN algorithm. The joint system state encompassed the relative position, heading angle and rolling angle in the formation of UAVs, as defined in the environmental context. Subsequently, an MDP model was adapted to depict the collaborative control process of UAVs, using the fundamental scheme of RL. Afterwards, the complete DQN algorithm was presented, encompassing the fundamental structure, the ε-imitation action selection technique, and a full account of the algorithm. In order to assess the effectiveness and practicality of the DQN control technique described in this research, a simulation experiment was conducted.
Based on the outcomes of the simulation, the DQN-based algorithm performs well in the cooperative control of UAVs. The average total reward profile converges satisfactorily, indicating that the collaborative controller design meets the necessary requirements for the relative kinematic link among the different nodes in the formation. Subsequent research will expand towards the development of a high-fidelity hardware-in-the-loop simulation system. This system will be designed to assess the efficacy and adaptability of the DQN-based method.
Table 2. Arguments of the leader and the follower.

Figure 1. The relationship between the leader and the follower within the inertial coordinate system.

Figure 2. The framework of collaborative control for UAVs based on MDP.

Figure 3. The scheme for collision avoidance in formations of UAVs.

Figure 4. The framework for collaborative control of UAVs based on the DQN control algorithm.

Figure 5. The conceptual structure of a Q-network.

Figure 6 illustrates the fluctuations in the average rewards across all training episodes. The results indicate that the average rewards converge after around 10,000 training episodes for a pair of followers.

Figure 6. The variation of the average rewards.

Figure 9.

Figure 10. The comparison of rolling angle and its desired value for the leader and the followers.

Table 1. Arguments of the DQN method.
ψ's mean value and variance: η_ψ, σ_ψ

In the Q-network of Fig. 5, the state of the UAVs at time step t is accepted by the input layer. Every node in the output layer corresponds to the Q value of a specific action in the set of all possible actions. The network comprises a single input layer, two hidden layers, and a single output layer. The dimensions of the hidden layers are set to 40 × 40, and the training function is specified as Variable Learning Rate Gradient Descent. UAV coordination control is accomplished by using the DQN algorithm. Figure 4 illustrates the training process, and the key implementation elements of the collaborative control method based on DQN are given in Algorithm 2.
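The network shape described above can be sketched as a forward pass in plain Python. The six inputs follow the joint state S_1 to S_6, and "40 × 40" is read here as two hidden layers of 40 units each. The ReLU activation, the linear output layer, the three output actions, and the weight initialization are assumptions, since the text does not specify them.

```python
import random


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, relu):
    """Affine map followed by an optional ReLU."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out


def build_q_network(n_state=6, hidden=(40, 40), n_actions=3, seed=0):
    rng = random.Random(seed)
    sizes = [n_state, *hidden, n_actions]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def q_values(network, state):
    """Return one Q value per action for the given joint state."""
    x = list(state)
    for i, layer in enumerate(network):
        x = dense(x, layer, relu=(i < len(network) - 1))  # linear output layer
    return x
```

Calling `q_values(build_q_network(), state)` returns a list with one entry per action, and the action selection strategy then chooses among these entries.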