Synthetic aircraft RS image modelling based on improved conditional GAN joint embedding network

Building efficient, high-quality remote sensing (RS) simulation technology that can model different aircraft RS images at volume is challenging. Generative models serve as a natural and convenient simulation method. Because aircraft types are fine-grained classes under a coarse class, feature entanglement may occur when multiple aircraft classes are modelled. To address this issue, we propose a first-generation realistic aircraft type simulation system (ATSS-1) based on RS images. It achieves fine modelling of seven aircraft types in real scenes by combining an adaptive weighted conditional attention generative adversarial network with a joint geospatial embedding (GE) network. An adaptive weighted conditional batch normalisation attention block resolves the subclass entanglement by reassigning the intra-class-wise characteristic responses. An asymmetric residual self-attention module then establishes a remote region asymmetric relationship to mine a finer potential spatial representation. The GE network explores the mapping between the input RS scene and the potential space of the generated samples, using the selected prior distribution z as an intermediate representation. A public RS dataset (OPT-Aircraft_V1.0) and two public datasets (MNIST and Fashion-MNIST) were used to test the simulation model. The results demonstrate the effectiveness of ATSS-1 and promote further development of realistic automatic RS simulation.

1. In ATSS-1, a novel GAN model (an adaptive weighted conditional attention generative adversarial network, AWCA-GAN) is integrated with an improved embedding network (geospatial embedding network, GE) to form one system that correlates with the real RS scene and simulates RS data without complex manual manipulation.
2. An adaptive weighted conditional batch normalisation attention mechanism module (AWCBNA) is presented in AWCA-GAN. It applies intra-class-weighted feature statistics to recalibrate category feature responses adaptively for the conditional parameters. The AWCBNA block enables the network to emphasise the category features selectively and enhances representation learning across types, alleviating the entanglement of subclass features.
3. An asymmetric residual self-attention module (ARSA) is added to AWCA-GAN. It captures the spatial geometry and spectral information by establishing a remote region asymmetric relationship to obtain a more robust feature representation.
4. An efficient embedding network, called the geospatial embedding network (GE), is investigated. Through the trained AWCA-GAN model, it maps a real RS scene into the latent space of the simulated target, using a vector drawn from the prior distribution z as the intermediate representation.
5. In the experiment, we collect an aircraft dataset (namely, OPT-Aircraft_V1.0 [30]) and investigate the effect of aircraft RS-image simulation on high-level visual tasks. This dataset is a fine-grained public dataset of aircraft in the RS field. It can provide benchmark data for future research in RS fine-grained identification, classification and processing.
Figure 1 depicts the overall workflow of ATSS-1. The scheme starts from the RS aircraft datasets. First, we use an improved conditional GAN to learn the distribution spaces of multiple aircraft type samples from random distributions such as the Gaussian or uniform distribution. Then, we use random vectors with labels as intermediaries to calculate the transformation space from the region to be simulated to the RS target. Finally, we apply Poisson blending to synthesise the final RS images. The core part mainly includes two sub-networks: (1) the RS-image modelling network AWCA-GAN and (2) the embedding network GE. In the first part, RS-image modelling uses AWCA-GAN to achieve a more elaborate latent space representation for different RS-image types. The second sub-network, GE, extracts an optimised mapping from the input real RS scene to the latent space. Together, these two sub-networks associate the potential space of the generative network with the features of the input real RS scene.
Scientific Reports | (2022) 12:320 | https://doi.org/10.1038/s41598-021-03880-x | www.nature.com/scientificreports/

Methods
Simulated images modelling: AWCA-GAN. The network structure of the AWCA-GAN, illustrated in Fig. 2, consists of a discriminator and a generator. In the generator, the input is first fed to a fully connected layer that extracts its potential characteristics. Then, three residual modules (ResBlk_G_1 to ResBlk_G_3) with global spectral normalisation (SN) layers [32] are stacked to relieve gradient vanishing and mode collapse. AWCBNA is added after ResBlk_G_1 to reallocate the inter-class-wise feature responses adaptively. One ARSA module is then employed in the generator to obtain a more robust feature representation by establishing remote regional relationships that extract the global geometric features. We use the ReLU activation function in every generator layer except the final one, because its unilateral inhibition, relatively wide excitation boundary, and sparse activation help the generator converge towards the global optimum. In the final layer, a convolutional layer with the tanh activation function constrains the output to the range [−1, 1], so that the data range is consistent with the training set when the output is sent to the discriminator. To introduce more significant nonlinearity and accelerate convergence in the discriminator, we select Leaky-ReLU instead of ReLU as the activation function, which helps keep the training process in Nash equilibrium; the leaky value of 0.2 was obtained empirically for good performance. Similar to cGANs with a projection discriminator [24], AWCA-GAN uses projection in the discriminator. See Fig. 2b and c for more detail on ResBlk_G and ResBlk_D. Finally, we use the hinge loss [21,22] form of the standard conditional adversarial loss to guide the network training effectively.
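The hinge form of the adversarial loss mentioned above can be illustrated with a minimal NumPy sketch. It uses the commonly cited hinge formulation for GANs; the score arrays are hypothetical discriminator outputs D(x, y), not values from the paper, and the function names are ours.

```python
import numpy as np

def d_hinge_loss(real_scores, fake_scores):
    """Discriminator hinge loss: push real scores above +1 and fake scores below -1."""
    real_term = np.maximum(0.0, 1.0 - real_scores).mean()
    fake_term = np.maximum(0.0, 1.0 + fake_scores).mean()
    return real_term + fake_term

def g_hinge_loss(fake_scores):
    """Generator hinge loss: raise the discriminator score of generated samples."""
    return -fake_scores.mean()

# Hypothetical discriminator outputs for a batch of three real and three generated samples.
real_scores = np.array([1.5, 0.2, -0.3])
fake_scores = np.array([-2.0, -0.5, 0.4])

d_loss = d_hinge_loss(real_scores, fake_scores)
g_loss = g_hinge_loss(fake_scores)
print(d_loss, g_loss)
```

Samples that are already on the correct side of the margin (for example the real score 1.5) contribute nothing to the discriminator loss, which is the property usually credited with stabilising training.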
Adaptive weighted conditional batch normalization attention block: AWCBNA. To develop a more robust ability to learn category feature generation, an AWCBNA module is designed to unmix the inter-class features by exploring adaptive category-weighted conditional batch normalization statistics. As shown in Fig. 3, an intermediate data cube F ∈ R^{W×H×C} is fed into a 1 × 1 × 1 convolutional kernel to learn the spatial adaptive weight matrix A ∈ R^{W×H×1}. A Softmax activation then normalises A, and F is multiplied by A to obtain the adaptive weighted channel-wise feature map M ∈ R^{C×1}. M is then passed through the conditional batch normalisation AWCBN(·), where γ(c) and β(c) are the trainable scale and bias parameters specific to class c. To further optimise the inter-class features, we adopt a simple strategy of applying two convolutional layers, with a 1 × 1 × C/4 kernel and a 1 × 1 × C kernel, together with the Leaky-ReLU and Sigmoid activation functions. As a result, we obtain the channel attention h(F), which is combined with the parameter ζ to rescale the intermediate data cube F.
Here, h(F) and ζ are the scaling factors, and ζ is initialised to 0. Through the training of h(F), ζ, and the operating factor of AWCBN(·), the AWCBNA block can recalibrate the inter-class-wise characteristics to boost the conditional generation ability.
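The data flow of the block described above can be sketched in NumPy as follows. This is an illustrative reading, not the authors' implementation: the parameter dictionary, the ordering of the two activations, and in particular the residual rescaling `F + zeta * h(F) * F` are assumptions, since the text does not state the exact combination rule.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def awcbna(F, labels, params, zeta=0.0, eps=1e-5):
    """F: (B, W, H, C) feature cubes; labels: (B,) integer class ids."""
    B, W, H, C = F.shape
    flat = F.reshape(B, W * H, C)
    # A 1x1 convolution to one channel is a per-pixel dot product: (B, W*H).
    logits = flat @ params["w_a"]
    # Softmax over spatial positions gives the adaptive weight map A.
    logits = logits - logits.max(axis=1, keepdims=True)
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)
    # Weighted spatial pooling: channel descriptor M of shape (B, C).
    M = np.einsum("bn,bnc->bc", A, flat)
    # Conditional batch normalisation with class-specific scale and bias.
    M_hat = (M - M.mean(axis=0)) / np.sqrt(M.var(axis=0) + eps)
    M_cbn = params["gamma"][labels] * M_hat + params["beta"][labels]
    # Bottleneck of two 1x1 convolutions (C -> C/4 -> C) gives channel attention h(F).
    h = sigmoid(leaky_relu(M_cbn @ params["w1"]) @ params["w2"])
    # Assumed rescaling rule: the block is the identity while zeta = 0.
    return F + zeta * h[:, None, None, :] * F

rng = np.random.default_rng(0)
B, W, H, C, num_classes = 4, 8, 8, 16, 7
params = {
    "w_a": rng.normal(size=C),
    "gamma": np.ones((num_classes, C)),
    "beta": np.zeros((num_classes, C)),
    "w1": rng.normal(size=(C, C // 4)),
    "w2": rng.normal(size=(C // 4, C)),
}
F = rng.normal(size=(B, W, H, C))
labels = np.array([0, 3, 3, 6])

out_init = awcbna(F, labels, params, zeta=0.0)
out_trained = awcbna(F, labels, params, zeta=1.0)
print(out_trained.shape)
```

Initialising ζ to 0 means the block starts as a pass-through and only gradually mixes in the class-conditional channel attention as ζ is learned.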
Asymmetric residual self attention block: ARSA. To achieve more powerful feature-related learning, we design an ARSA block based on the self-attention mechanism [21], which retains meaningful long-distance features and suppresses interfering information by establishing asymmetric global regional relations. The specific structure is shown in Fig. 4.
Like self-attention [21], the block contains three branches: the F-branch f(x_i), the G-branch g(x_i), and the H-branch h(x_i). f(x_i) and g(x_i) are used to calculate the attention level of each spatial position, with f(x) = SN(W_f * x) and g(x) = SN(W_g * x), where SN(·) represents spectral normalisation [32]. To obtain more optimised spatial characteristic parameters, the F-branch is subsampled before the F-branch and G-branch are multiplied. A Softmax activation is then applied to obtain the different attention levels. We define this process as asymmetric spatial attention ASA(·), where x_i represents the feature cube entering the ARSA block and Down(·) represents max-pooling.
The H-branch h(x_i), with h(x) = Down(SN(W_h * x)), integrates global spatial information and local information. Finally, the asymmetric attention map is mapped to the input channels through a convolutional layer with a kernel size of 1 × 1. The output is obtained by combining the input data cube x and the parameter γ.
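A minimal NumPy sketch of one possible reading of the ARSA block is given here. Several details are assumptions rather than facts from the paper: spectral normalisation is omitted, Down(·) is taken to be 2 × 2 max-pooling, the G-branch acts as the full-resolution query, and the output is taken as `gamma * o + x`. The weight matrices are random placeholders.

```python
import numpy as np

def max_pool_2x2(t):
    """2x2 max-pooling over the two spatial axes of an (H, W, C) array."""
    H, W, C = t.shape
    return t.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def arsa(x, W_f, W_g, W_h, W_o, gamma=0.0):
    """x: (H, W, C) feature cube. Returns the block output and the attention map."""
    H, W, C = x.shape
    # 1x1 convolutions are per-pixel matrix products (spectral normalisation omitted).
    f = max_pool_2x2(x @ W_f).reshape(-1, W_f.shape[1])  # (N/4, C_k), subsampled
    g = (x @ W_g).reshape(-1, W_g.shape[1])              # (N, C_k), full resolution
    h = max_pool_2x2(x @ W_h).reshape(-1, W_h.shape[1])  # (N/4, C_v), subsampled
    # Asymmetric attention: N query positions attend to N/4 pooled positions.
    attention = softmax(g @ f.T, axis=-1)                # (N, N/4)
    o = (attention @ h) @ W_o                            # (N, C), final 1x1 convolution
    return gamma * o.reshape(H, W, C) + x, attention

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 16))
W_f = rng.normal(size=(16, 2))
W_g = rng.normal(size=(16, 2))
W_h = rng.normal(size=(16, 8))
W_o = rng.normal(size=(8, 16))

out_init, attention = arsa(x, W_f, W_g, W_h, W_o, gamma=0.0)
out_trained, _ = arsa(x, W_f, W_g, W_h, W_o, gamma=1.0)
print(attention.shape)
```

Because only one side of the attention product is pooled, the attention map is rectangular (N × N/4 here) rather than square, which is where the "asymmetric" saving in memory and computation comes from in this reading.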
Here, γ is a learnable scalar that is initialised to 0 and gradually allocates more weight to the ASA map according to the global features during training.
Geospatial embedding network: GE. This section focuses on the process of mapping an input sample I ∈ R^{n×n×3} to the latent space (i.e., the generated image G(z*, y)) of the trained AWCA-GAN. If the prior distribution z* ∈ R^n is taken as the intermediate feature representation to establish the projection relationship, the task can be reduced to the following minimisation problem: z* = arg min_z −E_x D(G(z, y), y). GE follows a direct forward optimisation framework that embeds a given sample into the manifold projection of a pre-trained generator. Starting from a feasible initial latent vector z, gradient descent searches for an optimised vector z* that minimises a loss function measuring the semantic and multi-level perceptual differences between the features of the input sample and the simulated RS data G(z*, y) generated from z* ∈ R^n. The loss function is a weighted combination of three terms: the style loss L_style, the type loss, and the pixel-wise MSE loss. Here, I ∈ R^{n×n×3} is the pre-selected input sample, G(·) is the trained generator of AWCA-GAN, N is the number of scalars in the input sample, and z* is the prior vector to be optimised. α, β, and γ are controllable hyperparameters; α = 0.4, β = 0.2, and γ = 0.4 were obtained empirically for good performance.
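For orientation, one form of the combined loss that is consistent with the description above (style term weighted by α, type term weighted by β, pixel-wise MSE weighted by γ) can be written as follows. It is a reconstruction under stated assumptions, not necessarily the exact expression used by the authors; in particular, taking the type term as the negated discriminator output follows the minimisation problem quoted above.

```latex
\mathcal{L}(z^{*}) \;=\;
  \alpha \, L_{style}\!\big(G(z^{*}, y),\, I\big)
  \;-\; \beta \, D\big(G(z^{*}, y),\, y\big)
  \;+\; \frac{\gamma}{N} \,\big\lVert G(z^{*}, y) - I \big\rVert_2^{2}
```

With α = 0.4, β = 0.2, and γ = 0.4, the style and pixel terms carry equal weight and the type term acts as a lighter regulariser towards the requested class.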
In Eq. 4, the first, second, and third terms are the style loss, type reconstruction loss, and pixel loss, respectively. The type reconstruction loss measures the difference between the image generated from z* and the pre-set type, and is computed as the minimum value in the final feature space of the trained discriminator. The pixel loss is the (normalized) Euclidean distance between the generated image G(z*, y) and the input sample I. The style loss term L_style(·) penalises differences in style, such as radiation and textures.
Here, x_1, x_2 ∈ R^{n×n×3} are the two input samples, D(·) is the feature map output of the trained discriminator layers ResBlk_D_1, ResBlk_D_2, and ResBlk_D_3, respectively, and N_i is the number of scalars in the i-th layer output data cube. Each layer i has a controllable weighting hyperparameter, which was set to 1 empirically for good performance.
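The three-term objective can be sketched numerically as below. The function names, the `lambdas` argument, and the feature arrays are hypothetical: the feature lists stand in for the outputs of ResBlk_D_1 to ResBlk_D_3, and the style term is assumed to be a per-layer normalised squared distance, which matches the role of N_i in the text but is not confirmed by it.

```python
import numpy as np

def style_loss(feats_a, feats_b, lambdas=None):
    """Sum over layers of lambda_i / N_i times the squared L2 distance of feature maps."""
    lambdas = lambdas or [1.0] * len(feats_a)
    return sum(lam / fa.size * np.sum((fa - fb) ** 2)
               for lam, fa, fb in zip(lambdas, feats_a, feats_b))

def ge_loss(generated, target, feats_gen, feats_tgt, d_score,
            alpha=0.4, beta=0.2, gamma=0.4):
    """Weighted sum of style, type, and pixel terms for one candidate latent vector."""
    pixel = np.sum((generated - target) ** 2) / target.size  # pixel-wise MSE
    type_term = -d_score                                      # higher D score, lower loss
    style = style_loss(feats_gen, feats_tgt)
    return alpha * style + beta * type_term + gamma * pixel

# Stand-in data: a 32x32x3 target, a generated image offset by 1, features offset by 2.
target = np.zeros((32, 32, 3))
generated = target + 1.0
feats_tgt = [np.zeros((16, 16, 8)), np.zeros((8, 8, 16)), np.zeros((4, 4, 32))]
feats_gen = [f + 2.0 for f in feats_tgt]

loss = ge_loss(generated, target, feats_gen, feats_tgt, d_score=1.0)
perfect = ge_loss(target, target, feats_tgt, feats_tgt, d_score=0.0)
print(loss, perfect)
```

In the actual GE procedure this scalar would be differentiated with respect to z* through the frozen generator and discriminator; the sketch only shows how the terms are weighted and normalised.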

Related work
Dataset and evaluation metrics. Datasets. To experimentally verify the algorithm's effectiveness on RS images, we collected an RS dataset (OPT-Aircraft_V1.0) and made it publicly available [30]. It contains 14 aircraft types from remote sensing images worldwide, with a total of 3,594 samples of size 96 × 96 × 3. However, some aircraft types are not suitable for studying generative models because they have too few samples. Considering this, we chose seven aircraft types with comparatively many samples to guide the learning of the proposed algorithm: 656 Swept-back wing aircraft I, 320 Swept-back wing aircraft III, 75 Swept-back aircraft with leading-edge II, 192 Delta-wing aircraft, 1088 Flatwing aircraft II, 414 Propeller aircraft II, and 242 Propeller aircraft III. In addition, we augmented the data: we applied rotations of 90°, 180°, and 270°, mirror flipping, up-and-down flipping, and seven other operations to keep the subclass samples balanced, obtaining a total of 30,464 aircraft RS images. We randomly selected 90% of the aircraft samples as the training set and the remaining 10% as the testing set. The training set is named OPT_Aircraft-7. All the data were limited to a size of 32 × 32 × 3. In addition, to verify the robustness of the model, we also used MNIST [33] and Fashion-MNIST [34].
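The named augmentations and the 90/10 split can be illustrated with a short NumPy sketch. It covers only the five operations the text names explicitly (the "seven other operations" are not specified and are therefore omitted), and the helper names, seed, and random image are placeholders.

```python
import numpy as np

def augment(image):
    """Return the original (H, W, C) image plus the five named transforms."""
    return [
        image,
        np.rot90(image, k=1),  # rotation by 90 degrees
        np.rot90(image, k=2),  # rotation by 180 degrees
        np.rot90(image, k=3),  # rotation by 270 degrees
        np.fliplr(image),      # mirror (left-right) flip
        np.flipud(image),      # up-and-down flip
    ]

def train_test_split(samples, train_fraction=0.9, seed=0):
    """Shuffle the samples and split them into training and testing lists."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(samples))
    cut = int(round(train_fraction * len(samples)))
    return [samples[i] for i in order[:cut]], [samples[i] for i in order[cut:]]

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3))
views = augment(image)

train, test = train_test_split(list(range(100)))
print(len(views), len(train), len(test))
```

Square inputs keep their shape under all six views, so the augmented samples can be stacked directly into one training array.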
Evaluation metrics. The Fréchet Inception Distance (FID) [35] is used to evaluate the performance of the generative model. A smaller FID indicates a better generative model.
$$\mathrm{FID} = \lVert \mu_{data} - \mu_{g} \rVert_2^2 + \mathrm{Tr}\left(\Sigma_{data} + \Sigma_{g} - 2\left(\Sigma_{data}\Sigma_{g}\right)^{1/2}\right)$$
where µ_g and Σ_g represent the mean and covariance of the generated samples, and µ_data and Σ_data represent the mean and covariance of the training samples, respectively. In this experiment, we used Inception-v3 as the classifier and used the 1,000 dimensions of the penultimate layer of the network as the feature layer.
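As a worked illustration of the metric, the following NumPy sketch computes the Fréchet distance between two sets of feature vectors. The features here are random Gaussian stand-ins, not Inception-v3 activations, and the function name is ours. The trace of the matrix square root is obtained from the eigenvalues of the covariance product, which is valid for positive semi-definite covariances.

```python
import numpy as np

def frechet_distance(feats_data, feats_gen):
    """Frechet distance between Gaussians fitted to two (n_samples, dim) feature sets."""
    mu_d, mu_g = feats_data.mean(axis=0), feats_gen.mean(axis=0)
    sigma_d = np.cov(feats_data, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    # Tr((Sigma_d Sigma_g)^(1/2)) is the sum of square roots of the product's eigenvalues.
    eigvals = np.linalg.eigvals(sigma_d @ sigma_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_d - mu_g) ** 2)
    return float(mean_term + np.trace(sigma_d) + np.trace(sigma_g) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))   # stand-in for features of training images
shifted = real + 2.0               # same spread, mean moved by 2 in each of 4 dims

same = frechet_distance(real, real)
moved = frechet_distance(real, shifted)
print(same, moved)
```

Shifting every feature by a constant leaves the covariance untouched, so the distance reduces to the squared mean offset (4 dimensions × 2² = 16), which makes the mean term of the formula easy to check by hand.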

Implementation detail.
(1) Experimental configuration: The programming language was Python 3.6, the deep learning framework was TensorFlow 1.14.0, and the operating system was Ubuntu 18.04.3. The processor was an Intel Xeon® E5-2678 V3, the graphics card was a GeForce GTX 1080 Ti, and the memory was 64 GB.
(2) Image modelling: To assess the quality of the generated data, we fed the MNIST [33], Fashion-MNIST [34], and OPT_Aircraft-7 [30] datasets into AWCA-GAN, which we trained from scratch. To ensure more stable training of the generative model, we set imbalanced learning rates following reference [35]: the initial learning rate of the generator was 0.0001, and that of the discriminator was 0.0004. The maximum number of training epochs was set to 30,000, and the learning rate fell to 80% of its value every 300 training epochs. The entire process guided the generator towards the global optimum. With this learning rate setting, we observed no significant jump in sample quality or FID value during training. The loss of generator and discriminator [...] N(0, 1), we selected a batch of RS scenes from the DIOR [36], UCAS_AOD [37], NWPU_VHR-10 [38-40], and DOTA [41] datasets and from high-resolution images in Google Earth, and fed a 32 × 32 embeddable region to the GE network, minimising the loss function described by Eq. 4. In the GE experiment, we used the Adam optimiser to train the prior variable z* for 5000 iterations with a learning rate of 0.001. A successful embedding process should map an optimised prior vector z* to a desired simulated sample G(z*, y). In this way, when z* is passed through the trained generator, it produces an image G(z*, y) that has learned high-level semantic information from the input RS scene.
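The imbalanced, stepped learning-rate schedule can be written down in a few lines. This sketch assumes that "fell to 80% every 300 epochs" means a staircase decay, i.e. multiplication by 0.8 at each 300-epoch boundary; the function name is hypothetical and the code is framework-independent rather than the TensorFlow call the authors may have used.

```python
def stepped_learning_rate(initial_lr, epoch, decay=0.8, step=300):
    """Staircase decay: multiply the rate by `decay` once every `step` epochs."""
    return initial_lr * decay ** (epoch // step)

G_LR, D_LR = 1e-4, 4e-4  # initial generator and discriminator rates from the text

for epoch in (0, 299, 300, 600, 3000):
    g_lr = stepped_learning_rate(G_LR, epoch)
    d_lr = stepped_learning_rate(D_LR, epoch)
    print(epoch, g_lr, d_lr)
```

Under this reading the discriminator rate stays four times the generator rate throughout training, because both follow the same decay factor.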

Visual examples of AWCA-GAN. We conducted experiments on the MNIST, Fashion-MNIST and OPT_Aircraft-7 datasets to verify the AWCA-GAN algorithm's effectiveness. As shown in Fig. 5, the images generated by AWCA-GAN were closer to the ground-truth images and provided complete texture in the geometric structure. As a result, a more precise outline was obtained between the target and background, and the generated samples were more dynamic, structural and diverse. As illustrated in the visualisation results on the three datasets in Figs. 5 and 7, the images generated by AWCA-GAN are more distinct and more delicate than those of CGAN and ACGAN. This matches our expectation that a more stable loss value during training results in higher image quality. In addition, the diverse results showed that AWCBNA better resolves subclass feature entanglement and that the ARSA block better analyses the long-range correlation of feature regions. Similarly, on the MNIST and Fashion-MNIST datasets, we observed that AWCA-GAN synthesises data with high visual fidelity.
To further verify the performance of AWCA-GAN, we calculated the FID score on the MNIST (300 epochs), Fashion-MNIST (800 epochs), and OPT_aircraft-7 (10,000 epochs) datasets; lower is better. As shown in Table 1, our model achieved impressive visual effects on the MNIST and Fashion-MNIST datasets. On the OPT_aircraft-7 dataset, the FID score of AWCA-GAN improved significantly compared with that of the baseline GAN models. The scores indicate that AWCBNA and ARSA benefit AWCA-GAN and improve the simulation results effectively.

Ablation analysis of AWCA-GAN.
To explore the effects of AWCBNA and ARSA in AWCA-GAN, we carried out several ablation studies by adding AWCBNA and ARSA to different layers of AWCA-GAN's generator. The detailed experimental results are listed in Table 2, which compares the FID scores obtained when AWCBNA and ARSA are added to different layers. All models were trained on the three datasets: MNIST (300 epochs), Fashion-MNIST (800 epochs), and OPT_aircraft-7 (10,000 epochs).
The AWCBNA block at the low-level data cube (Blk-1) performed better than at the middle-to-high-level data cubes (i.e., Blk-2 and Blk-3). On MNIST, 'AWCA-GAN, Blk-1' improved the FID from the 0.35 of 'AWCA-GAN, Blk-3' to 0.27. The reason is that AWCBNA can better untangle the entangled features on the initial feature maps and adjust the values of the inter-class-wise features through adaptively weighted category conditional batch normalisation statistics. However, it played only a minor role, due to the short channels, when modelling dependencies for bigger feature maps (e.g., ResBlk-2, ResBlk-3). In addition, the comparison of AWCA-GAN (5th column of Table 2) and the ablation model without AWCBNA (4th column of Table 2) showed the effectiveness of our AWCBNA.
Table 2. Ablation study on the MNIST, Fashion-MNIST, and OPT_aircraft-7 datasets. Blk-n means that AWCBNA is added after the n-th ResBlk feature maps; the best FID is reported.
Based on the ablation network, we performed another experiment to inspect the effect of ARSA. AWCA-GAN with ARSA (4th column of Table 2) achieved better performance than AWCA-GAN without ARSA (2nd column of Table 2), and the FID of AWCA-GAN with the self-attention (3rd column of Table 2) reached 0.47, 0.76, and 19.3 on the three datasets. As described in Methods, we added the ARSA block to ResBlk-3 to capture remote dependencies via asymmetric overall regional operations. Compared with the baseline results, ARSA demonstrated its effectiveness by establishing asymmetric residual relationships between distant portions.

Examples of latent space interpolation.
To understand the latent space of generated samples within the same class on the MNIST, Fashion-MNIST and Aircraft-7 datasets, we sampled two latent vectors, z_start and z_end, from the prior distribution z ∈ R^{128} ∼ N(0, 1) and interpolated between them with the adjustable parameter µ_i to obtain continuous latent space interpolation vectors z_i = µ_i z_start + (1 − µ_i) z_end, µ_i ∈ [0, 1]. Figure 8 shows the latent space interpolation results G(z*, y) of the trained AWCA-GAN on the three datasets. For OPT_Aircraft-7 in Fig. 8c, we can infer that AWCA-GAN has learned interesting and pertinent latent space representations. Specifically, across the 17 vectors in the interpolation series z_i, the potential space evolves with smooth transitions, and each generated image in the latent space appears to be an aircraft. In row 2 of Fig. 8c, a flat-wing plane turned gradually from southwest to south. In row 1 of Fig. 8c, a flat-wing plane faced southwest and turned south step by step. These results demonstrated that our generative model could maintain the semantic context in the potential space, confirming that AWCA-GAN succeeded in controlling the modified region with the user-specifiable embedding coefficients.
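The interpolation rule z_i = µ_i z_start + (1 − µ_i) z_end can be reproduced with a few lines of NumPy. The sketch assumes 17 evenly spaced coefficients and a 128-dimensional standard normal prior, as stated in the text; the even spacing, the seed, and the function name are our assumptions, and the generator call itself is left out.

```python
import numpy as np

def interpolate_latents(z_start, z_end, steps=17):
    """Return `steps` latent vectors moving linearly from z_start to z_end."""
    mus = np.linspace(1.0, 0.0, steps)  # mu = 1 gives z_start, mu = 0 gives z_end
    return np.stack([mu * z_start + (1.0 - mu) * z_end for mu in mus])

rng = np.random.default_rng(0)
z_start = rng.standard_normal(128)
z_end = rng.standard_normal(128)

z_series = interpolate_latents(z_start, z_end)
print(z_series.shape)
```

Each row of `z_series` would be passed, together with a fixed class label y, through the trained generator to produce one frame of an interpolation strip like those in Fig. 8.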
Analysis of embedding results by using GE network. The embedding results were computed by GE from several samples I ∈ R^{n×n×3} and an embedding label y. In Fig. 9a, two real RS scenes and one aircraft image were fed into the GE network, and each input sample was embedded into the seven classes of the OPT_Aircraft-7 dataset. The results demonstrated that the GE network learns features of the input samples and finds more nuanced latent space representations to produce excellent embeddings. For example, in the top row with the dark background and the middle row with the bright scene in Fig. 9a, the embedding aircraft exhibited [...] Fig. 9a also demonstrated that the GE network could capture structural details of the input aircraft samples, such as direction, lighting, and shadows. Figure 9b and c depict the embedding results and the loss curve of the input RS images on OPT_Aircraft-7. During embedding, the GE network can learn properties such as the intensity of light and the direction of shadows from the input aircraft samples. The loss function on the OPT_Aircraft-7 dataset converges at around 800 optimisation steps.
Furthermore, we performed embedding experiments on the MNIST and Fashion-MNIST datasets. Figure 10 shows five embedding examples and their loss values. Figure 10a presents the embedding results of the digit 0 into the digits 1 to 9, and four embedding results of Fashion-MNIST samples. Interestingly, the converted digits 1 to 9 successfully capture the narrow texture features of the input digit 0 samples and appear to be written by the same person (e.g., row 1 in Fig. 10a). Similarly, the converted clothes have a skinniness similar to that of the input clothes samples (e.g., rows 2 and 4, columns 1 and 4 in Fig. 10a). This phenomenon revealed that the GE network has good generalisation and expressive power. The well-converged training history of GE is shown in Fig. 10b and c. The loss function on the MNIST dataset converged at approximately 800 optimisation steps, and that on the Fashion-MNIST dataset at approximately 1200 optimisation steps.
Simulation results of ATSS-1. Because the trained GE and AWCA-GAN models cannot disentangle objects from the ground-truth images, naively pasting the generated clip onto the target image can produce artifacts in the region surrounding the object of interest. We cleaned up these artifacts by applying Poisson blending to the region of interest. Figure 11 depicts four examples comparing ATSS-1 with direct pixel collage. We succeeded in making semantic modifications such as "changing aircraft type" (row 2, column 1 in Fig. 11). In columns 4 and 8 of Fig. 11, the embedding samples are fused into the real RS scene using Poisson fusion and appear natural in the final fused results because the intensity of the feature space is changed. By comparison, the direct collage method (columns 3 and 7) exposed abrupt boundaries between the aircraft and the background. ATSS-1 could thus synthesise simulated images with high visual quality and satisfactory visual effects.
Figure 11. Examples of ATSS-1 vs pixel collage (naive copy-and-paste). The purple border marks the real scenes. The arrows represent the relevant parameters of ATSS-1, including geographical coordinates, embedding type, direction, and resolution. The red and blue regions in columns 2 and 6 present the blending mask, where red marks the collage area and blue marks areas that are not collaged. Columns 3 and 7 show the pixel collage.
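To make the difference between naive pasting and Poisson blending concrete, the following sketch solves the discrete Poisson equation on a grayscale image with plain Jacobi iterations. It is a didactic stand-in, not the blending code used for ATSS-1: the function name, the single-channel restriction, the requirement that the mask stays away from the image border, and the iteration count are all assumptions, and a practical pipeline would use a sparse solver or an existing library routine instead.

```python
import numpy as np

def poisson_blend(source, target, mask, iterations=2000):
    """Blend `source` into `target` inside the boolean `mask` (2-D grayscale arrays).

    The mask must not touch the image border.
    """
    src = source.astype(float)
    out = target.astype(float).copy()
    out[mask] = src[mask]  # start from the naive paste
    # Discrete Laplacian of the source acts as the guidance field.
    lap = np.zeros_like(src)
    lap[1:-1, 1:-1] = (4.0 * src[1:-1, 1:-1]
                       - src[:-2, 1:-1] - src[2:, 1:-1]
                       - src[1:-1, :-2] - src[1:-1, 2:])
    for _ in range(iterations):
        neighbours = np.zeros_like(out)
        neighbours[1:-1, 1:-1] = (out[:-2, 1:-1] + out[2:, 1:-1]
                                  + out[1:-1, :-2] + out[1:-1, 2:])
        updated = (neighbours + lap) / 4.0
        out[mask] = updated[mask]  # pixels outside the mask keep the target values
    return out

rng = np.random.default_rng(0)
target = rng.uniform(0, 255, size=(16, 16))
source = target + 5.0                      # same texture, uniformly brighter
mask = np.zeros((16, 16), dtype=bool)
mask[4:12, 4:12] = True

naive = target.copy()
naive[mask] = source[mask]                 # direct pixel collage
blended = poisson_blend(source, target, mask, iterations=500)
print(np.abs(naive - target).max(), np.abs(blended - target).max())
```

In this toy case the pasted patch differs from the scene only by a brightness offset, so the naive collage leaves a visible step of 5 grey levels at the mask boundary, while the blended result matches the surrounding scene because only the gradients of the patch are kept.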

Discussion
We proposed a novel simulation system based on an improved conditional GAN to achieve 4V modelling of different RS images. ATSS-1 put forward a new line of exploration and addressed three existing problems. First, to address the feature entanglement between sub-classes that arises when a simple GAN is used to generate images of different sub-classes, we proposed an AWCBNA module in AWCA-GAN. AWCBNA reassigns the category-wise responses adaptively by mining the inter-class-wise feature statistics. Second, we proposed ARSA, an improved self-attention module, for more detailed texture modelling. In ARSA, an asymmetric self-attention map is formed in the spatial channel to achieve a better feature representation. Third, we introduced the GE network at the front end of AWCA-GAN to optimise the mapping from the input RS scene to the latent space of the generated target. In this manner, we preliminarily achieved generated targets that change with properties of the RS scene such as brightness, shadow, and direction. In particular, we collected and published an aircraft dataset, OPT_Aircraft-7, which was successfully applied in our simulation experiments and demonstrated its effectiveness. In addition, we performed ablation experiments on the proposed modules and compared them with state-of-the-art methods, showing superior results. Enabled by the disentanglement and the fine space mapping properties, ATSS-1 realised 4V modelling on a given RS dataset.
Overall, this paper preliminarily explores a 4V modelling method for a variety of RS aircraft. Our generative model with a joint embedding network demonstrates excellent development potential for the M&S of RS. As a first attempt, we mainly conducted simulation analysis on the OPT_aircraft-V1.0 dataset and configured a simulation unit at a low resolution of 32 × 32 pixels. Future research will raise the resolution of simulation targets to 64 × 64 or 128 × 128 pixels. At the same time, we will expand the RS target dataset to simulate different types of targets, such as buildings, traffic vehicles, and rivers, to provide a rich and influential data foundation for high-resolution RS.