Domain randomization-enhanced deep learning models for bird detection

Automatic bird detection in ornithological analyses is limited by the accuracy of existing models, due to the lack of training data and the difficulties in extracting the fine-grained features required to distinguish bird species. Here we apply the domain randomization strategy to enhance the accuracy of deep learning models for bird detection. Trained with virtual birds of sufficient variations in different environments, the model tends to focus on the fine-grained features of birds and achieves higher accuracies. Based on 100 terabytes of 2-month continuous monitoring data of egrets, our results reproduce findings obtained with conventional manual observations, e.g., vertical stratification of egrets according to body size, and also open up opportunities for long-term bird surveys requiring intensive monitoring that is impractical using conventional methods, e.g., the weather influences on egrets, and the relationship between the migration schedules of the great egrets and little egrets.

This study focuses on improving the accuracy of deep learning models for bird detection from recorded video data. Domain randomization, which makes use of virtual objects and/or environments to augment the data variability for model training, has been found effective in building models that generalize to different complex real-world circumstances 42. For example, domain randomization with synthetic data helps to train robotic manipulation for grasping specific objects 43 and for learning collision avoidance 44, with promising results. Inspired by these studies, we generated synthetic data by creating virtual birds (great and little egrets, shown in Fig. 2) in different environments so as to further augment the data size and, at the same time, to ensure sufficient variations for rendering the fine-grained bird features, i.e., neck and head, that are used by experts to distinguish bird species. We then pretrained Faster R-CNN 32 (a deep learning detection model; see Fig. 3a) with the synthetic data, i.e., virtual birds with different real-world backgrounds, followed by fine-tuning the model with real data collected at the test site. Based on the detection results of the continuously monitored data, we conducted analyses of bird behaviour that were practically difficult in the past due to expensive data collection and limited labeled data. Our results not only provide more evidence to support previous studies (e.g., nest site selection of the great egrets and little egrets), but also reveal interesting findings (e.g., weather influences and daily schedules of egrets), suggesting the potential of our proposed innovation for better habitat management, conservation policies and protection measures.

Results
Performance evaluation of the domain randomization-enhanced deep learning model. We selected Faster R-CNN 32 to detect different bird species from real images, including great egrets, little egrets and other birds (mainly black-crowned night herons). ResNet-50 45 was used as the backbone for feature extraction, and a feature pyramid network (FPN) 46 was applied to efficiently extract multiscale features. The real-world data was collected with the Green AI Camera (see "Methods"). The labeled data were split into 900, 100 and 110 images as the training, validation and testing sets (see "Methods"). With domain randomization, we generated 1000 synthetic images for model pretraining, then fine-tuned the model with the 900 real images. Figure 4 depicts an example of the detection results, and the analysed video is provided in the Supplementary Video. The domain randomization-enhanced model is capable of distinguishing and localizing bird species with high prediction scores under different backgrounds (e.g., clear sky, and partially covered by leaves and branches), achieving a mean average precision (mAP, at an intersection over union of 0.5) of 87.65% (see "Methods"). We next examined the advantages of domain randomization in augmenting the detection accuracy. Figure 5 depicts the perception achieved by the models in distinguishing bird species, based on the feature maps computed at the last layer of the ResNet-50 backbone, together with the baseline model (Faster R-CNN trained with real images only) for comparison. We observe that the domain randomization-enhanced model focuses on the subtle features of the neck and head, which are the essential fine-grained features used by experts to distinguish bird species. On the other hand, the baseline model tends to identify bird species from the color and textural features of the body, which may not be the optimal criteria.
It should be noted that bird size, one of the features used by human experts, is not considered by the deep learning models, as the observed size changes with the distance from the camera and depth information is unavailable from the images.
We further examined the effectiveness of domain randomization by performing quantitative evaluation. In addition to the baseline model (Faster R-CNN trained with real images only), the weakly supervised method based on the attention mechanism 15 (Fig. 3b), which locates the "parts" of birds and then extracts the fine-grained features, was also used as a training strategy for comparison. We used a small subset of the real images for the comparison to highlight the strength of domain randomization under a limited amount of labeled data. We used mAP (IoU = 0.5) to evaluate the performances of all four cases based on the test set (see "Methods"). The reported mAPs (± standard deviation) of the four cases are 0.342 ± 0.017, 0.397 ± 0.017, 0.329 ± 0.031 and 0.385 ± 0.036. For reference, the precision-recall curves (at IoU = 0.5) for all four cases are also presented (see Fig. 3c), which indicate the trade-off between precision and recall values at different sensitivity thresholds. The domain randomization-enhanced model (Case 2) achieves the most balanced result, with better precision scores over a wide range of recall scores between 0.11 and 0.58.

Figure 2. Application of domain randomization to enhance the accuracy of the detection models. Virtual great egret (a) and little egret (b) were created with the open-source 3D graphics toolset Blender. Synthetic image (c) was generated by merging virtual 3D models and 2D background images. When merging, we applied a large variety of the prominent features, such as body size and pose, at different camera viewpoints and locations in the images, in order to force the models to focus on the fine-grained bird features, which are essential features used by experts for bird identification.
Consistent with the previous analysis, Case 2 outperforms the baseline model (Case 1), asserting the effectiveness of synthetic data with sufficient variations in enabling the models to focus on the fine-grained features of the birds. Although the attention mechanism-based model has state-of-the-art performance in some fine-grained visual categorization tasks 15, overall its performance is unstable in our study, reflected by the relatively low mAP and high standard deviation of Cases 3 and 4. We believe that the failure to gain advantage from the attention mechanism is due to the restricted resolution of the regional feature maps of the birds (output of the "region of interest pooling" stage shown in Fig. 3b) 39,41, as the original dimension of the birds is small (i.e., about 0.095% of the image size, see "Methods"). We also notice a surge of the great egret counts between the end of October and the beginning of November, which can be explained by the influx of the migratory population during this period 47. For the little egrets, our analysis shows a similar trend of counts as the great egrets, with a Pearson correlation coefficient of 0.89, suggesting highly correlated migratory activities between the two species. While supporting the existence of a migratory population of the little egrets 47,48, this finding also motivates us to conduct more studies about the migratory behaviour of, and the interaction between, egrets.

Analyses of the egret behaviour.
We scaled down the monitoring duration to observe the daily schedule of the egrets. For each individual day, we calculated the backward 1-h moving average of the bird counts at each time point, then took the time points with the maximum average values in the morning and evening as the departure and return times of the egrets. The departure times of the great egrets and little egrets are similar during the study period (from 04:33 to 07:31, see Fig. 6b), supported by the hypothesis test HT1 (p-value = 0.296, not statistically significant to reject that the departure times are similar; see "Methods"). On the other hand, the little egrets return later than the great egrets on most days (from 16:03 to 19:10, see Fig. 6b), with a reported p-value = 0.098 (marginally significant to reject that the little egrets return earlier than or at the same time as the great egrets, in HT2). We believe that the daily schedule of the egrets is highly related to prey availability under different foraging habitat preferences (e.g., in Hong Kong, little egrets mostly forage at commercial fishponds while great egrets forage at mudflats 49,50).
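The departure/return estimation described above can be sketched as follows. This is a minimal illustration assuming counts sampled at 1-min intervals; the function names and the noon split point are our own, not from the original pipeline.

```python
import numpy as np

def backward_moving_average(counts, window=60):
    """Backward 1-h moving average for 1-min samples: each point averages
    the preceding `window` samples (inclusive of the current one)."""
    counts = np.asarray(counts, dtype=float)
    out = np.empty_like(counts)
    for i in range(len(counts)):
        out[i] = counts[max(0, i - window + 1): i + 1].mean()
    return out

def departure_return_times(counts, times, noon_index):
    """Take the peak of the smoothed counts before noon as the departure
    time and the peak after noon as the return time (hypothetical helper)."""
    smooth = backward_moving_average(counts)
    dep = int(np.argmax(smooth[:noon_index]))
    ret = noon_index + int(np.argmax(smooth[noon_index:]))
    return times[dep], times[ret]
```

For a 04:00-20:00 recording day at 1-min resolution, `counts` has 960 entries and `noon_index` is 480; the returned entries of `times` are the presumed departure and return times.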
In addition to the temporal perspective, we analysed the spatial distribution of egrets at the study site via heatmap visualization (see "Methods"). We observe some horizontal overlap between the great egrets and little egrets in the middle right regions of the trees (see Fig. 7a). However, in terms of elevation, the coverage of the great egrets is wider, i.e., from the middle to the top of the trees (Fig. 7a), whereas the little egrets mainly stay around the middle region of the trees (Fig. 7a). This pattern of vertical stratification is, to a certain extent, consistent with the observations of other studies 51-53, where birds align themselves vertically according to their actual body size (not the body size represented in images, which is influenced by the distance to the camera). Furthermore, we split the testing periods according to the trends observed in the bird counts (Fig. 6a) to identify the spatial change of the egret hotspots (Fig. 7b). For the great egrets, the hotspot near the middle right region of the trees remains relatively constant in size and intensity, whereas the hotspots at the top of the trees (vulnerable to wind) shrink and grow over time; similar changes are observed for the little egrets. Based on the aerial view (Fig. 1b), we observe that the hotspots (at the middle right region of Fig. 7b) are sheltered from winds/gales in the north/south direction, which renders this location a favorable habitat for egrets to inhabit, breed and hatch.

We attempted to probe the relationship between weather factors and bird activities (reflected by the bird counts) by building a multivariable linear regression model. The weather data were collected from the Hong Kong Observatory Open Data 54. We used the natural logarithm of the total bird counts to reflect its proportional change over the study period. We detrended the data based on the hypothesis test HT3 (p-value = 0.025, which is statistically significant to reject that time has no influence on the regression; see "Methods"). Table 1 summarizes the multivariable linear regression analysis. We found that the influence of the combined weather factors is highly statistically significant (p-value of the F-test = 0.002) and undertook further analysis of each factor. Total Bright Sunshine (p-value = 0.031) is statistically significant in affecting bird activities, consistent with previous studies reporting that sunlight influences migration, foraging and other activities of birds 55-57. Prevailing Wind Direction (EW, p-value = 0.011) is also statistically significant, in line with reports that wind has a negative effect on nestling survival 51.

Figure 5. Visualization of the perception made by detection models for distinguishing bird species. We compared our proposed method with the same architecture trained with real images only (the baseline model). For examples of different bird species (first column), the feature maps (responses of the model) of the baseline model and the domain randomization-enhanced model are shown in the second and third columns, respectively. The viridis colormap is used to illustrate the intensity of the response, where yellow represents the strongest response. The feature maps are overlaid on the bird images to depict the features used by the baseline model (fourth column) and the domain randomization-enhanced model (fifth column) for bird detection. Model pretraining with the synthetic images forces the model to focus on the fine-grained features, such as neck shape and beak color, which are the crucial features used by experts for bird identification.
Although several studies suggested that temperature and humidity might play crucial roles in bird behaviour 58,59, e.g., varying air temperature and humidity might influence water temperature, possibly altering the activity levels of forage fishes, which in turn affects the foraging efficiency of egrets 60, our results in Table 1 show that Daily Mean Temperature (p-value = 0.080) and Mean Relative Humidity (p-value = 0.089) are only marginally significant. While supporting the influence of some weather factors on bird behaviour, more rigorous analysis (e.g., collecting in situ measurements using a local weather station) is required to validate the hypothesis.

Discussion
Here we leverage domain randomization to enhance the accuracy of bird detection models. We create synthetic data, i.e., virtual egrets merged onto real backgrounds, to tackle the lack of labeled bird data. Importantly, we demonstrate that by pretraining deep learning models with synthetic data of sufficient variations, we can force the model to focus on the fine-grained features of the great egrets and little egrets, which are the crucial features used by human experts for bird detection. This could be useful and applicable for the detection of other bird species with limited labeled data whose features are highly similar (e.g., species under the same family).
We explore the application of domain randomization-enhanced deep learning models based on the 2-month monitoring data of one test site. Our findings demonstrate multiple potential advantages over conventional monitoring, which is limited by visual techniques and human endurance. Our deep learning-based object detection enables extremely intensive surveys of bird counts of different species, behavioural studies, spatial preferences, and inter- and intra-specific interactions, under different weather conditions (i.e., 12 weather factors were used in this study). In our study, for instance, an influx of the total bird counts might indicate the presence of migratory birds, which could be useful for studying bird migratory behaviour and patterns. This new technology could therefore be applied at important breeding, stopover or wintering sites, e.g., the RAMSAR wetlands or Important Bird and Biodiversity Areas (IBAs), in order to monitor bird numbers and arrival and departure times. A high-speed, high-resolution camera could further allow assessment of diets and the search time for prey items. Ornithologists could also examine the hotspots of the bird intensity maps and the influence of different weather factors (e.g., wind intensity, relative humidity, temperature) to investigate the nest site selection and habitat preferences of certain bird species across different periods, e.g., during the breeding season. Furthermore, the daily record of the bird departure and return times facilitates the planning of environmental monitoring and audit. Construction activities, which inevitably create noise, could be scheduled at periods with the fewest birds, thereby minimizing disturbance to birds during inhabiting, foraging and breeding activities.
Following this study, the established framework can be applied to other sites for monitoring the same or different bird species, so as to implement more thorough investigations and obtain more conclusive evidence on bird behaviour. Furthermore, datasets of different bird species can be continuously collected and annotated to augment the size of the training set. While automating bird detection, more sophisticated deep learning models and training strategies can also be implemented to enhance model accuracy. To summarise, this study presents a paradigm shift from conventional bird detection methods. The domain randomization-enhanced deep learning detection models (together with the Green AI Camera) enable wide-scale, long-term and end-to-end bird monitoring, in turn informing better habitat management, conservation policies and protective measures.

Figure 7 (caption fragment). The testing periods were split according to the trends in Fig. 6a to observe the associated changes; the intensity of a hotspot reflects the bird counts at a region, which could be used as an indicator to study the nest site selection and habitat preference of different bird species.

Methods
Data collection and annotation. We invented a tailor-made Green AI Camera (see Fig. 1a). The term "Green" refers to applications related to environmental monitoring and ecological conservation, and the term "AI" (Artificial Intelligence) refers to the combination of different types of machine learning and deep learning algorithms used to analyse the huge amounts of data generated by this data collection system. The Green AI Camera exhibits several advantages that meet the needs of long-term monitoring in outdoor areas: (i) automatic continuous recording; (ii) high-resolution videos; (iii) high-frame-rate videos; (iv) huge local data storage; and (v) protection against harsh environments (e.g., extreme weather conditions). As shown in Fig. 1a, six cameras of 4K resolution are used to continuously record videos with a wide field of view. The videos are acquired at a rate of 25 frames per second with High Efficiency Video Coding (also known as H.265) to preserve the visual quality of the high-resolution inputs. The external part of the Camera is an optical window composed of two parallel curved surfaces, which provides protection to the electronic sensors and detectors and, at the same time, ensures a clear and distortion-free field of view.
In this project, we installed the Green AI Camera at Penfold Park, Hong Kong, China (22° 23′ 57.22″ N, 114° 12′ 24.42″ E; see Fig. 1b for the aerial view of the study site) to monitor the bird behaviour at trees in the centre of a pond (see Fig. 1c for the recorded background image). The dominant bird species found at this park are great egrets (Egretta alba) and little egrets (Egretta garzetta). Other birds include black-crowned night herons (Nycticorax nycticorax), Chinese pond herons (Ardeola bacchus) and eastern cattle egrets (Bubulcus ibis) 7. The data used in this study covered two periods, i.e., 2019-05-13 to 2019-09-13 and 2019-09-23 to 2019-11-26 (62 days); the recording time for each day was from 04:00 to 20:00, as the usage of infrared is prohibited at the study site (which is completely dark at night). For the first period, we annotated 900 images from 2019-05-13 to 2019-06-08 for model training, 100 images from 2019-08-01 to 2019-08-17 for model validation, and 110 images from 2019-08-18 to 2019-09-13 for model testing. For the second period, the 2-month continuous monitoring data was used for the analyses of the egret behaviour, based on the egrets detected by the trained model. We annotated images by drawing bounding boxes around the observed birds; three labels (great egrets, little egrets and other birds, mostly black-crowned night herons) were used, and the bird counts for each label were 2591, 1401 and 372, respectively. Our labels were sent to ornithologists for verification. The image dimension was 2139 × 1281 pixels; both the length and width of the bird bounding boxes were within the range of 42 to 160 pixels, and the average and maximum sizes of birds were ~0.095% and 2.94% of the image, respectively.

Domain randomization. Domain randomization enables the generation of 3D models with desired features and the rendering of them on specific 2D backgrounds.
We used the open-source 3D graphics toolset Blender (v2.79) to create virtual birds, i.e., a great egret and a little egret. Other birds, such as black-crowned night herons, were not considered as their features are highly distinctive compared to egrets. During the development of the virtual images, we applied a large variation of the prominent features (e.g., pose, and body size presented in images due to varying distances from the camera) and of the environmental and operational conditions (e.g., lighting condition, camera viewpoint and bird location in images), which in turn forced the models to focus on the fine-grained bird features. We used Inverse Kinematics to adjust the armature (bones) of the birds to create different poses. Then, we carefully selected background images that contained no real egrets for creating synthetic images. We merged the 3D models onto the 2D backgrounds by pasting the virtual egrets using Gaussian paste, Poisson paste and direct paste at any reasonable location in the images. When pasting, the bird size distribution was set as uniform, ranging from 0.04% to 0.56% of the background dimension. Other attributes included applying light at a uniform distribution between 0.6 and 1.0 of the maximum light value of Blender, and setting the camera viewpoint as a uniform joint distribution of the three Euler angles.

Table 1. Multivariable linear regression analysis. Results of the overall model fit and parameter estimates are compiled to examine the weather influence on the bird activities. Hypothesis tests were carried out at a significance level of 0.05. Bold values indicate p-values less than 0.05 (statistically significant) and 0.10 (marginally significant). *Individual weather factors that are statistically significant to the count of all birds (p-value < 0.05). **Individual weather factors that are marginally significant to the count of all birds (p-value < 0.10).
All these computing procedures were deployed by creating a Python plug-in for Blender. A total of 1000 synthetic images was created for model pretraining.
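The randomized pasting step can be sketched as follows. This is a minimal, dependency-free illustration operating on NumPy image arrays (the actual pipeline is a Blender Python plug-in with Gaussian/Poisson blending); the helper names are our own, and only the direct-paste case with the 0.04-0.56% size range from the text is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def paste_virtual_bird(background, bird, area_fraction):
    """Directly paste a rendered bird sprite onto a background at a random
    location, scaled so its area is `area_fraction` of the background."""
    bg = background.copy()
    H, W = bg.shape[:2]
    # Scale the sprite so that its area matches the requested fraction.
    target_area = area_fraction * H * W
    scale = (target_area / (bird.shape[0] * bird.shape[1])) ** 0.5
    h = max(1, int(bird.shape[0] * scale))
    w = max(1, int(bird.shape[1] * scale))
    # Nearest-neighbour resize (keeps the sketch dependency-free).
    rows = (np.arange(h) * bird.shape[0] / h).astype(int)
    cols = (np.arange(w) * bird.shape[1] / w).astype(int)
    sprite = bird[rows][:, cols]
    # Random top-left corner such that the sprite fits inside the image.
    y = rng.integers(0, H - h + 1)
    x = rng.integers(0, W - w + 1)
    bg[y:y + h, x:x + w] = sprite
    return bg, (x, y, w, h)

def synthesize(background, bird):
    """One synthetic sample: bird area uniform in [0.04%, 0.56%] of the image."""
    frac = rng.uniform(0.0004, 0.0056)
    return paste_virtual_bird(background, bird, frac)
```

The returned box `(x, y, w, h)` doubles as the bounding-box annotation for pretraining, which is what makes synthetic data labeling free.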
Network architecture and training details. We selected Faster R-CNN 32 (Fig. 3a) for performing object detection, due to its satisfactory performance in similar tasks 14. Faster R-CNN is a two-stage detector. In the first stage, Faster R-CNN extracts feature maps from the input image using a convolutional layer (see Fig. 3a) and proposes potential image regions that contain target objects with a region proposal network.
In the second stage, based on the proposed regions, Faster R-CNN extracts the bounding box and category information of objects using a region of interest head. In this study, the term "convolutional layer" refers to the feature extraction backbone used to extract and learn features from the inputs; a ResNet-50 45, whose residual connections ease the training of deep networks, was used as the backbone. Furthermore, we adopted the Feature Pyramid Network 46, which uses a top-down structure with lateral connections to produce feature maps at all scales, to fuse the multi-scale information and thus enhance the detection of small objects (birds only occupy small areas of the recorded images). The overall architecture of our model was similar to RetinaNet 34, except that cross entropy, instead of the focal loss, was used as the loss function. The focal loss, which tackles the problem of imbalanced datasets, was not needed as the synthesized datasets balanced the proportions of the great egrets and little egrets.
We applied stochastic gradient descent as the optimizer for model training, with a weight decay of 0.0001, a momentum of 0.9 and a learning rate of 0.0005. We first pretrained the Faster R-CNN with the synthetic images, then fine-tuned the pretrained model with the real images. The training was deployed with two GeForce RTX 2080 Ti GPUs and the batch size was two per GPU. For comparison, we used the attention mechanism 15 to build similar object detection models (see the dotted region in Fig. 3b for the attention mechanism) under the same training settings.
Model evaluation metrics. We adopted the commonly used mean average precision (mAP) to evaluate the model performance. The mAP metric jointly takes precision, recall and intersection over union (IoU) into consideration, where precision measures how relevant the predictions are among all the retrieved instances; recall reflects the fraction of the relevant instances that are actually retrieved; and IoU is the ratio of the intersection to the union of the predicted and ground-truth bounding boxes. For a specific IoU, mAP is computed by averaging the precision value over recall values from 0 to 1. For all analysed cases, we trained the model ten times and ran inference for each of the trained models. The mAP (IoU = 0.5) was reported in the format of "mean ± standard deviation". We also plotted the precision-recall curves (at IoU = 0.5) for all four cases.
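As a minimal illustration of these metrics (not the exact evaluation code, which follows the standard detection protocol), IoU and per-class average precision can be computed as:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def average_precision(scores, is_true_positive, n_ground_truth):
    """All-point interpolated AP: area under the precision-recall curve.
    `is_true_positive[i]` marks whether the i-th detection matched a
    ground-truth box at the chosen IoU threshold (e.g. 0.5)."""
    order = np.argsort(scores)[::-1]               # rank detections by score
    flags = np.asarray(is_true_positive)[order]
    tp = np.cumsum(flags)
    fp = np.cumsum(~flags)
    recall = tp / n_ground_truth
    precision = tp / (tp + fp)
    # Make precision monotonically non-increasing, then integrate over recall.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))
```

mAP at IoU = 0.5 is then the mean of `average_precision` over the three classes (great egret, little egret, other birds).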
Perception made by models for bird detection. We attempted to visualize the feature maps produced by the deep learning models, to shed light on how the models localize and classify different bird species. However, noise is present in the feature maps. Such noise components lie in the null space of the matrix of the affine transformation operator that follows these feature maps in the network; they are mapped to zero by the affine transformation and eventually omitted by the network. Therefore, in order to effectively visualize the feature maps without the noise influence, we split the features into the row space and null space of the aforementioned matrix, followed by extracting the row-subspace features to visualize the model-related information 61. Supposing that the matrix of the affine transformation operation is A, the feature maps are x, the coordinates of x in the row subspace are x̃, the feature maps in the row subspace are x_r and the feature maps in the null subspace are x_n, the decomposition can be performed as 61:

x̃ = (AAᵀ)⁻¹Ax, x_r = Aᵀx̃, x_n = x − x_r.

After extracting the row-subspace features, the dimensions of the remaining feature maps were in the hundreds, which is difficult to visualize. Therefore, we applied principal component analysis (PCA) to reduce the dimensions of the remaining feature maps to three and then used the weighted average of these three dimensions for visualization. As the first dimension usually carries the most important information, we applied the heaviest weight to the first dimension, followed by the second and third dimensions. The weights used herein were the coefficients used to convert RGB images to grayscale:

V = 0.299V₁ + 0.587V₂ + 0.114V₃,

where V is the weighted feature map, and V₁, V₂ and V₃ are the first, second and third dimensions after applying PCA.
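A sketch of this two-step visualization (row-space projection, then PCA with the grayscale weights) in NumPy, assuming A has full row rank; the function names are ours:

```python
import numpy as np

def row_space_features(A, x):
    """Split feature vector x into its row-space component (kept for
    visualization) and null-space component (the 'noise' that the affine
    layer maps to zero)."""
    coords = np.linalg.solve(A @ A.T, A @ x)  # coordinates in the row subspace
    x_r = A.T @ coords                        # row-space component
    x_n = x - x_r                             # null-space component
    return x_r, x_n

def weighted_pca_map(features):
    """Reduce (H*W, C) feature maps to a single channel: PCA down to three
    dimensions, then combine with the RGB-to-grayscale weights."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    v = centered @ vt[:3].T                   # first three principal components
    return 0.299 * v[:, 0] + 0.587 * v[:, 1] + 0.114 * v[:, 2]
```

Verifying that `A @ x_n` is (numerically) zero confirms that the discarded component is exactly the part the network's affine layer ignores.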
Heatmap showing the spatial distribution of egrets. Heatmaps were created to visualize the spatial distribution of egrets based on random field theory. We first partitioned the recorded video frames into cells of 200 × 200 pixels. For each cell, we applied a spatial smoothing to estimate the count of the k-th bird species (k = 1 for the great egret, k = 2 for the little egret) at time t:

x_{t,p_i,k} = Σ_{j=1}^{n} e^{−β·d(p_i, p_j)} c_{t,p_j,k},

where x_{t,p_i,k} is the count of the k-th bird species at cell p_i at time t, after spatial smoothing over all n cells; d(p_i, p_j) is the Euclidean distance between the central points of cells p_i and p_j; c_{t,p_j,k} is the number of birds located at the corresponding cell; and β is a smoothing constant, satisfying β ≥ 0. Following that, we applied an exponential smoothing on x_{t,p_i,k}:

s_{t,p_i,k} = α·x_{t,p_i,k} + (1 − α)·s_{t−1,p_i,k},

where s_{t,p_i,k} is the count of the k-th bird species at cell p_i at time t after spatial and temporal smoothing; and α is a smoothing constant, with 0 ≤ α ≤ 1. After computing all s_{t,p_i,k}, we averaged them over time to create the heatmaps within a specified period.
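The two smoothing steps can be sketched in NumPy as follows. Treat this as illustrative: the exponential-distance form of the spatial kernel is our assumption (consistent with a smoothing constant β ≥ 0), while the temporal step is standard exponential smoothing.

```python
import numpy as np

def spatial_smooth(counts, centers, beta=0.01):
    """counts: (n,) raw bird counts per cell; centers: (n, 2) cell centres
    in pixels. Each cell accumulates contributions from all cells,
    weighted by exp(-beta * Euclidean distance)."""
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    return np.exp(-beta * d) @ counts

def temporal_smooth(x_series, alpha=0.3):
    """Exponential smoothing over time of the spatially smoothed counts;
    x_series has shape (T, n_cells)."""
    s = np.empty_like(x_series, dtype=float)
    s[0] = x_series[0]
    for t in range(1, len(x_series)):
        s[t] = alpha * x_series[t] + (1 - alpha) * s[t - 1]
    return s
```

Averaging `temporal_smooth(...)` over the time axis for a chosen period, then reshaping to the cell grid, yields the heatmap of that period.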

Statistical analyses.
We computed the Pearson correlation coefficient to identify the correlation between the counts of the great egrets and little egrets. The daily schedules of the great egrets and little egrets were studied with two hypothesis tests using the following null hypotheses: (i) the departure times of the great egrets and little egrets are the same (HT1, tested with a two-tailed Student's t-test); and (ii) the return time of the great egrets is equal to or later than that of the little egrets (HT2, tested with a one-tailed Student's t-test). A significance level of 0.05 was chosen for all hypothesis tests. We also built a multivariable linear regression model to study the weather influence on bird activities. Prior to that, a hypothesis test (HT3) was conducted with a two-tailed Student's t-test to examine whether data detrending was required, with the null hypothesis "time does not have influence on the bird count-weather relationship". Detrending was conducted to eliminate the time factor that might bias the bird counts:

ÿ_t = y_t − (α₀ + α₁t + α₂t²), ẍ_ti = x_ti − (γ₀i + γ₁i·t + γ₂i·t²),

where y_t and ÿ_t are respectively the original and detrended bird counts at time step t; x_ti and ẍ_ti are respectively the original and detrended i-th weather factor at t; and α₀, α₁, α₂, γ₀i, γ₁i and γ₂i are the regression coefficients of the fitted time trends. The multivariable linear regression model was then built with the detrended ÿ_t and ẍ_ti:

ÿ_t = Σ_{i=1}^{n} β_i·ẍ_ti,

where ÿ_t is the fitted bird count at time step t, n is the total number of weather factors and β_i is the regression coefficient.
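The detrend-then-regress procedure can be sketched in NumPy as follows, assuming a quadratic time trend (our reading of the three trend coefficients per series); the function names are ours:

```python
import numpy as np

def detrend_quadratic(series, t):
    """Remove a least-squares quadratic time trend a0 + a1*t + a2*t^2."""
    T = np.column_stack([np.ones_like(t), t, t ** 2])
    coef, *_ = np.linalg.lstsq(T, series, rcond=None)
    return series - T @ coef

def weather_regression(log_counts, weather, t):
    """Detrend the log bird counts and each weather factor, then fit a
    multivariable linear regression; returns one beta per weather factor."""
    y = detrend_quadratic(log_counts, t)
    X = np.column_stack([detrend_quadratic(weather[:, i], t)
                         for i in range(weather.shape[1])])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta
```

Because detrending is a linear projection, a count series constructed as a weather factor plus a quadratic trend recovers the factor's coefficient exactly.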