Introduction

Bird detection, including bird localization, classification, counting and density estimation, is crucial for different applications of ornithological studies, such as investigating the stopover patterns of birds1,2,3, exploring the long-term influence of climate change on the arrival behavior of migrant birds4 and identifying the habitat selection of different bird species5,6. However, such research remains challenging because manual observation by experts, which is time-consuming and labor-intensive, is still the primary method for bird detection. Consequently, the collected data are often fragmentary, of low frequency and of coarse spatial resolution7, resulting in a lack of training data. Although radar scanning8,9 and traditional computer vision techniques10,11,12 have been applied to automate bird detection, these techniques are highly influenced by the environment (e.g., trees and buildings)13,14 and ineffective in extracting the fine-grained features that experts need to distinguish bird species15. The emergence of deep learning16,17,18,19,20,21,22,23,24, specifically models for object detection25,26,27,28, has provided opportunities to enhance the efficiency and accuracy of bird detection algorithms14,29,30,31. For instance, deep learning-based semantic segmentation models have been used to detect birds and other objects (sky, cloud, forest and wind turbine) in images taken at wind farms29,30,31, and different deep learning-based object detection models (e.g., Faster R-CNN32, YOLO33 and RetinaNet34) have been evaluated for detecting birds in aerial photographs collected by an unmanned aerial vehicle14. However, these studies focus on distinguishing birds from other objects, rather than classifying different bird species. Some studies do address bird species classification; e.g., LeNet, a simple deep learning-based classifier, was used to classify hawks and crows in images35. Other training strategies have also been developed to extract fine-grained features for deep learning-based image classification15,36,37,38,39,40. Nevertheless, most of these strategies are unsuitable for the bird detection task in this project. For instance, the feature maps produced by bilinear convolutional neural networks, which generate second-order bilinear features to describe two-factor variation (e.g., "style" and "content")36,37, are extremely high-dimensional and require excessive computational memory. Moreover, the limited amount of labeled bird detection data may cause overfitting and restricts the applicability of domain-specific transfer learning39,40. In addition, objects of low spatial resolution might reduce the capacity41 of weak supervision15,34 to extract the fine-grained features.

Thus, alternatives are urgently needed to tackle the aforementioned challenges. We first designed and installed a tailor-made Green AI Camera (see Fig. 1a and "Methods") at the study site, i.e., Penfold Park, Hong Kong, China (see Fig. 1b), to automate the collection of bird videos. In addition to saving the manpower required for manual bird detection, the months-long data automatically acquired by this vision sensing device enabled comprehensive analyses of bird behavior. Next, we adopted domain randomization to enhance the accuracy of deep learning models for bird detection from the recorded video data, which is the main focus of this study. Domain randomization, which makes use of virtual objects and/or environments to augment the data variability for model training, has been found effective in building models that generalize to complex real-world circumstances42. For example, domain randomization with synthetic data has been used to train robotic manipulation for grasping specific objects43 and for learning collision avoidance44, with promising results. Inspired by these studies, we generated synthetic data by creating virtual birds (great and little egrets, shown in Fig. 2) in different environments, so as to further augment the data size and, at the same time, to ensure sufficient variation for rendering the fine-grained bird features, i.e., neck and head, that experts use to distinguish bird species. We then pretrained Faster R-CNN32 (a deep learning detection model; see Fig. 3a) with the synthetic data, i.e., virtual birds on different real-world backgrounds, and fine-tuned the model with real data collected at the test site. Based on the detection results of the continuously monitored data, we conducted analyses of bird behaviour that were practically difficult in the past owing to expensive data collection and limited labeled data. Our results not only provide further evidence to support previous studies (e.g., on nest site selection of great egrets and little egrets), but also reveal new findings (e.g., on weather influences and the daily schedule of egrets), suggesting the potential of our proposed innovation for better habitat management, conservation policies and protection measures.

Figure 1

Penfold Park, the study site: (a) the Green AI Camera used to automatically record bird videos at the study site; (b) aerial view of the study site, where the orange region indicates the field of view of the Green AI Camera; and (c) the trees at the centre of the pond, shown in the background image recorded by the Green AI Camera.

Figure 2

Application of domain randomization to enhance the accuracy of the detection models. Virtual great egret (a) and little egret (b) models were created with the open-source 3D graphics toolset Blender. Synthetic images (c) were generated by merging the virtual 3D models with 2D background images. When merging, we applied a large variety of prominent features, such as body size and pose, at different camera viewpoints and locations in the images, in order to force the models to focus on the fine-grained bird features that experts use for bird identification.

Figure 3

Object detection model used in this study for bird detection and the performance results of the related experiments: (a) Faster R-CNN, a state-of-the-art model for many object detection tasks. In this study, we selected ResNet-50 as the backbone for feature extraction (represented by the "convolutional layer" stage of the figure) and attached a Feature Pyramid Network to Faster R-CNN for effective fusion of multiscale information; (b) object detection model trained with the attention mechanism (represented by the dotted region), used for performance comparison with our proposed domain randomization-enhanced deep learning model; and (c) precision-recall curves (at IoU = 0.5) for Case 1 (baseline), Case 2 (domain randomization), Case 3 (attention) and Case 4 (attention + domain randomization).

Results

Performance evaluation of the domain randomization-enhanced deep learning model

We selected Faster R-CNN32 to detect different bird species in real images, including great egrets, little egrets and other birds (mainly black-crowned night herons). ResNet-5045 was used as the backbone for feature extraction, and a feature pyramid network (FPN)46 was applied to efficiently extract multiscale features. The real-world data were collected with the Green AI Camera (see "Methods"). The labeled data were split into 900, 100 and 110 images for the training, validation and testing sets (see "Methods"). With domain randomization, we generated 1000 synthetic images for model pretraining, and then fine-tuned the model with the 900 real training images. Figure 4 depicts an example of the detection results, and the analysed video is provided in the Supplementary Video. The domain randomization-enhanced model is capable of distinguishing and localizing bird species with high prediction scores against different backgrounds (e.g., clear sky, and partial cover by leaves and branches), achieving a mean average precision (mAP, at an intersection over union of 0.5) of 87.65% (see "Methods").

Figure 4

Example of a bird detection result. The blue, orange and green bounding boxes mark the predictions for great egrets, little egrets and other birds, respectively.

We next examined the advantages of domain randomization in augmenting the detection accuracy. Figure 5 depicts the perception achieved by the models in distinguishing bird species, based on the feature maps computed at the last layer of the ResNet-50 backbone, together with the baseline model (Faster R-CNN trained with real images only) for comparison. We observe that the domain randomization-enhanced model focuses on the subtle features of the neck and head, which are the essential fine-grained features used by experts to distinguish bird species. The baseline model, on the other hand, tends to identify bird species from the color and textural features of the body, which may not be the optimal criteria. Note that bird size, one of the features used by human experts, is not considered by the deep learning models, as the observed size changes with the distance from the camera and depth information is unavailable from the images.

Figure 5

Visualization of the perception made by the detection models in distinguishing bird species. We compared our proposed method with the same architecture trained with real images only (the baseline model). For examples of different bird species (first column), the feature maps (responses of the models) of the baseline model and the domain randomization-enhanced model are shown in the second and third columns, respectively. The viridis colormap illustrates the intensity of the response, with yellow representing the strongest response. The feature maps are overlaid on the bird images to depict the features used by the baseline model (fourth column) and the domain randomization-enhanced model (fifth column) for bird detection. We observe that pretraining with the synthetic images forces the model to focus on fine-grained features, such as neck shape and beak color, which are the crucial features used by experts for bird identification.

We further examined the effectiveness of domain randomization through quantitative evaluation. In addition to the baseline model (Faster R-CNN trained with real images only), the weakly supervised method based on the attention mechanism15 (Fig. 3b), which locates the "parts" of birds and then extracts the fine-grained features, was also used as a training strategy for comparison. We used a small subset of the real images for this comparison to highlight the strength of domain randomization under a limited amount of labeled data, a common challenge in most fine-grained visual recognition tasks39. The training set was made up of 60 real images (136 great egrets, 43 little egrets and six other birds) and 1000 synthetic images; the validation set comprised 40 real images (85 great egrets, 12 little egrets and eight other birds); and the test set comprised 40 real images (124 great egrets, 21 little egrets and 10 other birds). In summary, we considered four cases in our comparison:

1. Case 1 (baseline): Faster R-CNN, trained with the real images only.

2. Case 2 (domain randomization): Faster R-CNN, pretrained with the synthetic images and fine-tuned with the real images.

3. Case 3 (attention): Faster R-CNN with the attention mechanism, trained with the real images only.

4. Case 4 (attention + domain randomization): Faster R-CNN with the attention mechanism, pretrained with the synthetic images and fine-tuned with the real images.

We used mAP (IoU = 0.5) to evaluate the performance of all four cases on the test set (see "Methods"). The mAPs (± standard deviation) of the four cases are 0.342 ± 0.017, 0.397 ± 0.017, 0.329 ± 0.031 and 0.385 ± 0.036, respectively. For reference, the precision-recall curves (at IoU = 0.5) for all four cases are also presented (see Fig. 3c), showing the trade-off between precision and recall at different sensitivity thresholds. The domain randomization-enhanced model (Case 2) achieves the most balanced result, with better precision over a wide range of recall values between 0.11 and 0.58. Consistent with the previous analysis, Case 2 outperforms the baseline model (Case 1), confirming the effectiveness of synthetic data with sufficient variation in enabling the models to focus on the fine-grained features of the birds. Although the attention mechanism-based model achieves state-of-the-art performance in some fine-grained visual categorization tasks15, its performance is unstable in our study, as reflected by the relatively low mAP and high standard deviation of Cases 3 and 4. We attribute this failure to benefit from the attention mechanism to the restricted resolution of the regional feature maps of the birds (output of the "region of interest pooling" stage shown in Fig. 3b)39,41, as the birds are small in the original images (about 0.095% of the image size on average, see "Methods").

Analyses of the egret behaviour

Our analyses focus on the great egrets and little egrets because they are the dominant species at the study site, according to on-site monitoring and observations made during data annotation. The following analyses are based on the detection results of the videos recorded within the period 2019-09-23–2019-11-26 (about 2 months), comprising about 100 terabytes of data.

Figure 6a presents the daily counts of all birds, great egrets and little egrets during the study period based on the detection results. The daily counts (vertical axis of Fig. 6a) are inferred from the maximum number of birds detected in a single frame per day and presumed to be the number of birds staying at the study site. The great egret counts increase and then decrease during the period 2019-09-23–2019-09-30, and this trend recurs in the periods 2019-09-30–2019-10-13, 2019-10-13–2019-10-25, 2019-10-25–2019-11-10 and 2019-11-10–2019-11-23. This observation is consistent with the weekly counts of great egrets in Hong Kong between 1958 and 1998, as documented by the Hong Kong Bird Watching Society47. We also notice a surge in great egret counts between the end of October and the beginning of November, which can be explained by the influx of the migratory population during this period47. For the little egrets, our analysis shows a trend in counts similar to that of the great egrets, with a Pearson correlation coefficient of 0.89, suggesting highly correlated migratory activities between the two species. While supporting the existence of a migratory population of little egrets47,48, this finding also motivates further studies of the migratory behaviour of, and the interaction between, the two egret species.
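Deriving these quantities from per-frame detections takes only a few lines of analysis code. The sketch below is a minimal illustration, assuming a hypothetical `detections.csv` with one row per analysed frame; it is not the authors' analysis script.

```python
# Minimal sketch: daily counts and species correlation from per-frame
# detections. `detections.csv` is a hypothetical file with columns
# timestamp, n_great_egret and n_little_egret (one row per frame).
import pandas as pd

df = pd.read_csv("detections.csv", parse_dates=["timestamp"])

# Daily count = maximum number of birds detected in a single frame per day,
# presumed to be the number of birds staying at the site on that day.
daily = (df.set_index("timestamp")
           .resample("D")[["n_great_egret", "n_little_egret"]]
           .max())

# Pearson correlation between the daily counts of the two species
# (the study reports r = 0.89).
r = daily["n_great_egret"].corr(daily["n_little_egret"])
print(f"Pearson r = {r:.2f}")
```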

Figure 6

Analyses of the egret behaviour from the temporal perspective: (a) daily bird counts during the study period. The bar graphs show the counts of all birds, great egrets and little egrets. The bird counts increase from 2019-09-23 to 2019-09-27 and then decrease from 2019-09-27 to 2019-09-30; this trend recurs in the periods 2019-09-30–2019-10-13, 2019-10-13–2019-10-25, 2019-10-25–2019-11-10 and 2019-11-10–2019-11-23. These sub-periods are separated by black lines in the figure. We also observe a surge in bird counts between the end of October and the beginning of November, which is believed to be attributable to the presence of the migratory population; (b) daily schedule of the great and little egrets. The departure and return times of the egrets are estimated from the peak hourly counts in the morning and evening, respectively.

We scaled down the monitoring duration to observe the daily schedule of the egrets. For each individual day, we calculated the backward 1-h moving average of the bird counts at each time point, then took the time points with the maximum average values in the morning and evening as the departure and return times of the egrets, respectively. The departure times of the great egrets and little egrets are similar throughout the study period (from 04:33 to 07:31, see Fig. 6b), supported by hypothesis test HT1 (p-value = 0.296; insufficient evidence to reject the null hypothesis that the departure times are the same; see "Methods"). On the other hand, the little egrets return later than the great egrets on most days (from 16:03 to 19:10, see Fig. 6b), with a reported p-value of 0.098 (marginally significant for rejecting the null hypothesis that the little egrets return earlier than or at the same time as the great egrets, in HT2). We believe that the daily schedules of the egrets are highly related to prey availability at their different preferred foraging habitats (e.g., in Hong Kong, little egrets mostly forage at commercial fishponds whereas great egrets forage at mudflats49,50).
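The moving-average procedure can be sketched as below. The input layout (a hypothetical `hourly_counts.csv` of timestamped great egret counts) and the 12:00 split between morning and evening are our assumptions for illustration, not details from the study.

```python
# Minimal sketch: estimate daily departure/return times from the backward
# 1-h moving average of the bird counts. Data layout is hypothetical.
import pandas as pd

counts = pd.read_csv("hourly_counts.csv", parse_dates=["timestamp"],
                     index_col="timestamp")["n_great_egret"]

smoothed = counts.rolling("1h").mean()  # time-based windows look backward

for day, s in smoothed.groupby(smoothed.index.date):
    morning = s.between_time("04:00", "12:00")  # 12:00 split is an assumption
    evening = s.between_time("12:00", "20:00")
    if not (morning.empty or evening.empty):
        print(day, "departure:", morning.idxmax().time(),
              "return:", evening.idxmax().time())
```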

In addition to the temporal perspective, we analysed the spatial distribution of egrets at the study site via heatmap visualization (see "Methods"). We observe some horizontal overlap between the great egrets and little egrets in the middle right regions of the trees (see Fig. 7a). However, in terms of elevation, the coverage of the great egrets is wider, i.e., from the middle to the top of the trees (Fig. 7a), whereas the little egrets mainly stay around the middle region of the trees (Fig. 7a). This pattern of vertical stratification is, to a certain extent, consistent with the observations of other studies51,52,53, in which birds align themselves vertically according to their actual body size (not the body size represented in images, which is influenced by the distance to the camera). Furthermore, we split the testing periods according to the trends observed in the bird counts (Fig. 6a) to identify the spatial change of the egret hotspots (Fig. 7b). For the great egrets, the hotspot near the middle right region of the trees remains relatively constant in size and intensity, whereas the hotspots at the top of the trees (vulnerable to wind) shrink and grow over time; similar changes are observed for the little egrets. Based on the aerial view (Fig. 1b), the hotspots (at the middle right region of Fig. 7b) are sheltered from winds and gales in the north/south direction, which renders this location a favorable habitat for egrets to inhabit, breed and hatch.

Figure 7

Spatial distribution of the great egrets (red) and little egrets (blue) at the study site over (a) the entire study period; and (b) the testing periods split based on the results observed in Fig. 6a, to observe the associated changes. The intensity of a hotspot reflects the bird counts in a region, which can be used as an indicator to study the nest site selection and habitat preferences of different bird species.

We attempted to probe the relationship between weather factors and bird activities (reflected by the bird counts) by building a multivariable linear regression model. The weather data were collected from the Hong Kong Observatory Open Data54. We used the natural logarithm of the total bird counts to reflect their proportional change over the study period. We detrended the data based on hypothesis test HT3 (p-value = 0.025, which is statistically significant for rejecting the null hypothesis that time has no influence on the regression; see "Methods"). Table 1 summarizes the multivariable linear regression analysis. The influence of the combined weather factors is highly statistically significant (p-value of the F-test = 0.002), so we undertook further analysis of each factor. Total Bright Sunshine (p-value = 0.031) is statistically significant in affecting bird activities, consistent with previous studies reporting that sunlight influences migration, foraging and other activities of birds55,56,57. Prevailing Wind Direction (EW, p-value = 0.011) is also statistically significant, in line with reports that wind has a negative effect on nestling survival51. Although several studies have suggested that temperature and humidity might play crucial roles in bird behaviour58,59 (e.g., varying air temperature and humidity might influence water temperature, possibly altering the activity levels of forage fishes, which in turn affects the foraging efficiency of egrets60), our results in Table 1 show that Daily Mean Temperature (p-value = 0.080) and Mean Relative Humidity (p-value = 0.089) are only marginally significant. While our results support the influence of some weather factors on bird behaviour, more rigorous analysis (e.g., collecting in situ measurements with a local weather station) is required to validate the hypothesis.

Table 1 Multivariable linear regression analysis.

Discussion

Here we leverage domain randomization to enhance the accuracy of bird detection models. We create synthetic data, i.e., virtual egrets merged onto real backgrounds, to tackle the lack of labeled bird data. Importantly, we demonstrate that pretraining deep learning models with synthetic data of sufficient variation forces the models to focus on the fine-grained features of the great egrets and little egrets, which are the crucial features used by human experts for bird detection. This could be useful for detecting other bird species with limited labeled data whose features are highly similar (e.g., species within the same family).

We explore the application of domain randomization-enhanced deep learning models based on the 2-month monitoring data of one testing site. Our findings indicate multiple potential advantages over conventional monitoring, which relies on visual observation and human endurance. Our deep learning-based object detection enables extremely intensive surveys of the bird counts of different species, behaviour, spatial preferences and inter- and intra-specific interactions under different weather conditions (12 weather factors were used in this study). For instance, an influx in the total bird counts might indicate the presence of migratory birds, which could be useful for studying bird migratory behaviour and patterns. This new technology could therefore be applied at important breeding, stopover or wintering sites, e.g., Ramsar wetlands or Important Bird and Biodiversity Areas (IBAs), to monitor bird numbers and arrival and departure times. A camera of higher speed and resolution could further allow assessment of diets and the search time for prey items. Ornithologists could also examine the hotspots of the bird intensity maps and the influence of different weather factors (e.g., wind intensity, relative humidity, temperature) to investigate the nest site selection and habitat preferences of certain bird species across different periods, e.g., during the breeding season. Furthermore, the daily record of bird departure and return times facilitates the planning of environmental monitoring and audit: construction activities, which inevitably create noise, could be scheduled at periods with the fewest birds, thereby minimizing disturbance to birds during inhabiting, foraging and breeding activities. Following this study, the established framework can be applied to other sites for monitoring the same or different bird species, so as to carry out more thorough investigations and obtain more conclusive evidence on bird behaviour. Datasets of different bird species can also be continuously collected and annotated to augment the training set, and more sophisticated deep learning models and training strategies can be implemented to further enhance the model accuracy. To summarise, this study presents a paradigm shift from conventional bird detection methods. The domain randomization-enhanced deep learning detection models, together with the Green AI Camera, enable wide-scale, long-term and end-to-end bird monitoring, in turn informing better habitat management, conservation policies and protection measures.

Methods

Data collection and annotation

We developed a tailor-made Green AI Camera (see Fig. 1a). The term "Green" refers to applications related to environmental monitoring and ecological conservation, and the term "AI" (Artificial Intelligence) refers to the combination of machine learning and deep learning algorithms used to analyse the huge amounts of data generated by this data collection system. The Green AI Camera exhibits several advantages that meet the needs of long-term outdoor monitoring: (i) automatic continuous recording; (ii) high-resolution videos; (iii) high frame rates; (iv) large local data storage; and (v) protection against harsh environments (e.g., extreme weather conditions). As shown in Fig. 1a, six cameras of 4K resolution are used to continuously record videos with a wide field of view. The videos are acquired at 25 frames per second with High Efficiency Video Coding (H.265) to preserve the visual quality of the high-resolution inputs. The external part of the camera is an optical window composed of two parallel curved surfaces, which protects the electronic sensors and detectors while ensuring a clear and distortion-free field of view.

In this project, we installed the Green AI Camera at Penfold Park, Hong Kong, China (22° 23′ 57.22″ N, 114° 12′ 24.42″ E; see Fig. 1b for the aerial view of the study site) to monitor the bird behaviour at the trees in the centre of a pond (see Fig. 1c for the recorded background image). The dominant bird species found at this park are great egrets (Egretta alba) and little egrets (Egretta garzetta). Other birds include black-crowned night herons (Nycticorax nycticorax), Chinese pond herons (Ardeola bacchus) and eastern cattle egrets (Bubulcus ibis)7. The data used in this study covered two periods, i.e., 2019-05-13–2019-09-13 and 2019-09-23–2019-11-26 (62 days); the recording time for each day was from 04:00 to 20:00, as the use of infrared is prohibited at the study site (which is completely dark at night). For the first period, we annotated 900 images from 2019-05-13–2019-06-08 for model training, 100 images from 2019-08-01–2019-08-17 for model validation, and 110 images from 2019-08-18–2019-09-13 for model testing. The second period of 2-month continuous monitoring data was used for the analyses of the egret behaviour, based on the egrets detected by the trained model. We annotated images by drawing bounding boxes around the observed birds with three labels: great egrets, little egrets and other birds (mostly black-crowned night herons); the bird counts for the three labels were 2591, 1401 and 372, respectively. Our labels were reviewed by ornithologists for verification. The image dimension was 2139 × 1281 pixels; both the length and width of the bird bounding boxes were within the range of 42 to 160 pixels, and the average and maximum bird sizes were ~ 0.095% and 2.94% of the image, respectively.
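The text does not specify the annotation storage format; the record below is a hypothetical, COCO-like layout consistent with the stated image dimension and box sizes, shown only to make the annotation structure concrete.

```python
# Hypothetical annotation record (the actual format used is not specified).
annotation = {
    "image": "frame_2019-05-20_07-15-00.jpg",  # hypothetical file name
    "width": 2139, "height": 1281,             # image dimension from the text
    "boxes": [
        # [x_min, y_min, x_max, y_max] in pixels; side lengths lie in 42-160 px
        {"bbox": [1020, 455, 1101, 580], "label": "great_egret"},
        {"bbox": [1400, 610, 1452, 676], "label": "little_egret"},
    ],
}
```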

Domain randomization

Domain randomization enables the generation of 3D models with desired features and their rendering onto specific 2D backgrounds. We used the open-source 3D graphics toolset Blender (v2.79) to create virtual birds, i.e., a great egret and a little egret. Other birds, such as black-crowned night herons, were not modelled, as their features are highly distinctive compared with those of egrets. When generating the virtual images, we applied a large variation of the prominent features (e.g., pose, and body size as presented in images due to varying distances from the camera) and of environmental and operational conditions (e.g., lighting condition, camera viewpoint and bird location in images), which in turn forced the models to focus on the fine-grained bird features. We used inverse kinematics to adjust the armature (bones) of the birds to create different poses. We then carefully selected background images containing no real egrets for creating the synthetic images, and merged the 3D models onto the 2D backgrounds by pasting the virtual egrets, using Gaussian paste, Poisson paste and direct paste, at any reasonable location in the images. When pasting, the bird size distribution was set as uniform, ranging from 0.04% to 0.56% of the background dimension. Other attributes included lighting sampled from a uniform distribution between 0.6 and 1.0 of the maximum light value of Blender, and camera viewpoints sampled from a uniform joint distribution over the three Euler angles. All these procedures were implemented as a Python plug-in for Blender. A total of 1000 synthetic images were created for model pretraining.
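The merging step can be sketched with OpenCV as below. This is an illustrative re-implementation, not the authors' Blender plug-in: file names are hypothetical, and we interpret "direct paste" as plain alpha compositing, "Gaussian paste" as compositing with a Gaussian-blurred mask, and "Poisson paste" as OpenCV's seamless (Poisson) cloning.

```python
# Illustrative sketch of the three paste modes used to merge a rendered
# virtual egret (an RGBA image exported from Blender) onto a real background.
import cv2
import numpy as np

background = cv2.imread("background.jpg")                     # egret-free frame
bird = cv2.imread("virtual_egret.png", cv2.IMREAD_UNCHANGED)  # RGBA render

x, y = 900, 600                      # paste location (randomized elsewhere)
h, w = bird.shape[:2]
rgb = bird[:, :, :3].astype(np.float32)
alpha = bird[:, :, 3].astype(np.float32) / 255.0              # (h, w) mask

def composite(bg, fg, mask):
    """Alpha-composite `fg` onto `bg` at (x, y) with a per-pixel `mask`."""
    out = bg.copy()
    roi = out[y:y + h, x:x + w].astype(np.float32)
    out[y:y + h, x:x + w] = (mask[..., None] * fg +
                             (1 - mask[..., None]) * roi).astype(np.uint8)
    return out

direct = composite(background, rgb, alpha)                    # direct paste
gaussian = composite(background, rgb,
                     cv2.GaussianBlur(alpha, (11, 11), 0))    # Gaussian paste

# Poisson paste: seamlessClone solves the Poisson blending equations.
mask255 = (alpha > 0.5).astype(np.uint8) * 255
center = (x + w // 2, y + h // 2)
poisson = cv2.seamlessClone(rgb.astype(np.uint8), background,
                            mask255, center, cv2.NORMAL_CLONE)
```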

Network architecture and training details

We selected Faster R-CNN32 (Fig. 3a) for object detection, owing to its satisfactory performance in similar tasks14. Faster R-CNN is a two-stage detector. In the first stage, it extracts feature maps from the input image using a convolutional backbone (the "convolutional layer" stage in Fig. 3a) and proposes potential image regions containing target objects with a region proposal network. In the second stage, based on the proposed regions, it extracts the bounding box and category information of objects using a region of interest head. In this study, ResNet-5045, whose residual connections ease the training of deep networks, was used as the backbone. Furthermore, we adopted a Feature Pyramid Network46, which uses a top-down structure with lateral connections to produce feature maps at all scales, to fuse the multiscale information and thereby enhance the detection of small objects (birds occupy only small areas of the recorded images). The overall architecture of our model was similar to that of RetinaNet34, except that cross entropy was used as the classification loss instead of the focal loss. The focal loss, which tackles the problem of imbalanced datasets, was not needed, as the synthesized dataset balanced the proportions of great egrets and little egrets.
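This architecture maps directly onto torchvision's reference implementation of Faster R-CNN; the snippet below is an approximation under that assumption, not the authors' code.

```python
# Approximate sketch of the detector: Faster R-CNN with a ResNet-50 + FPN
# backbone, as provided by torchvision. Four classes: background, great
# egret, little egret and other birds. The box classifier of this reference
# implementation is trained with cross entropy, matching the text.
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=4)
```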

We applied stochastic gradient descent as the optimizer for model training, with a weight decay of 0.0001, a momentum of 0.9 and a learning rate of 0.0005. We first pretrained the Faster R-CNN with the synthetic images and then fine-tuned the pretrained model with the real images. Training was deployed on two GeForce RTX 2080 Ti GPUs with a batch size of two per GPU. For comparison, we used the attention mechanism15 to build similar object detection models (see the dotted region in Fig. 3b) under the same training settings.
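Continuing from the model sketch above, a minimal version of the two-phase schedule follows: SGD with the quoted hyperparameters, pretraining on the synthetic set and fine-tuning on the real set. The data loaders and epoch counts are placeholders, as they are not given in the text.

```python
# Minimal sketch of the two-phase training schedule. `synthetic_loader` and
# `real_loader` are placeholder torch DataLoaders yielding (images, targets).
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.0005,
                            momentum=0.9, weight_decay=0.0001)

def run_epochs(loader, epochs):
    model.train()
    for _ in range(epochs):
        for images, targets in loader:
            loss_dict = model(images, targets)  # torchvision returns a loss dict
            loss = sum(loss_dict.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

run_epochs(synthetic_loader, epochs=12)  # pretraining on synthetic images
run_epochs(real_loader, epochs=12)       # fine-tuning on real images
```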

Model evaluation metrics

We adopted the commonly used mean average precision (mAP) to evaluate the model performance. The mAP metric jointly takes precision, recall and intersection over union (IoU) into consideration: precision measures how relevant the predictions are among all retrieved instances; recall reflects the fraction of relevant instances that are actually retrieved; and IoU is the ratio of the intersection to the union of the predicted and ground-truth bounding boxes. For a specific IoU threshold, mAP is computed by averaging the precision over recall values from 0 to 1 and over all classes. For all analysed cases, we trained the model ten times and ran inference with each of the trained models; the mAP (IoU = 0.5) is reported as "mean ± standard deviation". We also plotted the precision-recall curves (at IoU = 0.5) for all four cases.
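The sketch below illustrates the IoU criterion underpinning the metric; in practice the full mAP computation is usually delegated to standard tooling (e.g., the COCO evaluation API), which is our assumption rather than a detail confirmed by the text.

```python
# Minimal sketch of the IoU criterion used by the mAP metric.
def iou(box_a, box_b):
    """IoU of two boxes given as [x_min, y_min, x_max, y_max]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prediction is a true positive when IoU >= 0.5 with an unmatched
# ground-truth box of the same class; average precision then averages
# precision over recall, and mAP averages that over classes.
print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 ≈ 0.143
```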

Perception made by models for bird detection

We attempted to visualize the feature maps produced by the deep learning models to shed light on how the models localize and classify different bird species. However, noise is present in the feature maps. Such noise lies in the null space of the matrix of the affine transformation operator that follows these feature maps in the network; it is mapped to zero by the affine transformation and eventually ignored by the network. Therefore, to visualize the feature maps without the noise, we split the features into the row space and the null space of the aforementioned matrix and extracted the row-subspace features to visualize the model-related information61. Supposing that the matrix of the affine transformation is \(A\), the feature maps are \(x\), the coordinates of \(x\) in the row subspace are \(\widehat{x}\), the feature maps in the row subspace are \({x}_{r}\) and the feature maps in the null subspace are \({x}_{n}\), the decomposition is performed by the following equations61:

$$A{A}^{T}\widehat{x}=Ax$$
(1)
$${x}_{r}={A}^{T}\widehat{x}$$
(2)
$${x}_{n}=x-{x}_{r}$$
(3)

After extracting the row-subspace features, the dimension of the remaining feature maps was still in the hundreds, which is difficult to visualize. Therefore, we applied principal component analysis (PCA) to reduce the dimension of the remaining feature maps to three and then used a weighted average of these three dimensions for visualization. As the first dimension usually carries the most important information, we applied the heaviest weight to the first dimension, followed by the second and third dimensions. The weights used here are the coefficients used to convert RGB images to grayscale images:

$$V=0.7152{V}_{1}+0.2126{V}_{2}+0.0722{V}_{3}$$
(4)

where \(V\) is the weighted feature map, and \({V}_{1}\), \({V}_{2}\) and \({V}_{3}\) are the first, second and third dimensions after applying PCA.
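The decomposition and visualization can be sketched numerically as below; the array shapes are illustrative rather than taken from the network.

```python
# Numerical sketch of Eqs. (1)-(4): project features onto the row space of
# the affine operator A, reduce to three PCA components and weight them.
import numpy as np
from sklearn.decomposition import PCA

def row_space_features(A, x):
    """Solve A A^T x_hat = A x (Eq. 1), then x_r = A^T x_hat (Eq. 2)."""
    x_hat = np.linalg.solve(A @ A.T, A @ x)
    return A.T @ x_hat          # x_n = x - x_r (Eq. 3) is the discarded noise

rng = np.random.default_rng(0)
A = rng.normal(size=(64, 256))       # affine operator following the features
x = rng.normal(size=(256, 32 * 32))  # channels x spatial locations (example)

x_r = row_space_features(A, x)

pca = PCA(n_components=3)
V = pca.fit_transform(x_r.T)         # one 3-vector per spatial location
v = 0.7152 * V[:, 0] + 0.2126 * V[:, 1] + 0.0722 * V[:, 2]  # Eq. (4)
heatmap = v.reshape(32, 32)          # back to the feature-map grid
```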

Heatmap showing the spatial distribution of egrets

Heatmaps were created to visualize the spatial distribution of egrets based on random field theory. We first partitioned the recorded video frames into cells of 200 × 200 pixels. For each cell, we applied spatial smoothing to estimate the count of the \({k}^{th}\) bird species (\(k=1\) for the great egret, \(k=2\) for the little egret) at time \(t\):

$${x}_{t,{p}_{i},k}={\sum }_{j=1}^{n}{e}^{-\upbeta d\left({p}_{i},{p}_{j}\right)}c\left(t,{p}_{j},k\right)$$
(5)

where \({x}_{t,{p}_{i},k}\) is the count of the \({k}^{th}\) bird species in cell \({p}_{i}\) at time \(t\), after spatial smoothing over all \(n\) cells; \(d\left({p}_{i},{p}_{j}\right)\) is the Euclidean distance between the central points of cells \({p}_{i}\) and \({p}_{j}\); \(c\left(t,{p}_{j},k\right)\) is the number of birds located in cell \({p}_{j}\); and \(\upbeta\) is a smoothing constant, satisfying \(\upbeta \ge 0\). Following that, we applied exponential smoothing to \({x}_{t,{p}_{i},k}\):

$${s}_{t,{p}_{i},k}=\uplambda {x}_{t,{p}_{i},k}+\left(1-\uplambda \right){s}_{t-1,{p}_{i},k}$$
(6)

where \({s}_{t,{p}_{i},k}\) is the count of the \({k}^{th}\) bird species in cell \({p}_{i}\) at time \(t\) after spatial and temporal smoothing; and \(\uplambda\) is a smoothing constant, with \(0\le \uplambda \le 1\). After computing all \({s}_{t,{p}_{i},k}\), we averaged them over time to create the heatmaps within a specified period.
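A compact numpy sketch of Eqs. (5) and (6) follows; the values of β and λ are illustrative, as the study does not report them.

```python
# Sketch of the heatmap smoothing: exponential-distance spatial smoothing
# (Eq. 5) followed by exponential temporal smoothing (Eq. 6).
import numpy as np

def spatial_smooth(counts, centers, beta=0.05):
    """x[i] = sum_j exp(-beta * d(p_i, p_j)) * c[j], over all n cells."""
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    return np.exp(-beta * d) @ counts

def heatmap(count_frames, centers, beta=0.05, lam=0.3):
    """Apply Eq. (5) per frame, Eq. (6) over time, then average over time."""
    s = spatial_smooth(count_frames[0], centers, beta)
    acc, n = s.copy(), 1
    for c in count_frames[1:]:
        s = lam * spatial_smooth(c, centers, beta) + (1 - lam) * s
        acc += s
        n += 1
    return acc / n
```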

Statistical analyses

We computed the Pearson correlation coefficient to quantify the correlation between the counts of the great egrets and little egrets. The daily schedules of the great egrets and little egrets were studied with two hypothesis tests using the following null hypotheses: (i) the departure times of the great egrets and little egrets are the same (HT1, tested with a two-tailed Student's t-test); and (ii) the return time of the great egrets is equal to or later than that of the little egrets (HT2, tested with a one-tailed Student's t-test). A significance level of 0.05 was chosen for all hypothesis tests. We also built a multivariable linear regression model to study the weather influence on bird activities. Prior to that, a hypothesis test (HT3) was conducted with a two-tailed Student's t-test to examine whether data detrending was required, with the null hypothesis that "time does not influence the bird count-weather relationship". Detrending was conducted to eliminate the time factor that might bias the bird counts:

$${\mathrm{ln}}\left({\ddot{y}}_{t}\right)={\mathrm{ln}}\left({y}_{t}\right)-{\widehat{\alpha }}_{0}-{\widehat{\alpha }}_{1}t-{\widehat{\alpha }}_{2}{t}^{2}$$
(7)
$${\ddot{x}}_{ti}={x}_{ti}-{\widehat{\gamma }}_{0i}-{\widehat{\gamma }}_{1i}t-{\widehat{\gamma }}_{2i}{t}^{2}$$
(8)

where \({y}_{t}\) and \({\ddot{y}}_{t}\) are the original and detrended bird counts at time step \(t\), respectively; \({x}_{ti}\) and \({\ddot{x}}_{ti}\) are the original and detrended \(i\)th weather factors at \(t\); and \({\widehat{\alpha }}_{0}\), \({\widehat{\alpha }}_{1}\), \({\widehat{\alpha }}_{2}\), \({\widehat{\gamma }}_{0i}\), \({\widehat{\gamma }}_{1i}\) and \({\widehat{\gamma }}_{2i}\) are regression coefficients. The multivariable linear regression model was then built with the detrended \({\ddot{y}}_{t}\) and \({\ddot{x}}_{ti}\):

$${\mathrm{ln}}\left({\widehat{\ddot{y}}}_{t} \right)=\sum_{i=1}^{n}{\widehat{\beta }}_{i}{\ddot{x}}_{ti}$$
(9)

where \({\widehat{\ddot{y}}}_{t}\) is the fitted bird count at time step \(t\), \(n\) is the total number of weather factors and \({\widehat{\beta }}_{i}\) are the regression coefficients.
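The detrending and regression steps can be sketched with statsmodels as below. The data layout and the four weather-factor column names are assumptions for illustration (the study used 12 factors); this is not the authors' analysis script.

```python
# Sketch of Eqs. (7)-(9): remove quadratic time trends, then regress the
# detrended log bird counts on the detrended weather factors.
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("daily_counts_weather.csv")       # hypothetical daily table
t = np.arange(len(df), dtype=float)
T = sm.add_constant(np.column_stack([t, t ** 2]))  # [1, t, t^2] trend design

def detrend(series):
    """Residuals after removing a quadratic time trend (Eqs. 7 and 8)."""
    return sm.OLS(series, T).fit().resid

ln_y = detrend(np.log(df["total_birds"]))
X = df[["sunshine", "wind_ew", "temperature", "humidity"]].apply(detrend)

fit = sm.OLS(ln_y, X).fit()                        # Eq. (9), no intercept
print(fit.summary())                               # per-factor t-tests
print("joint F-test p-value:", fit.f_pvalue)       # combined weather effect
```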