Main

In our rapidly changing world, continuous monitoring of natural systems is essential to understand and mitigate the impact of human activity on ecological processes1,2,3. Recent technological innovations now allow for the rapid collection of ecological data across vast spatial and temporal scales. However, the resulting information deluge creates a bottleneck for researchers, who must process the data at management-relevant timescales4. Artificial intelligence (AI) offers promising solutions for rapid and high-accuracy data processing5,6. Yet the dynamic nature of ecological systems poses unique challenges when developing accurate algorithms7,8. To overcome these hurdles, we showcase how the integration of limited human labour into the machine learning workflow can greatly increase both the efficiency and accuracy of data processing.

Long-term camera trapping

We are currently experiencing a rapid, human-driven loss of global biodiversity9,10,11,12. To understand the complex patterns, drivers and consequences of species declines and extinctions, ecologists are increasingly employing emerging technology to assist with data collection and processing. Motion-activated remote cameras (henceforth ‘camera traps’) have emerged as a popular non-invasive tool for monitoring terrestrial vertebrate communities13,14,15. Their decreasing cost and greater reliability have recently led to the application of camera traps for long-term, continuous deployment aiming to monitor entire wildlife communities across multiple seasons and years1,16,17,18. Compared with one-time or annual surveys, continuous monitoring reveals new insights into wildlife responses to local, regional and global environmental changes, as well as to conservation interventions. This scale of monitoring is particularly valuable for capturing responses to environmental perturbations as they occur1,2. The ‘Snapshot Serengeti’ project (http://www.snapshotserengeti.org), which has operated continuously since 2010, is a flagship example of a long-term camera-trap monitoring programme. Over the past decade, this survey has gathered unprecedented longitudinal data that have substantially enhanced our understanding of the seasonal and inter-annual dynamics of the Serengeti ecosystem16,19,20. Projects of this magnitude have become increasingly common across eastern and southern Africa18 and around the world1.

The greatest logistical barrier to long-term monitoring with camera traps is the overwhelming amount of human labour needed to annotate thousands or millions of wildlife images for ecological analysis4,5,16. This annotation bottleneck creates a considerable mismatch between the paces of data collection and data processing, substantially curtailing the usefulness of camera-trap data for ongoing conservation and monitoring efforts4. For example, a relatively modest camera-trap survey (~80 camera traps1) captures millions of images a year. We estimate that it would take a single trained expert around 200 full-time working days to annotate one million images. As such, hundreds of human annotators (for example, experts, trained volunteers and citizen scientists) are required to keep pace with image accumulation. This need is likely to grow exponentially over the coming decades as more monitoring sites are set up. Although only one or two experts are needed to validate each wildlife image, it is common practice that multiple (5–20) volunteers or citizen scientists look at each image to produce a high-accuracy ‘consensus’ classification (~97% accurate compared to expert identifications16). This duplication of effort needed to generate accurate results using volunteers further perpetuates the classification bottleneck.

Automatic image-recognition systems

The use of deep learning (a subset of AI technology) to automatically identify animals in camera-trap images has recently drawn considerable attention from the ecological community. Currently, trained deep learning algorithms can classify a million images in a single day running on a desktop computer, a substantial advancement over the months of effort required for human annotators to accomplish the same task5,21,22.

There have been several attempts to develop robust camera-trap recognition methods for real-world deployment, either tackling distribution shifts (in species numbers and locations) with transfer learning23,24,25 or addressing new species emergence with active learning26,27,28. However, before it becomes feasible to rely on deep learning to handle the mass of image data from large-scale, long-term camera-trap projects, two major impediments must be overcome: (1) accounting for temporal changes in species composition at study sites due to migration, invasion, reintroduction and extinction and (2) handling the long-tailed distribution of records across species (that is, extreme imbalance in the number of images of different species; Extended Data Fig. 1). As discussed in the following, these issues limit the ability of current AI to accurately recognize species that are of notable interest to conservation practitioners.

Changing species composition

A novel challenge for long-term surveys is that new species may be detected on cameras in subsequent seasons or years, either because the species are rare and went undetected in previous survey periods29 or because they are new to the system. Additionally, the species composition of ecological systems naturally varies over time through the process of succession30. Novel species are often of particular conservation concern, as they may represent recolonizing populations31, reintroduced animals32 or harmful invasive species33,34.

In conventional deep learning, researchers focus on performance on existing test data while ignoring potential future changes in data composition35. In other words, deep learning models typically require the number of categories in a dataset to be fixed (that is, the dataset is static), whereas long-term camera-trap datasets are not constrained to a certain number of species (they are dynamic).

Fine-tuning models through transfer learning is currently the best solution when new species populate a study area36. However, this process requires full annotation of newly collected datasets and therefore a considerable amount of additional human effort, which defeats the purpose of using deep learning to reduce manual labour in long-term camera-trap monitoring.

Data from wildlife communities are long-tailed

Wildlife communities typically contain many individuals of several common species and few individuals of many rare species, resulting in camera-trap data with a long-tailed distribution. For example, in the dataset used for the project from Gorongosa National Park, Mozambique, ~50,000 images (>60% of animal images) are of baboons, warthogs and waterbucks, while only 22 images are of pangolins (a rare and protected species). This imbalance creates performance inconsistencies, because deep learning success is derived from balanced training datasets (for example, ImageNet37). For the Gorongosa dataset, a traditional deep learning approach resulted in only 60% accuracy for a category with 41 images (serval) versus 88.8% accuracy for a species with 17,938 images (waterbuck). This is a major issue, because animals of particular conservation concern are typically rare38, producing fewer images and therefore worse classification accuracy than for common species. If such species are always misclassified, the practical benefits of AI are limited.

An iteratively updating recognition system

To overcome these two major issues of (1) changing species community composition and (2) long-tailed species distributions, we designed a deep learning recognition framework that is updated iteratively using limited human intervention. Human annotation is needed whenever images of species novel to the AI model appear in the data. Our goal, therefore, is to minimize the need for human intervention by applying human annotation only to difficult images or novel species, while maximizing the recognition accuracy achieved by each model update procedure (that is, the update efficiency).

Traditionally, a deep learning model is applied to new batches of unannotated data collected during each time period to predict species classes. In our approach, we actively flag images that our model predicts with low confidence as novel or unknown species. These low-confidence predictions are then selected for human annotation, while high-confidence predictions are accepted as accurate and used as pseudo-labels for future model updates. The model is then updated (that is, retrained) based on both human annotations and pseudo-labels. To accommodate changing species communities, this procedure of active annotation and model update repeats each time new data are added to the collection (Fig. 1). To address the long-tailed distribution, we use the open long-tailed recognition (OLTR)7 method to balance learning between abundant and scarce species. This component can reduce the number of low-confidence predictions for scarce species.

Fig. 1: Overview of a realistic animal classification system.

a, The dynamic recognition loop. In real-world applications, machine learning models do not stop at one training stage. As data collection progresses over time, there is a continuous cycle of inference, annotation and model updating. Every time a tranche of new data is added, pre-trained models are applied to classify the data. When there are novel and difficult samples, human annotation is required and the model needs to be updated to reflect the newly added data. b, The progression of a realistic animal classification system. Even if the trained model has high accuracy for the previous validation sets, there may be a difference in the classes between previous validation sets and current inference data (for example, there may be novel categories in the newly collected data that did not exist in previous training and validation sets). Models thus need to be updated over time. Here we present a more practical procedure that can both maximize the utility of modern image-recognition methods and minimize the dependence on manual annotation for model updating. We incorporate an active learning technique that actively selects low-confidence predictions for further human annotation, while keeping highly confident predictions as pseudo-labels. Models are then updated according to both human annotations and pseudo-labels. Ann., annotation; CNN, convolutional neural network; C_Novel, number of novel classes at time step T_n; N, total number of classes at time step T_(n − 1); T, time step.

As a case study, using this new method we trained a model on a camera-trap dataset collected from Gorongosa National Park, Mozambique (details are provided in the Methods), and produced substantially improved model update efficiency over traditional transfer learning approaches. Specifically, using our approach, more than 80% human effort was saved on annotating new data, without sacrificing classification performance.

The dynamic nature of our algorithm maximizes learning and recognition efficiency by taking the best from both humans and machines within a synergistic collaboration, providing a framework that can be practically deployed for long-term camera-trap monitoring studies.

Iterative human and automated identification

In this section, we provide an overview of our algorithm, the data specifications and the experimental settings.

Algorithm overview

Our approach has two major components: (1) active selection with humans in the loop and (2) model update using active data annotations. In each time period when new data are collected, image categories are predicted, with corresponding confidence levels, by deep learning models trained in previous periods. The model actively picks out low-confidence predictions for human annotation, while we accept high-confidence predictions as accurate, without further human verification. These predictions are used as pseudo-labels and included in the final dataset for further model updates or ecological analyses. The model is then updated (retrained) using both pseudo-labels and the newly acquired human annotations (implementation details are provided in the Methods).
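This per-period workflow can be summarized in a short Python sketch. The helper callables (predict_with_confidence, request_human_annotation and retrain) and the record fields are placeholders introduced here for illustration only; they are not the authors' released code.

```python
def process_new_period(model, new_images, labelled_pool,
                       predict_with_confidence, request_human_annotation, retrain):
    """One iteration of the recognition loop: classify newly collected images,
    send low-confidence predictions to human annotators, keep high-confidence
    predictions as pseudo-labels and retrain on both."""
    predictions = predict_with_confidence(model, new_images)

    low_conf = [p for p in predictions if not p["confident"]]      # novel/difficult samples
    high_conf = [p for p in predictions if p["confident"]]         # trusted predictions

    human_labels = request_human_annotation(low_conf)              # limited human effort
    pseudo_labels = [(p["image"], p["label"]) for p in high_conf]  # machine pseudo-labels

    labelled_pool.extend(human_labels)                             # grows over periods
    model = retrain(model, labelled_pool, pseudo_labels)           # semi-supervised update
    return model, labelled_pool
```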

After updating the model, we evaluate the model update efficiency and sensitivity to novel categories on a validation set. Specifically, we examine (1) the overall validation accuracy of each category after the update (that is, update performance), (2) the percentage of high-confidence predictions on validation (that is, saved human effort for annotation), (3) the accuracy of high-confidence predictions and (4) the percentage of novel categories that are detected as low-confidence predictions (that is, sensitivity to novelty). The optimization of the algorithm aims to minimize human efforts (that is, to maximize the high-confidence percentage) and to maximize model update performance and high-confidence accuracy.
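For concreteness, the four evaluation quantities can be computed from per-image validation records as in the following sketch; the array names and the convention for marking novel-category samples are illustrative assumptions, not part of the authors' code.

```python
import numpy as np

def update_metrics(y_true, y_pred, is_confident, is_novel):
    """Evaluate a model update on a validation set.

    y_true, y_pred : integer class labels per image
    is_confident   : True where the prediction was high-confidence
    is_novel       : True for images from categories unseen during training
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    is_confident, is_novel = np.asarray(is_confident), np.asarray(is_novel)
    known = ~is_novel

    # (1) average per-class accuracy after the update ('update performance')
    classes = np.unique(y_true[known])
    update_accuracy = np.mean([(y_pred[known & (y_true == c)] == c).mean() for c in classes])

    # (2) share of predictions accepted without human review ('saved human effort')
    high_conf_fraction = is_confident[known].mean()

    # (3) accuracy within the accepted, high-confidence predictions
    accepted = known & is_confident
    high_conf_accuracy = (y_pred[accepted] == y_true[accepted]).mean()

    # (4) share of novel-category samples flagged as low confidence ('sensitivity to novelty')
    novelty_detection = (~is_confident[is_novel]).mean()

    return update_accuracy, high_conf_fraction, high_conf_accuracy, novelty_detection
```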

Data specifications

Data categories

We manually identified a total of 55 categories (that is, species) in our data, including non-animal categories such as ‘ghost’ (misfired images lacking animals), ‘setup’ (images with a human setting up the cameras) and ‘fire’. There were 630,544 images in total. A full list of these categories is provided in Extended Data Fig. 1, along with the number of images associated with each category. Some vague categories that human annotators were unable to label accurately because of the varying quality of camera-trap images were also present, such as ‘unknown antelope’ and ‘unknown bird’.

Two groups of training and validation sets

To ensure sufficient training and validation data, we initially identified 41 of the most abundant categories in our camera-trap dataset. The remaining 14 of the 55 categories were all tagged as ‘unknown’ and used to improve and validate the model’s sensitivity to novel and difficult samples. We randomly split the 41 categories (by trigger events) into two groups of training and validation sets (26 categories in the first group of data and 41 in the second group) to mimic periodic data collection from two sequential time periods. Detailed training and validation split information is provided in the Methods.

Detailed pipeline for experiments

For experimental purposes, we separated our identification pipeline into two steps representing two time periods of data collection and the two groups of data curated in this project (Extended Data Fig. 3). The evaluation is focused on the second period when model update occurs. There are three major technical components in the framework: (1) energy-based loss39, which improves the sensitivity to possible novel and difficult samples for active selection, (2) a pseudo-label-based semi-supervised procedure40 for efficient model update from limited human annotations and (3) OLTR7, which balances the learning of a long-tailed distribution.

Period 1

In the first period, we pre-trained an off-the-shelf model (ResNet-50 model41) using the first group of data. After training, we adopted the energy-based loss39 and data from the 14 ‘left-out’ categories to fine-tune the classifier so it was more sensitive to novel and difficult samples.

Period 2

In the second period, we first used the fine-tuned model from period 1 to produce high- and low-confidence predictions from group 2 training data, which were considered to be ‘newly collected’. The confidence was calculated based on the Helmholtz free energy (details are provided in the Methods) of each prediction39. Novel and difficult samples were distinguished using a preset energy threshold. Then, low-confidence predictions were annotated by humans, while high-confidence predictions were accepted as pseudo-labels.

To update the model, we applied semi-supervised learning and OLTR, using both human annotations and pseudo-labels. Pseudo-label-based semi-supervised approaches iteratively update both the model and pseudo-labels until the best performance on the validation sets is achieved40. The use of pseudo-labels also enables the model to learn from the whole dataset instead of human annotated data only. On the other hand, OLTR approaches balance the learning between abundant and scarce categories through an embedding-space memory-based mechanism, where embedding memories of abundant categories are utilized to enhance the distinguishability of scarce categories that do not have enough samples to otherwise provide discriminative features7. The Methods provides details of these methods.

After the model was updated, the training samples from the 14 ‘left-out’ categories were added to fine-tune the model’s sensitivity to novel and difficult samples using energy-based loss, as in period 1.

Future periods

Because the framework is designed to aid long-term data collection and monitoring projects, the framework does not stop at period 2. As time progresses, new data are collected. Users simply have to repeat the steps in period 2 to pick out and annotate difficult/novel samples to update the model. In addition, because the framework is fully modular, when new techniques are developed, parts of the framework can be easily replaced for better performance. For example, if there are better methods for novel category detection, energy-based loss and confidence calculation can be replaced with no effect on the conceptual framework.

Results

Period 1

In the first period, the model achieved an 81.2% average class accuracy on the validation set of group 1 (Table 1); 79.5% of the image predictions were high-confidence and, among these predictions, the accuracy was 91.1% (Table 2). In terms of novel categories, in the validation phase the model successfully detected 90.1% of the novel samples belonging to the 14 categories that were left out of the training phase. In other words, 90.1% of the novel samples were predicted with low confidence. By contrast, direct softmax confidence (the most conventional way of calculating prediction confidence42) achieved a similar high-confidence accuracy to our model (91.5%), but detected only 59.3% of the novel samples.

Table 1 Classification performance comparisons on validation sets of periods 1 and 2
Table 2 Active selection performances of periods 1 and 2 with and without energy-based function

Period 2

On group 2 training data, the model pre-trained in period 1 predicted 78.7% of images with high confidence, at an accuracy of 92.4%, while 75.7% of the images from new categories in the group 2 training data were flagged as low-confidence predictions (Table 2). Because high-confidence predictions are accepted as accurate in our framework, 78.7% of the human effort of annotating the group 2 training data was saved.

To update the model, group 2 training data that had been predicted with low confidence were checked by human experts and given manual annotations, and high-confidence samples were assigned model-predicted pseudo-labels. Overall, on the validation set of group 2, the model updated on both human annotations and pseudo-labels had an average class accuracy of 77.2% over the 41 categories. Compared to our method without human annotation (69.2%; Table 1), this is an 8% improvement. The model produced 72.2% high-confidence predictions on the validation set, with an accuracy of 90.2% among those predictions (Table 2; see Table 3 for detailed per-category performance). In addition, it detected 82.6% of the novel samples (that is, flagged them as low-confidence predictions) in the validation data of the 14 left-out categories (last column of Table 1).

Comparison with traditional transfer learning

Our model (Tables 4 and 5) was substantially more data-efficient (that is, fewer data were required for the same performance) than traditional transfer learning methods in several respects (Fig. 2). Compared to traditional transfer learning, which uses full human annotations of group 2 training data, our method only involves human annotation of 21.3% of the group 2 samples. Even with less human annotation, our method still achieved a better overall class average accuracy (77.2% versus 75.8% for traditional transfer learning; Table 1). Our model also performed better than direct transfer learning for classifying the 15 new categories from group 2 (with an average of 4.2% accuracy improvement; Table 3).

Fig. 2: Label efficiency comparison with transfer learning on the group 2 validation set (ordered with respect to training sample size).

To examine label efficiencies (a measure of accuracy given the number of annotations) after we updated our model in period 2, we calculated the validation accuracy over the percentage of used training annotations for each category. In other words, we define label efficiency as Efficiency_i = Validation accuracy_i/(no. of training annotations_i/no. of full annotations_i), where i is the category index. The higher the value, the more efficient the model is at learning the corresponding category and the fewer training data are needed to achieve performance comparable to, if not better than, that of full manual annotation. We show the label efficiencies of all categories existing in the group 2 training and validation set. The blue bars represent our model’s label efficiencies for each category. The orange bars represent baseline efficiencies for comparison, where full annotations were used with the traditional transfer learning method (that is, no. of training annotations_i/no. of full annotations_i = 1). The blue and orange lines are annotation counts of each category, where orange represents full annotations and blue represents the human annotations actually used in our period 2 model update procedure. For categories that exist in both the group 1 and 2 training sets (that is, known categories; on the left, with a blue background), the efficiency is substantially higher than the baselines across all categories. For categories that only exist in group 2 datasets (that is, they were absent in the group 1 training and validation set and are novel categories; on the right, with an orange background), the model is designed to use as much training data as possible because of the novelty of these categories. In other words, the ratio no. of training annotations_i/no. of full annotations_i for these categories is close to 1. Our model still has relatively higher efficiency than the full-annotation transfer learning model across all the novel categories because it achieved higher validation accuracy with a similar amount of training annotations.

Table 3 Classification performance comparisons of period 2 by category between our method and fully annotated transfer learning

Practical deployment

Our new framework showcases the powerful potential of deep learning for long-term ecological applications while employing a novel practical approach that greatly reduces the manual annotation burden. To validate the practical benefits, we deployed the model to classify a new set of data gathered from the same camera-trap monitoring sites (Gorongosa National Park, Mozambique) after the group 1 and 2 datasets were collected (details are provided in the Methods). The new dataset is unannotated and unanalysed and contains 623,333 images in total. Images were predicted with the same active selection procedure, and 78.7% of the predictions were considered high-confidence. Thus, only 21.3% of these newly collected data required human annotation; that is, 78.7% of the human annotation effort (and its associated cost) was saved.

To validate the robustness of the model performance, two experts (K.M.G. and M.S.P.) checked the accuracy of 1,000 randomly selected high-confidence predictions (that is, those that were accepted as accurate). The model predictions were 88.6% accurate with respect to the expert classifications. An automatic accuracy of ~88% relative to expert identifications is already sufficient to help alleviate the data bottleneck encountered in typical camera-trap monitoring projects.

Table 4 List of the augmentation methods and corresponding parameters we used on our training data
Table 5 List of hyperparameters of our framework used in the two-period experiments

In terms of future model updates, the model can be further updated and validated on the new dataset using the same procedure as for period 2, where a new validation set can be created from a mix of previous validation sets (validation of groups 1 and 2) and the newly acquired human annotations. In addition, the same random verification of high-confidence predictions by human experts can be applied to avoid performance degradation (that is, an increase in misclassifications among high-confidence predictions).

Invasive and recolonizing species

One of the notable advances made by our framework is the ability to flag new or rare species that may have particular conservation importance. Our new dataset contained two species that were novel to the model (leopard and African wild dog), allowing us to test the model’s sensitivity to novel categories. The former naturally recolonized the study area and the latter were reintroduced as part of ongoing conservation efforts. There were 24 and 5 images of African wild dogs and leopards, respectively. The model successfully detected 20 (83.3%) of the African wild dog images and four (80.0%) of the leopard images, demonstrating its capacity to recognize important novel species in continuous monitoring periods.

Discussion

Failure cases

Two types of failure occur in our framework: (1) low-confidence predictions that are not novel species and (2) high-confidence predictions that differ from human-supplied annotations (Fig. 3).

There are several reasons why our model may be unable to identify samples from known species with high confidence (Fig. 3a). A common reason for low-confidence predictions was difficulty distinguishing animals from the background. For example, Fig. 3a(i) depicts an antelope obscured by darkness at night, making it difficult for the model to classify with confidence. Rather than forcing a misclassification, as would occur in traditional AI approaches41, our model flags such difficult samples as low-confidence predictions for further human evaluation (annotation) instead of assigning arbitrary labels, a practice that could bias downstream data analysis and inference.

Fig. 3: Failure cases.

a, Examples of low-confidence predictions. In most cases, the model has low confidence on images with distorted, partially visible (ii–v) or obscured animals (i). It can be incredibly difficult, if not impossible, for either humans or machines to accurately identify the animal species. b, Examples of high-confidence predictions that did not match the original annotations. Many high-confidence predictions that were flagged as incorrect based on validation labels (provided by students and citizen scientists) were in fact correct upon closer inspection by wildlife experts (K.M.G. and M.S.P.). For example, in (i), an empty image, originally mislabelled as baboon, was correctly classified by our method as empty. In (ii), although the animal is distant from the camera in a dark environment, the model successfully identifies hartebeest, while the human-supplied label is ‘unknown antelope’. In (iii), the model successfully identifies the elephant only based on the trunk and leg, while human volunteers originally classified the image as ‘unknown’. In (iv), a vervet monkey is correctly detected and classified in an image originally (incorrectly) labelled as warthog by human annotators. Panel (v) was originally classified as unknown by human annotators, but, based on the body shape and white markings on the rear, the model can correctly recognize the animal as bushbuck. Panel (vi) is an example where multiple species are in the same scene. Although the model does not have the capacity to deal with multi-species samples, as baboon is obviously the major component of this image, the prediction is reasonable. On the other hand, these examples above do not mean that the model always makes correct predictions when highly confident. Panels (vii) and (viii) are two typical examples where the model makes mistakes due to the obscured nature of these images. Red text indicates wrong and green text indicates correct. c, Two examples of image retrieval based on feature space similarity. Machine intelligence largely depends on visual similarity associations learned from large-scale datasets to classify animal species. These two examples illustrate image retrieval based on the Euclidean distances of the feature vectors (that is, outputs of the global average pooling layer of the ResNet model used in the project, which is of dimension 2,048 in Euclidean space). For each anchor image (the leftmost image of each row), we show the five closest (that is, most similar) samples in terms of Euclidean distance within the validation set of group 2. Green colour means correct predictions and red means wrong predictions (based on the original annotations). For example, in sequence (i), samples with similar visual appearance are usually from the same species (waterbuck). However, in sequence (ii), the two most similar images (according to our model) to the banded mongoose anchor image are actually not banded mongoose but slender mongoose. The model misclassified these two samples based on their similarities to the other banded mongoose images.

In the second type of model failure, images predicted with high confidence differ from the original annotations (Fig. 3b). We note that these images were originally classified by volunteers who were trained but may not have annotated all samples as accurately as wildlife experts. Surprisingly, most of these confident predictions proved to be correct after re-evaluation by human experts (K.M.G. and M.S.P.). For example, Fig. 3b(iv) was originally labelled as a warthog, although there is no warthog present. However, there is a vervet monkey in the lower left of the frame that was missed by the human classifiers. The model not only detects the previously unobserved animal but also correctly identifies the species.

Thus, these ‘failures’ actually demonstrate the robustness and flexibility of our framework. As both human annotations and machine predictions can be wrong, mutual interaction between human and machine can benefit the long-term performance of the recognition system. For example, picking out low-confidence samples like those in Fig. 3a prevents the production of low-quality predictions that can bias camera-trap analyses. Furthermore, applying validated human annotations to these samples can improve the identification capacity of the model, as it must learn to recognize more difficult samples during model updates. Conversely, when the model is highly confident, it can be more accurate than average human annotators, as evidenced by the examples in Fig. 3b(ii),(iv),(v). In other words, some human mistakes are prevented, improving the annotation quality for future model updates and camera-trap analyses. Because the model will nonetheless make some incorrect high-confidence classifications, periodic random verification of high-confidence predictions by human experts (as in the ‘Practical deployment’ section) can ensure that these errors do not propagate through repeated training.

The need for humans in the loop

Our framework demonstrates the unique merit of combining machine intelligence and human intelligence. As Fig. 3c illustrates, machine intelligence, when trained on large datasets to distil visual associations and class similarities, can quickly match visual patterns with high confidence37. Human intelligence, on the other hand, excels at recognizing fragmented samples based on prior experience, context clues and additional knowledge. Increasingly, we are moving towards applying computer vision systems to real-world scenarios, with unknown classes7, unknown domains8 and constantly updating environments. It is therefore crucial to develop effective algorithms that can handle dynamic data streams. Humans in the loop provide a natural and effective way to integrate the two types of perceptual ability (that is, human and machine), resulting in a synergy that improves the efficiency of the overall recognition system.

Extensions and future directions

Our framework is fully modular and can be easily upgraded with more sophisticated model designs. For example, models with deeper networks can be employed for better classification generalization, more sophisticated semi-supervised training protocols can be adopted for better learning from pseudo-labels, and better novelty detection techniques can be used for better active selection.

Future directions include extending our framework to handle multi-label and multi-domain scenarios. The current approach was developed for single-label recognition (that is, each image represents only a single species). In real-world camera-trap set-ups, it would be desirable to recognize multiple species within the same view. Furthermore, our framework is expected to be deployed in diverse locations with different landscapes; adding the ability to handle multiple environmental domains would therefore make our methodology more scalable than existing approaches. In addition, our method will be incorporated into a user-friendly interface, such that users without knowledge of Python can use it.

Methods

Data collection and annotation

The camera-trap data came from the WildCam Gorongosa long-term research and monitoring programme in Gorongosa National Park, Mozambique (18.8154° S, 34.4963° E)43. The data used in this study are from 2016–2019. Cameras were located in a mix of grassland, open woodland and closed forest habitats. K.M.G. placed 60 motion-activated Bushnell TrophyCam and Essential E2 cameras in a 300-km2 area in the southern area of the 3,700-km2 park. Each camera was mounted on a tree within 100 m of the centre of a 5-km2 hexagonal grid cell, facing an animal trail or open area with signs of animal activity. Cameras were set in shaded, south-facing sites that were clear of tall grass to reduce false triggers. Cameras took two photographs per detection (henceforth called a ‘trigger event’) with an interval of 30 s between trigger events. There were 630,544 images in total. The data distribution with respect to categories is reported in Extended Data Fig. 1. In terms of the data split for experimental purposes, detailed distributions of both group 1 and 2 are reported in Extended Data Fig. 2.

Data split

The dataset was randomly split into two groups of training and validation sets to mimic periodic data collection from two sequential time periods, along with an additional ‘unknown’ set for improving and validating the model’s sensitivity to novel and difficult samples. Because we set the cameras to capture one pair of images for each trigger event, image pairs within the same event were usually similar in appearance. To reduce bias, we split the dataset based on camera trigger events, such that both images in a paired trigger event were either in the training or the testing set. The training–testing split did not account for camera locations (that is, images from a given camera were present in both testing and training sets). In large-scale, long-term projects, camera locations are typically stable. In our study, the cameras cover most of the landscapes in the monitoring area and include a diversity of background types that change seasonally throughout the year. Possible distribution shifts in our dataset therefore come solely from temporal changes in the animal community rather than from spatial landscape/ecosystem changes.

The first group contained the 26 most abundant categories, and the second group contained all 41 categories. We randomly divided each group into training (80% of samples) and validation (20% of samples) sets. For scarce categories that had fewer than 80 images (for example, crested guineafowl, eland, lion and serval), we randomly selected 20 samples instead of 20% of the data to ensure the quality of the validation sets. The labels and distributions of these two groups of data are illustrated in Extended Data Fig. 2.

Within the 14 categories tagged as ‘unknown’, we randomly selected 80% of the data to fine-tune the model’s sensitivity to novel and difficult samples. We then used the remaining samples from the 14 categories as an extra validation set to evaluate the model’s novel image detection capacity.
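A minimal sketch of the event-level split described above is given below. The events mapping and the 20-sample floor for rare categories are illustrative assumptions about how the data might be organized; the actual split scripts may differ.

```python
import random
from collections import defaultdict

def split_by_trigger_event(events, val_fraction=0.2, min_val=20, seed=0):
    """Split trigger events (not individual images) into training and validation sets,
    so that the two near-duplicate images of an event never straddle the split.

    events : dict mapping event_id -> (category, [image paths])
    Rare categories receive at least `min_val` validation events (capped so that
    at least one training event remains).
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for event_id, (category, _images) in events.items():
        by_category[category].append(event_id)

    train_ids, val_ids = [], []
    for category, ids in by_category.items():
        rng.shuffle(ids)
        n_val = max(int(round(len(ids) * val_fraction)), min_val)
        n_val = min(n_val, max(len(ids) - 1, 1))   # keep at least one training event
        val_ids.extend(ids[:n_val])
        train_ids.extend(ids[n_val:])
    return train_ids, val_ids
```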

Implementation details

In this section, we report the implementation details of our method, which was developed in Python using PyTorch44 as the deep learning framework. The detailed experimental pipeline is illustrated in Extended Data Fig. 3.

Data pre-processing

All of the images used in this project were first resized to dimensions of 256 × 256. For training inputs, these images were randomly cropped and resized to 224 × 224. For validation and inference inputs, images were centre cropped to 224 × 224. Table 4 presents the list of data augmentations used for training and corresponding hyperparameters.
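These pre-processing steps map directly onto standard torchvision transforms, as in the sketch below; the augmentations listed in Table 4 would be appended to the training pipeline, and normalization statistics are omitted here.

```python
from torchvision import transforms

# Training inputs: resize to 256 x 256, then randomly crop to 224 x 224
# (the additional augmentations from Table 4 would be inserted before ToTensor).
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.ToTensor(),
])

# Validation/inference inputs: resize to 256 x 256, then centre crop to 224 x 224.
eval_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```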

Period 1 and baseline model training

There are two steps in this period: (1) baseline model training on group 1 data and (2) classifier fine-tuning using the 14 left-out categories for better sensitivity to novel and difficult samples.

For the baseline model we used ResNet-5041, pre-trained on ImageNet37 (a general-purpose object-recognition dataset) for model weight initialization. The pre-trained model was then trained on the group 1 training data, which comprised 26 categories. All the hyperparameters are provided in Table 5. Model weights with the best validation performance on the group 1 validation data were saved as the best model.
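A minimal initialization sketch, assuming a recent torchvision release whose bundled ImageNet weights stand in for the pre-training described above:

```python
import torch.nn as nn
from torchvision import models

NUM_GROUP1_CLASSES = 26  # categories in the group 1 training set

# Start from ImageNet-pre-trained ResNet-50 weights, then replace the final
# fully connected layer with a 26-way classifier for group 1 training.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_GROUP1_CLASSES)
```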

After training on group 1 data, we used energy-based loss39 and the 14 left-out categories (tagged as ‘unknown’) to fine-tune the classifier for better sensitivity to novel and difficult samples. The energy-based loss was calculated as

$$\begin{array}{l}{L}_{{{{\rm{energy}}}}}={{\mathbb{E}}}_{{x}_{{{{\rm{known}}}}} \sim {{\mathfrak{D}}}_{{{{\rm{known}}}}}^{{{{\rm{train}}}}}}{\left(\max \left(0,E({x}_{{{{\rm{known}}}}})-{m}_{{{{\rm{known}}}}}\right)\right)}^{2}\\ \hspace{0.17em}+\,{{\mathbb{E}}}_{{x}_{{{{\rm{unknown}}}}} \sim {{\mathfrak{D}}}_{{{{\rm{unknown}}}}}^{{{{\rm{train}}}}}}{\left(\max \left(0,{m}_{{{{\rm{unknown}}}}}-E({x}_{{{{\rm{unknown}}}}})\right)\right)}^{2}\end{array}$$
(1)
$$E(x)=-T\,\log \mathop{\sum }\limits_{i=1}^{K}{{\rm{e}}}^{\,{f}_{i}(x)/T}$$
(2)

where \({\mathbb{E}}\) is the expectation and \({x}_{{{{\rm{known}}}}}\) and \({x}_{{{{\rm{unknown}}}}}\) are samples from group 1 and from the 14 unknown categories, respectively. \({{\mathfrak{D}}}_{{{{\rm{known}}}}}^{{{{\rm{train}}}}}\) and \({{\mathfrak{D}}}_{{{{\rm{unknown}}}}}^{{{{\rm{train}}}}}\) represent the training datasets of group 1 and of the 14 unknown categories. E(·) is the Helmholtz free energy, calculated from the network outputs (logits) as in equation (2). \(f(\cdot ):{{\mathbb{R}}}^{D\times D}\to {{\mathbb{R}}}^{K}\) is the network that maps D × D images to K-dimensional vectors of class logits, with \({f}_{i}(x)\) denoting the ith logit. T is the temperature that regularizes the energy. \({m}_{{{{\rm{known}}}}}\) and \({m}_{{{{\rm{unknown}}}}}\) are two margins applied to the energies of known and unknown samples, respectively.

During fine-tuning, the cross-entropy loss and the energy-based loss are optimized jointly. Equation (3) gives the final loss, where w is the weight applied to the energy-based loss:

$$L={L}_{{{{\rm{cross}}}}\_{{{\rm{entropy}}}}}+w \, {L}_{{{{\rm{energy}}}}}$$
(3)

All hyperparameters are reported in Table 5.
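A PyTorch sketch of equations (1)–(3), following the energy-bounded loss of ref. 39, is given below; the margins, temperature and weight w correspond to the hyperparameters in Table 5 (values not restated here).

```python
import torch
import torch.nn.functional as F

def free_energy(logits, temperature=1.0):
    """Helmholtz free energy E(x) = -T * log sum_i exp(f_i(x) / T) (equation (2))."""
    return -temperature * torch.logsumexp(logits / temperature, dim=1)

def energy_loss(known_logits, unknown_logits, m_known, m_unknown, temperature=1.0):
    """Energy-bounded loss of equation (1): push the energy of known samples below
    m_known and the energy of unknown samples above m_unknown."""
    e_known = free_energy(known_logits, temperature)
    e_unknown = free_energy(unknown_logits, temperature)
    loss_known = torch.clamp(e_known - m_known, min=0).pow(2).mean()
    loss_unknown = torch.clamp(m_unknown - e_unknown, min=0).pow(2).mean()
    return loss_known + loss_unknown

def finetune_loss(known_logits, labels, unknown_logits,
                  m_known, m_unknown, w, temperature=1.0):
    """Total fine-tuning loss of equation (3): cross-entropy plus weighted energy loss."""
    return (F.cross_entropy(known_logits, labels)
            + w * energy_loss(known_logits, unknown_logits,
                              m_known, m_unknown, temperature))
```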

Period 2 and model update

Active selection and confidence calculation

Following ref. 39, confidence for active selection is calculated based on the Helmholtz free energy (equation (2)). Based on a preset energy threshold τ, predictions are separated into high- and low-confidence groups; that is, a prediction is considered confident if −E(x) > τ and unconfident otherwise. Based on prediction confidence, low-confidence predictions are assigned human annotations and high-confidence predictions are utilized as initial pseudo-labels for semi-supervised learning.
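This selection step can be written compactly as in the sketch below; the threshold tau and the temperature correspond to the hyperparameters in Table 5, and the negative free energy of equation (2) is recomputed inline.

```python
import torch

@torch.no_grad()
def active_selection(model, images, tau, temperature=1.0):
    """Split a batch into high- and low-confidence predictions using the negative
    free energy -E(x) = T * logsumexp(f(x)/T) as the confidence score."""
    model.eval()
    logits = model(images)
    neg_energy = temperature * torch.logsumexp(logits / temperature, dim=1)
    predictions = logits.argmax(dim=1)
    confident = neg_energy > tau   # False -> route the image to human annotators
    return predictions, confident
```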

Pseudo-labels and semi-supervised learning

Pseudo-label semi-supervision utilizes both human annotations and pseudo-labels to update the model. In the original approach, where models are randomly initialized, pseudo-labels are updated throughout training iterations40. In other words, at each iteration, the model predicts samples without human annotations and uses these predictions as pseudo-labels to train on the same samples with a stronger set of data augmentations. In our approach, as the pseudo-labels usually have higher quality than random predictions, we set three semi-update repeats and only update the pseudo-labels at the beginning of each repeat using the best model from the previous repeat. Specifically, within each semi-update repeat, the model is updated with a fixed set of pseudo-labels for a fixed number of training epochs. Model weights with the best validation performance are saved and, at the end of the repeat, the best model is used to predict the samples without human annotations to produce a new set of pseudo-labels, after which a new repeat is started. Only the model weights with the best validation performance across the three repeats are kept, and the number of repeats is a hyperparameter that can be tuned using validation data. Other hyperparameters are provided in Table 5.
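The repeat structure can be sketched as follows, with train_one_repeat, predict_labels and evaluate standing in for the training loop, inference pass and validation metric; these callables and the bookkeeping are illustrative, not the released implementation.

```python
def semi_supervised_update(model, human_data, unlabelled_data,
                           train_one_repeat, predict_labels, evaluate, n_repeats=3):
    """Pseudo-label-based semi-supervised model update with a fixed number of repeats.

    Within a repeat the pseudo-labels stay fixed while the model trains for several
    epochs; at the end of each repeat the best model so far re-predicts the
    unlabelled pool to refresh the pseudo-labels for the next repeat.
    """
    best_model, best_score = model, evaluate(model)
    pseudo_labels = predict_labels(best_model, unlabelled_data)

    for _ in range(n_repeats):
        # Train on human annotations plus the current (fixed) set of pseudo-labels;
        # train_one_repeat is assumed to return the best weights within the repeat.
        candidate = train_one_repeat(best_model, human_data, pseudo_labels)
        score = evaluate(candidate)
        if score > best_score:
            best_model, best_score = candidate, score
        # Refresh the pseudo-labels with the best model before the next repeat.
        pseudo_labels = predict_labels(best_model, unlabelled_data)

    return best_model
```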

OLTR

OLTR is an additional component in our framework targeting the long-tailed distribution of classes in the datasets. Generally speaking, it uses embedding-level memory of each category to enhance the distinguishability of scarce categories. It is based on the idea that many mid-level visual features (that is, feature embeddings) are shared between similar categories (for example, most antelopes share similar body shapes). Because the model can usually learn high-quality feature embeddings from abundant species, memory selection techniques allow the model to select relevant feature embeddings to help improve the distinguishability of scarce categories. We apply OLTR directly in our framework; for a detailed explanation of OLTR, see ref. 7.
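For intuition, the sketch below shows a heavily simplified version of OLTR's memory-augmented ('meta') embedding, in which a learnable per-class memory is attended to and gated into the direct feature. The full method in ref. 7 additionally uses centroid-based memory, modulated attention and a reachability term, so this should be read as an illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class MemoryMetaEmbedding(nn.Module):
    """Simplified sketch of a memory-augmented embedding in the spirit of OLTR."""

    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.memory = nn.Parameter(torch.zeros(num_classes, feat_dim))  # per-class memory
        self.attention = nn.Linear(feat_dim, num_classes)               # attend over memory
        self.selector = nn.Linear(feat_dim, feat_dim)                   # concept selector

    def forward(self, direct_feature):
        # Borrow structure from the memory (learned mainly from abundant classes)...
        coeffs = torch.softmax(self.attention(direct_feature), dim=1)
        memory_feature = coeffs @ self.memory
        # ...and gate how much of it is blended into the direct feature.
        gate = torch.tanh(self.selector(direct_feature))
        return direct_feature + gate * memory_feature
```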

Comparison to unsupervised and self-supervised learning

Although unsupervised learning and self-supervised learning have recently made substantial progress45,46 in learning without human annotations, these learning methods still have difficulties handling novel categories and categories with trivial differences (that is, fine-grained categories)47. This is because current unsupervised and self-supervised learning methods rely on human-defined random data augmentation (for example, cropping and rotation) to mimic intra- and interclass variations, while real-world novel and fine-grained categories often possess complex intra- and interclass distributions. In this work we advocate the use of humans in the loop to provide valuable supervision in a data-efficient manner. Together with semi-supervised learning, our framework can reliably recognize new species with only sparse human annotations.

Additional results

Detailed results of model update performance are listed by category in Table 3.