Introduction

Enabled by the continued reduction in cost of computational hardware and breakthroughs in deep neural network architectures and training paradigms, data-driven deep learning approaches now represent the state of the art in almost all computer vision applications1,2. This success has been achieved in discriminative applications such as classification3, detection4, pose-estimation5, and semantic segmentation6, as much as in generative applications, as demonstrated by recent advancements in diffusion networks which can create stylised and near photo-realistic images from text prompts7,8. Both discriminative and generative approaches have in common that they primarily involve supervised learning, which, to an extent, resembles high-dimensional interpolation: achieving generalisability is practically synonymous with ensuring that inputs at training time reasonably resemble those encountered at inference time. As an illustrative example, successful detection requires that instances of the target class are identified regardless of image context and subject appearance4; the ideal detector is subject-specific, but domain-invariant. Large, curated and annotated datasets—such as those provided by ImageNet9, COCO10, or CIFAR11—are indispensable in this process, as they provide a basis for learnable real-world principles, and complex testing grounds.

A prime area of application for the emerging machine learning toolset is animal behavioural research12,13,14,15,16,17,18, where it promises to reduce time costs, increase statistical power, and minimise potential for human bias; machine learning may altogether revolutionise what is possible in ethology13,18, and its intersection with neuroscience16,19,20, morphology21, locomotion13,18,22, and conservation23. Despite the divergence in the questions tackled, all applications in these research areas have in common the need for annotated training data. Unfortunately, datasets of a size and quality required to achieve robust domain-invariant inference are rarely available, and—apart from a few model species such as mice or Drosophila—the effort required to curate them often outweighs the immediate benefit of the enabled automation. Transfer learning—i.e., pre-training (parts of) a network on a separate, much larger, dataset, and refining the network on a small number of hand-annotated images—is a strategy that has been implemented with great success in markerless animal pose-estimation16,17,20. However, the price paid for the substantial reduction in the necessary amount of hand-annotation is that the resulting networks are typically only performant under stereotyped conditions, and frequently require extensive input pre-processing. Even minute deviations from the refinement data—for example in the form of partial occlusion or changes in specimen appearance, lighting, background, perspective, or camera type—can result in a substantial drop in network performance. As a result, transfer-learning strategies perform best in well-controlled recording conditions, and additional refinement is required to analyse more variable footage from the gold standard of behavioural studies—field experiments. Although refinement with relatively few hand-annotated samples—of the order of a few hundred to a few thousand—can enable accurate inference under field conditions13,17,20,23,24,25,26, large appearance deviations from the hand-annotated examples—for example, due to changes in weather conditions, recording background, the time of day, or varying camera perspective—typically decrease performance considerably17,19,23,27: networks learn latent features specific to the recording environment, rather than a general subject-specific understanding. Some of these generalisation issues can be addressed through data augmentation, i.e. the application of image perturbations with the aim of altering image appearance while retaining its meaning and label4,28,29. For example, the contents of an image remain identifiable after changes to its rotation, scale, hue, and resolution. More sophisticated augmentation strategies, such as style transfer, can further improve network robustness30,31,32. Alternatively, where large volumes of unlabelled data are available, self-supervised approaches may be employed to learn consistently identifiable features24,33. But these features may then be distinct from case-specific points of interest, in some sense just passing the baton of key-point extraction further down the analysis pipeline. Currently, even extensive augmentation and unsupervised or self-supervised strategies still pale in efficacy compared to simply using larger and more varied datasets in supervised approaches4,28,29.
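As a minimal illustration of such label-preserving augmentation, the sketch below uses torchvision purely by way of example (it is not one of the cited tools), and the image path is a placeholder:

```python
# Minimal, label-preserving augmentation sketch using torchvision (for illustration
# only; the file name is a placeholder). Each transform perturbs appearance,
# but the class label of the image is unchanged.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=30),                          # rotation
    transforms.RandomResizedCrop(size=512, scale=(0.5, 1.0)),       # scale / resolution
    transforms.ColorJitter(brightness=0.3, contrast=0.3, hue=0.1),  # hue / exposure
])

image = Image.open("ant.jpg")   # placeholder input image
augmented = augment(image)      # same label, different appearance
```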

In robotic34,35,36, human37,38,39,40, and automated driving41,42,43 applications, annotated datasets comprising billions of images can now be produced “synthetically”, i.e. through simulation with a computer. By placing 3D models in simulated environments, variable and annotated datasets can be generated at scale, and at a fraction of the cost and time required for hand-annotation of real images39,40,42,44. The use of synthetic data is particularly attractive where annotated real datasets are practically absent or of insufficient size, as is the case for almost all non-human animal studies22,30,31,32,45,46,47,48,49,50. However, for all its conceptual attractiveness, using synthetic data is not without problems: the simulated images must bridge the “simulation-reality gap”, i.e. they must be comparable in appearance to real images; as before, the key challenge remains that the training data must represent a superset of the inputs received at inference time22,27,47,48,49. As an illustrative example, Arent et al.22 modelled Indian stick insects as a rigid body consisting of simple geometrical shapes to improve the performance of a DeepLabCut20 pose estimator. Such simplified geometric approaches can improve performance, but remain restricted to stereotyped recording settings, simple animal morphology, and a single output data type. More complex approaches have used hand-animated or learned motion priors, or combined low-fidelity synthetic data with style- or domain-transfer networks to close the simulation-reality gap27,30,31,47,48,49,50. These approaches, however, remain labour-intensive, are tied to specific species, offer limited annotation options, or still require extensive real image datasets to generalise to real examples. Comprehensive and generalisable approaches which utilise more realistic animal representations, handle large digital animal populations, can create highly variable environments, and provide options for complex annotation, remain absent.

Here, we address this gap and present a synthetic dataset generator, replicAnt, implemented in Unreal Engine 5, a 3D computer graphics game engine, and Python. replicAnt can be used to simulate the appearance of animals in complex, procedurally generated environments with only a few mouse clicks. Leveraging recent advancements in photogrammetry, real-time ray tracing, and high-resolution mesh handling, replicAnt runs on consumer-grade computational hardware, automatically produces rich image annotations, and can simulate virtually any recording conditions, including variations in camera model and perspective; individual number, size, pose, and colouration; scene lighting; image resolution and magnification; and environment appearance. We demonstrate the versatility and utility of replicAnt by using the synthetic data it generates to train deep neural networks for automatic inference in four common animal applications: (1) detection—localising animals in an image; (2) tracking—retaining the identity of animal detections across continuous frames; (3) markerless pose-estimation—extracting the coordinates of user-defined body landmarks; and (4) semantic and instance segmentation—determining which areas of an image correspond to an animal on a pixel level.

Results

replicAnt

replicAnt uses 3D models of animals to produce a user-defined number of annotated images. It is designed to generate large and variable datasets involving hundreds of animals with minimal user effort; due to the rich and automated annotation, a single synthetic dataset can then be used to train a variety of deep neural networks. replicAnt requires: 3D models of the study organism(s); the installation of a pre-configured Unreal Engine project; and custom-written data parsers, used to translate the generated data into formats compatible with the deep learning-based computer vision system(s) of choice (see Fig. 1).

Fig. 1: replicAnt is a toolbox designed to procedurally generate and automatically annotate image samples from 3D animal models.
figure 1

The combination of images and annotations constitutes “synthetic data” which can be used in a wide range of deep learning-based computer vision applications. a replicAnt requires digital 3D subject models; all but one subject model used in this work were generated with the open-source photogrammetry platform scAnt51. Each model comprises b a textured mesh, c an armature, defined by virtual bones and joints, to provide control over animal pose, and d a low-polygonal collision mesh to enable interaction of the model with objects in its environment. e 3D models are placed within environments procedurally generated with a pre-configured yet customisable project in Unreal Engine 5. f Every scene consists of the same core elements, configurable via dedicated randomisation routines to maximise variability in the generated data. 3D assets are scattered on a ground of varying topology; layered materials, decals, and light sources introduce further sources of variability across scene iterations (see examples in Figs. 2–6). From each scene, we generate g image, h ID, i depth, and normal passes, accompanied by j a human-readable data file which contains annotations and key information on image content (see “Methods” for details). Synthetic datasets generated with replicAnt can then be parsed to train networks for a wide range of computer vision applications in animal behavioural research, including k detection, l tracking, m 2D and 3D pose-estimation, and n semantic segmentation.

replicAnt is agnostic to the origin of the subject 3D model(s) used as input. Throughout this work, we use high resolution 3D models produced with the open-source photogrammetry platform scAnt (Fig. 1a)51; but we also demonstrate that simpler hand-sculpted models can suffice for some applications. In general, the higher the 3D model fidelity, the higher the application flexibility.

Depending on their origin, 3D models may need to be cleaned, and—if the randomised pose variation feature of replicAnt is to be used—virtual bones and joints need to be assigned, and their range of movement defined (see Fig. 1c, “Methods—3D subject models”, and the replicAnt GitHub https://github.com/evo-biomech/replicAnt). In this paper, we focus on insects, first because of personal predilection, and second because an exoskeleton avoids the need to simulate the complex soft tissue deformation associated with postural changes in animals with endoskeletons. However, powerful approaches to create photo-realistic models of vertebrates exist52,53,54, and replicAnt is not limited to arthropod models (or even just animals, for what it is worth). The cleaned and rigged model is imported into a pre-configured Unreal Engine 5 project, where a simplified collision mesh is computed to enable interactions with objects inside the simulated world (Fig. 1d).

Next, a customisable digital “population” is generated by simulating multiple instances of the original subject model. Variation between subject instances is achieved through simple appearance modifications, such as changes in brightness, contrast, hue, saturation, and scale. The range of these modulations can be adjusted through a simple user interface, and custom modifications can be added. Subjects are later sampled at random from this population, and placed into procedurally generated environments, from which annotated images are extracted.

Each scene is generated in a hierarchical process, structured into five customisable levels to maximise computational efficiency; changes in lower-level hierarchical elements influence higher-level elements (Fig. 1e). At the lowest level of scene hierarchy sits a ground plane with random topology. At the second level, this ground plane is populated with 3D assets; polygon meshes of objects such as plants, rocks, and common household items, all of random size. Assets are drawn from a curated library, and placed by a configurable number of asset scatterers. At the third level, the ground and each asset are assigned Physically Based Rendering materials, generated by blending randomly generated patterns with a curated texture library. Large material maps, or decals, are generated and wrapped around the ground plane and all assets to achieve a cohesive scene appearance. At the fourth level, a configurable number of subjects from the model population are placed at random locations, and their pose is adjusted via inverse kinematics, such that they can interact with the surrounding meshes. At the fifth and highest hierarchical level, scene lighting is introduced in the form of a configurable number of coloured light sources and High Dynamic Range Images (HDRIs), and a virtual camera with randomisable extrinsics, intrinsics, and post-processing parameters is placed; the scene generation is now complete (Fig. 1f).

Using the virtual camera, “image passes” are exported from each scene iteration. Each pass encodes different information (Fig. 1g–j), for example the optical image render itself, or depth information (for details, see “Methods”). User-defined passes can be added as required. Each image pass set is supplemented by a data file which contains configurable annotations, for example subject bounding boxes, 2D and 3D key point coordinates, class labels, or camera intrinsics and extrinsics. The combination of image passes and data files constitutes synthetic data which can be used to train deep neural networks for various computer vision tasks (Fig. 1k–n).
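By way of illustration, a parser might convert one such bounding-box annotation into the normalised text format expected by YOLO-style detectors; the JSON field names below are hypothetical placeholders, as the exact schema is documented on the replicAnt GitHub:

```python
# Illustrative parser step: convert bounding-box annotations from a per-scene data
# file into the normalised YOLO text format (class cx cy w h). The JSON field names
# are hypothetical placeholders; the exact schema is documented on GitHub.
import json

def bbox_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h, class_id=0):
    cx = (xmin + xmax) / 2 / img_w
    cy = (ymin + ymax) / 2 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

with open("scene_0001.json") as f:       # hypothetical per-scene annotation file
    scene = json.load(f)

lines = [bbox_to_yolo(*s["bbox"], scene["width"], scene["height"])
         for s in scene["subjects"]]     # hypothetical keys
print("\n".join(lines))
```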

The entire process, including the generation of a user-specified number of scene iterations, image pass rendering, and data file writing, is fully automated, but leaves open plenty of opportunity to introduce variation with minimal effort. The pre-configured Unreal project, detailed documentation, and additional resources are available from https://github.com/evo-biomech/replicAnt.

Applications

In order to demonstrate that the synthetic data generated by replicAnt is of sufficient quality to power applications in animal behavioural research, we used it to train various popular deep learning networks for animal detection, tracking, pose-estimation, and semantic and instance segmentation. The performance of these networks was then evaluated on dedicated example datasets. Unless stated otherwise, all synthetic data used for training was generated using replicAnt’s default settings (see “Methods—Data parsers” and GitHub for details). We will now show that replicAnt significantly improves the trained networks’ ability to generalise to unseen conditions; in some cases, it removes the need for hand-annotation altogether, and in others, it may present the only option to generate datasets large enough to train robust and performant networks in reasonable time.

Detection

A digital population of Atta vollenweideri leafcutter ants (Forel 1893), comprising 100 simulated individuals, was created using 3D models of a minor, media, and major worker, all generated with scAnt51 (Fig. 2b, see “Methods—Detection” for details). This population formed the basis for two synthetic datasets, each encompassing 10,000 annotated images with a resolution of 1024 × 1024 px: one using all three 3D models (“group”), and one using only the largest model (“single”). Furthermore, to investigate the influence of synthetic dataset size on inference performance, networks were trained on 1% (“small”), 10% (“medium”), and 100% (“large”) of the “group” dataset. Dataset generation took about ten hours each for the full “group” and “single” datasets on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super).

Fig. 2: Performance of YOLOv4 detectors trained with real, synthetic and “mixed” data.
figure 2

a 3D models of leafcutter ant workers, created with scAnt51, form the basis of b a digital population comprising 100 individuals which differ in scale, hue, contrast, and saturation. replicAnt was then used to produce synthetic datasets with 10,000 annotated samples from this population. c Examples of image render passes (top row) and bounding box annotations (bottom row). d Test data were obtained with a laboratory setup consisting of an OAK-D OpenCV camera, which recorded foraging trails of Atta leafcutter ants from a top-down perspective. e Recording conditions were varied to produce five sub-datasets of varying difficulty (see “Methods” for details). f YOLOv44 networks converged more slowly when trained on synthetic data, indicating a more complex training task, but ultimately yielded higher mean Average Precision (mAP) scores. g Networks trained on real images from any sub-dataset perform poorly on out-of-domain recordings, as indicated by the low average precision (AP) of the detections (solid triangles). In notable contrast, networks trained solely on synthetic data achieved a strong detection performance throughout, likely because they have been exposed to considerably larger variation at training time (black circles). h The superior performance of networks trained on synthetic data is most apparent when the mAP is compared directly: The highest mAP of 0.951 ± 0.014 was achieved by a network trained on a mix of synthetic and real data (case sb1, 10,000 synthetic and 100 real images; see “Methods” and Supplementary Tables 1–6 for a full breakdown of each dataset including sample sizes). Remarkably, the second highest and most consistent mAP was achieved by a network trained solely on synthetic data (0.913 ± 0.001, marked with an asterisk). Error bars (f–h) indicate the standard deviation of the respective mean with fivefold cross-validation using different withheld data splits during training. Source data are provided as a Source data file.

The generated synthetic datasets were then used to train a commonly used object detector, YOLOv44, subsequently tested on laboratory recordings of a crowded foraging trail (Fig. 2d). Foraging trails of Atta ants present an ideal example for complex detection tasks as individuals vary in size, trails are highly cluttered, and partial as well as full occlusions occur frequently. In order to introduce variation in scene appearance, akin to what may be expected in field conditions, scene lighting, exposure time, camera magnification and foraging trail background were altered systematically, yielding five different recording scenarios (Fig. 2e). For each recording scenario, 1000 frames, each containing between 36 and 103 individuals, were hand-annotated using BlenderMotionExport55. For comparison, we also trained detectors on 5000 of these hand-annotated images, using image combinations from the different recording scenarios. Five-fold cross-validation, with 80/20 splits between training and validation data, was used for all training (see “Methods” for details on test data and training schedules).

In general, detectors performed best on within-domain data, where they achieved close to perfect performance (Fig. 2g). The notable exception to this rule was close-up recordings, where the detector trained on synthetic data outperformed the within-domain network. However, the performance of detectors trained with real data dropped notably when they were used for inference on unseen images, despite the similarity in perspective (Fig. 2g, h). In sharp contrast, the detectors trained with synthetic data retained a robust and consistent performance throughout (Fig. 2g). To quantify this difference, the Average Precision (AP) was averaged across all unseen test cases, yielding a mean Average Precision (mAP, see Eq. (2)). The detectors trained exclusively with synthetic data achieved an mAP of 0.913 ± 0.0079, both higher and less variable than that of any detector trained with one of the five real sub-datasets (Fig. 2g). For comparison, the best real data detector was trained on noisy images, and achieved an mAP of 0.878 ± 0.0258. Networks trained exclusively on synthetic data converged more slowly and exhibited an overall higher loss during training compared to any set of real training images, indicating a higher level of complexity of the generated images (Fig. 2f). These results indicate that the large volume and variability of synthetic data substantially increase the robustness of detections; supplementing training datasets with synthetically generated samples may be a suitable strategy to significantly reduce the hand-annotation required to achieve benchmark performance, and can improve the ability of networks to generalise to unseen conditions. To test these ideas, detectors were trained on “mixed” datasets, containing both real and synthetically generated images (see “Methods” for details). Networks trained with a 10,000/100 synthetic/real split (“sb1”) achieved an mAP of 0.9501 ± 0.014, close to the benchmark performance (Fig. 2g). A more extensive quantitative comparison of the performance across inference cases is provided in Supplementary Table 5. In order to confirm that synthetic data enables networks to recognise ants specifically and not just objects with similar appearance, we tested detectors trained with 3D models of desert termites (Gnathamitermes sp., see below), which resulted in a negligible mAP of 0.007 ± 0.005 (Fig. 2g).
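Written out, the mAP reported here is simply the AP averaged over the N unseen test cases (cf. Eq. (2)):

$$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_i .$$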

Next, we sought to demonstrate that high model fidelity is not required for detection tasks, which typically involve low-magnification recordings. Instead, even simpler hand-sculpted models can be used to train performant networks, powered by the large volume and variability of training images that can be generated with replicAnt. We procured a test dataset of 1000 consecutive frames of 49 freely moving desert termites, Gnathamitermes sp., recorded in the field and hand-annotated using BlenderMotionExport55 to provide a simple benchmark (Fig. 3 and “Methods” for details). Two 3D models, one of a worker and one of a soldier, were hand-sculpted from reference images using Blender v3.1. A YOLOv44 network, trained on a dataset of 10,000 synthetically generated images with a resolution of 1024 × 1024 px (Fig. 3a–c, see Supplementary Table 4 for details), achieved an AP of 0.956 ± 0.001 on the annotated recordings, and produced accurate detections in qualitative test cases (Fig. 3e, f).

Fig. 3: Even low fidelity 3D models can be used to train performant networks for low-magnification applications such as animal detection.
figure 3

a 3D models of a worker and a soldier desert termite (Gnathamitermes sp.) were sculpted and textured from reference images in Blender v3.1. b A digital population comprising 80 workers and 20 soldiers, each with randomised scale, hue, contrast, and saturation, was generated from these models (see Supplementary Table 4), and used within replicAnt to generate a synthetic dataset with 10,000 annotated images. c Examples of image render passes (top row), and bounding box annotations (bottom row). d Test data were recorded in the field, using a Nikon D850 and a Nikkor 18–105 mm lens. e Example frames demonstrate the high precision of a YOLOv4 detector trained exclusively on synthetic data; only a small number of occluded or blurry individuals were missed, and few false positives were produced (confidence threshold of 0.65, and non-maximum suppression of 0.45). f The YOLOv44 network converged after about 20,000 iterations, and achieved an Average Precision (AP) of 0.956 ± 0.001, retrieved from returned detections on 1000 hand-annotated frames of 49 freely moving termites (see “Methods”). Error bars (f) indicate the standard deviation of the respective mean AP, computed every 1000 training iterations on the unseen real data with fivefold cross-validation using different withheld synthetic data splits during training. Source data are provided as a Source data file.

Multi-animal tracking

Sufficiently precise detectors can in principle be used to build simple, yet robust and performant trackers. To facilitate the use of replicAnt-trained detectors in tracking applications, we introduce OmniTrax56, an open-source Blender add-on. OmniTrax allows users to conduct interactive detection-based buffer-and-recover tracking using imported YOLO detector networks4,57, and multi-animal pose-estimation, using DeepLabCut (see below, and ref. 20); it also provides extensive annotation options (see Supplementary Videos 1–4 and 7). Tracking is achieved by linking YOLO detections across frames via Kalman-Filtering and the Hungarian method for track association58. To assess the performance of this simple tracking architecture, we imported the best performing detection networks trained exclusively on synthetic data into OmniTrax, and tracked laboratory and field recordings of A. vollenweideri leafcutter ants and Gnathamitermes sp. desert termites (Fig. 2). The ant detector tracked between 61 and 103 A. vollenweideri ants over 1000 frames at 30 fps, equivalent to real-time inference on a consumer-grade laptop (6 core CPU, 16 GB RAM, RTX 2070); the desert termite detector tracked 49 individuals across 1000 frames. Default tracker settings were used for both test cases.
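A minimal sketch of the association step underlying this buffer-and-recover approach is given below; scipy's Hungarian solver stands in for the OmniTrax implementation, and the Kalman-filter prediction is replaced by the previous track centre for brevity:

```python
# Minimal sketch of detection-to-track association with the Hungarian method.
# OmniTrax additionally predicts each track's position with a 2D Kalman filter;
# here, the previous track centre stands in for that prediction.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_centres, det_centres, max_dist=50.0):
    """Match existing tracks to new detections by centre-to-centre distance."""
    cost = np.linalg.norm(track_centres[:, None, :] - det_centres[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)                  # Hungarian method
    matches = [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
    matched_dets = {c for _, c in matches}
    new_tracks = [c for c in range(len(det_centres)) if c not in matched_dets]
    return matches, new_tracks                                # unmatched detections spawn new tracks

tracks = np.array([[100.0, 120.0], [400.0, 80.0]])            # previous track centres (px)
detections = np.array([[104.0, 118.0], [398.0, 85.0], [250.0, 300.0]])
print(associate(tracks, detections))                          # -> ([(0, 0), (1, 1)], [2])
```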

The ant tracker achieved Multiple Object Tracking Accuracy (MOTA) scores of 0.901, 0.945, 0.859, and 0.821 in the “base”, “dark”, “bright”, and “noisy” cases, respectively (see Eq. (3), ref. 59, and Fig. 4). Most ID switches were caused by track fragmentation, which can be avoided by refining tracker settings, or through simple manual corrections within OmniTrax. An extensive quantitative comparison of the performance across inference cases is provided in Supplementary Table 5. The desert termite tracker achieved a MOTA of 0.96. Only two true ID switches occurred; the remaining errors reflect partially fragmented tracks, and a single out-of-focus animal which was not registered (Fig. 4). Thus, and despite its structural simplicity compared to other recent approaches16,60,61,62, the detection-based tracker powered by replicAnt and implemented in OmniTrax can track a large number of animals in crowded and open scenes—without the need to hand-annotate a single image.
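For reference, Eq. (3) follows the CLEAR-MOT convention of ref. 59, which penalises false negatives (FN), false positives (FP), and identity switches (IDS) relative to the total number of ground-truth objects (GT), summed over all frames t:

$$\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDS}_t\right)}{\sum_t \mathrm{GT}_t} .$$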

Fig. 4: Detectors trained exclusively on synthetic data can be used for multi-animal buffer-and-recover tracking.
figure 4

a A YOLOv4 network4 was trained on synthetic images generated from a single Atta vollenweideri specimen, and then used within OmniTrax56 to automatically track individual ants in b well-exposed, c noisy, d dark and e bright (over-exposed) footage (left column). Detections are provided by the YOLOv4 network4, and linked across frames through a simple buffer-and-recover approach, using a 2D Kalman-Filter implementation and the Hungarian method for track association cost assignment58. b–e Tracking performance was evaluated by tracking between 61 and 103 individuals, which could freely enter and exit the recording area, over 1000 frames, with up to 62 animals present simultaneously. ID switches (IDS) are marked in red and mostly occur at the entrances to the recording site. They typically result from track fragmentation due to prolonged occlusion, and can thus be easily excluded from further analysis (middle column). Tracking performance deteriorates with overexposed images (e), fuelled by a combination of motion blur and drastic changes in appearance due to exposure clipping. Relative occupancy density maps visualise cluttered areas, path preferences, and static individuals (right column). f, g A YOLOv4 network4, trained exclusively on synthetic images of desert termites (Gnathamitermes sp.) was used to track 49 desert termites across 1000 frames from field recordings. Only two true identity switches occurred (IDS, entrance at the top right corner); all other IDS are the result of fragmented tracks. Source data are provided as a Source data file.

Pose-estimation

Animal pose-estimation typically leverages transfer-learning. Although excellent performance is possible in controlled settings, the characteristically small training datasets usually fail to provide the variability required for generalisation to unseen recording conditions. As a result, performance is extremely sensitive to changes in scene or specimen appearance15,16,17. The key problem is that maximising performance through overfitting of perspective-dependent latent features leads to domain-dependence. Our aim is to overcome this limitation by leveraging the large and variable synthetic datasets produced by replicAnt to embed an improved subject understanding into the pose-estimation networks. In other words, we ultimately seek to train a single generalist network, rather than several scene- and perspective-dependent specialists, as is currently best practice16,17,20.

To move towards this aim, we used a 3D model of a sunny stick insect (Sungaya inexpectata Zompro 1996, first instar) to generate 10 sub-datasets with different randomisation seeds, each characterised by 70% scale variation as well as shifts in hue, brightness, contrast, and saturation, and each comprising 1000 samples. These datasets were combined into one single-animal synthetic dataset encompassing 10,000 images at a resolution of 1500 × 1500 px; the automated annotations included the location of 46 key points distributed along the body (Fig. 5a–c, see “Methods” and Supplementary Fig. 1 for further details). Dataset generation took about three hours on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super).

Fig. 5: Performance of DeepLabCut (DLC) markerless pose-estimators20 trained on real, synthetic and mixed datasets.
figure 5

a, b A digital population of 10 individuals was generated from a single 3D model of a Sungaya inexpectata stick insect, created with the open-source photogrammetry platform scAnt51. Population subjects differed in scale, hue, contrast, and saturation, and formed the basis for a synthetic dataset. c Examples of image render passes and key point annotation; the locations of the 46 key points are provided in Supplementary Fig. 1. A DeepLabCut20 network with a ResNet101 backbone was then pre-trained on the synthetic dataset for 800k iterations, and fine-tuned on splits of real samples for a further 800k iterations (see “Methods” and Supplementary Tables 8–11 for details). Two hand-annotated datasets serve as test cases: d, e five synchronised machine vision cameras mounted to a tiltable platform recorded walking stick insects in controlled conditions, so that only camera perspective varied; h handheld cell phone recordings of walking stick insects across the laboratory, which include partial occlusions, and variations in background and lighting. f, g Networks pre-trained on synthetic data and refined with just 10 samples per camera perspective achieved a mean relative error of 8.14%, and a mean pixel error of 37.69 px across orientations—lower than the benchmark performance achieved by the network trained on the full dataset of 805 total example images (mean relative error of 10.90%, mean pixel error of 50.57 px). The asterisk, *, indicates networks for which ≤50% of the inferred key points were above the confidence threshold of 0.6. i The performance of pose-estimation networks trained with real data deteriorated drastically when they were put to work on recordings of the same species obtained under different conditions, demonstrating their reliance on specimen-independent features. This domain-specificity is most evident in the handheld recordings, where camera orientation, lighting, and background change continuously: networks trained on real platform data performed poorly, but networks trained solely on synthetic data approach benchmark performance with the addition of just five hand-annotated example images (i, refine 25%), indicating a stronger specimen-specific understanding (see Supplementary Videos 5–7). Source data are provided as a Source data file.

The synthetic dataset was then used to train a DeepLabCut20 (DLC) pose estimator with a ResNet101 backbone. We emphasise that other excellent markerless pose estimators, such as SLEAP16 or DeepPoseKit17, exist, and merely use DLC by way of example. The best tool will likely depend on the specific use-case. To test pose-estimation performance, two datasets of walking sunny stick insects were curated as test cases (Fig. 5). One dataset, denoted “platform”, represents a typical controlled case, where lighting and image background are constant, and only camera perspective varies: S. inexpectata were recorded walking across an evenly lit, tiltable platform, at 55 fps with five synchronised machine vision cameras (Fig. 5d, e). From these recordings, all 46 key points were hand-annotated in each of 805 frames (see Supplementary Table 10 for split details). The second dataset, denoted “handheld”, consists of 200 hand-annotated frames from ten handheld videos (20 frames per video), recorded with a cell phone at 25 fps (Fig. 5h). This dataset represents an uncontrolled case with variable recording conditions, and includes motion blur, perspective and magnification changes, out-of-focus frames, and frequent partial occlusions.

A DLC network trained on frames from all camera perspectives achieved a benchmark mean relative error on platform data of 10.9% across all camera views (Fig. 5g, see Eq. (4)). In remarkable contrast, networks trained exclusively on data from a single camera perspective produced a mean relative error of up to 86.4% for frames from unseen perspectives. This poor performance partially reflects domain-sensitivity, characteristic of many transfer learning approaches: the networks fail to generalise, because they have only been exposed to a small inference-specific dataset with limited variation.

A DLC network trained exclusively with synthetic data seemingly competes with the benchmark performance out-of-the-box: it achieved a mean relative error of 5.89% across all platform camera views (Fig. 5g). However, this low error is deceptive, as the network assigned low confidence scores to more than 50% of key points, which are thus excluded from the error estimate. Nevertheless, the network provides an excellent starting point for refinement: the provision of a mere ten hand-annotated frames per camera orientation suffices to estimate key points with a mean relative error of 8.14%—better than benchmark performance.

The limited ability of networks trained on real data to generalise becomes even more apparent when they are put to work on recordings of the same species, recorded under different conditions. The pose-estimation network trained on the full platform dataset achieved a mean relative error of 77.18% on the handheld dataset (Fig. 5g). Key points were frequently placed more than two body lengths away from the specimen, demonstrating that key point detection strongly relies on recording-specific latent features; the volume and variability of the supplied training data was insufficient to embed a general specimen-specific understanding. In sharp contrast, the network trained solely on synthetic data achieved a mean relative error of 6.25% on handheld recordings—more than an order of magnitude smaller. Refinement with five randomly sampled frames from each video resulted in a mean relative error of 5.03%, close to the benchmark performance of 3.55%, achieved by a network trained on the full handheld dataset (Fig. 5g, i). An extensive quantitative comparison of the performance across inference cases is provided in the supplementary information (see Supplementary Tables 10 and 11).

On the basis of the above, we conclude that the large sample size and variability afforded by synthetic data can meaningfully increase the domain-invariance and robustness of pose-estimation networks, and thus substantially reduce the required user effort: better or near-benchmark performance was achieved with 4-fold and 16-fold fewer hand-annotated samples in the handheld and the platform case, respectively (Fig. 5e–i, see also Supplementary Videos 5 and 6).

Semantic segmentation

We have demonstrated that synthetic data generated by replicAnt can substantially reduce the hand-annotation required to power accurate detection, tracking and pose-estimation, or even render it obsolete. Next, we show that it can enable inference in applications for which hand-annotation is so onerous that it is unlikely to be performed at the required scale for all but the most common objects: semantic segmentation, a computer vision task involving pixel-level classification (Fig. 6).

Fig. 6: Performance of semantic- and instance segmentation networks trained exclusively on synthetic data.
figure 6

a, b A population of Leptoglossus zonatus leaf-footed bugs was simulated from a 3D model of an adult, produced with the open photogrammetry platform scAnt51. c This population was used to generate two datasets, each encompassing 10,000 images: one with replicAnt’s default settings; the other with additional scatterers which placed 3D assets of plants to simulate the image content of typical macro photographs. Images show examples of the image (top row) and ID (segmentation) passes (bottom row), respectively. d, e Three deep convolutional semantic segmentation networks—Mask-R-CNN6, UperNet + SWIN Transformer63, and PSPNet64—were trained on the synthetic data, and their performance was assessed on a small number of hand-annotated d laboratory and e field macro-photographs. The highest Average Class-wise Recall (ACR) of 94.03% was achieved by the PSPNet architecture with a ResNet101 back-bone. The UperNet + SWIN Transformer network achieved an intermediate overall performance, reduced by fragmented masks and false positives. Segmentations produced by the Mask-R-CNN network had the lowest ACR of 82.3%, but as Mask-R-CNN additionally performs instance segmentation, it is attractive for images that contain multiple individuals. f, g To illustrate this feature, a Mask-R-CNN network was trained on 10,000 synthetic images of Atta vollenweideri ant workers (Fig. 2), and used to segment g crowded laboratory (left) and field photographs (right) of foraging Atta leaf-cutter ants.

replicAnt was used to generate a digital population of 20 leaf-footed bugs (Leptoglossus zonatus, Dallas, 1852), based on a 3D model of an adult specimen produced with scAnt51 (Fig. 6a, b). Two synthetic datasets were generated, each encompassing 10,000 images with 1024 × 1024 px resolution (Fig. 6c): one dataset used replicAnt’s default parameters; for the other, replicAnt’s asset library was supplemented with various plant models from the Quixel Asset library (Epic Games, Inc.), in order to simulate the image content of typical field macro-photographs. Data generation took between 6 and 10 h on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super) for the default and plant case, respectively.

Images from both datasets were combined to form a single dataset, used to train Mask-R-CNN6, UperNet + SWIN transformer63, and PSPNet64 networks (see “Methods” and Supplementary Table 12 for details). Producing large validation datasets is infeasible for highly specific semantic segmentation tasks—the very reason why synthetic data is so helpful in these applications. To provide an indicative performance metric, we extracted network masking accuracy for a small number of hand-annotated image examples via the Average Class-wise Recall (ACR; see Eq. (5)).
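Assuming the conventional definition of class-wise recall, the ACR of Eq. (5) averages the per-pixel recall over the C classes present in an image (here, subject and background):

$$\mathrm{ACR} = \frac{1}{C}\sum_{c=1}^{C} \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FN}_c},$$

where TP_c and FN_c denote the numbers of true-positive and false-negative pixels of class c.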

All trained networks were able to identify the majority of specimen pixels, and segmented few background pixels (Fig. 6d, e). Remarkably, PSPNet, the oldest tested architecture, produced the most accurate segmentations at both high and low magnification, and even in the presence of partially occluded or out-of-focus body parts: it achieved an ACR of 94.03% (Fig. 6d, e). Overall mask quality was lowest for Mask-R-CNN, which struggled with high-aspect-ratio appendages at higher magnification, and with images containing more background noise (ACR of 82.3%). This problem may in part be specific to the particular implementation of Mask-R-CNN, which was trained using lower resolution segmentation polygons instead of per-pixel segmentation mask encoding (see “Methods” for details). Mask-R-CNN does however additionally produce instance segmentations, useful where images contain more than one individual. To utilise this feature, a Mask-R-CNN network was trained on 10,000 synthetically generated images of leafcutter ants (Figs. 2a–c and 6f; see “Methods” for training details), and used to run inference on photographs of foraging Atta. The produced masks were of high quality, and contained few false positives (Fig. 6g).

Discussion

Deep learning-based computer vision methods promise to fundamentally alter what is possible in animal behavioural research12,13,14,15,16,17,18. A key remaining bottleneck is the “data-hunger” of supervised learning techniques: annotated datasets of the size and variability required to achieve robust, domain-invariant performance are rarely available, and in any case time-intensive to produce44,65. One strategy to overcome this limitation is to produce annotated data synthetically, using sufficiently realistic computer simulations30,31,32,39,40,42,44,47,48,49,50,66. In order to facilitate this process, we developed replicAnt: a synthetic data generator built in Unreal Engine 5 and Python. replicAnt is designed to run on consumer grade hardware, and can generate around 1000 annotated images per hour. We provide extensive documentation, parser scripts for popular deep learning frameworks, pre-trained networks for all listed applications, benchmark datasets, additional software to aid automated detection-based buffer-and-recover tracking and 2D pose-estimation56, and a growing library of ready-to-use 3D animal models.

The utility of replicAnt was demonstrated by using it to train deep neural networks for stereotypical tasks in animal behavioural research. In multi-animal detection, tracking, and semantic segmentation, networks trained exclusively with synthetic data achieved a performance sufficient to remove the need for hand-annotation altogether. In markerless pose-estimation, pre-training networks on synthetic data increased the subject-specific understanding of the networks, so enabling a reduction of the amount of hand-annotation required to achieve benchmark performance by more than one order of magnitude (Fig. 5g). The resulting reduction in time costs enables broad comparative studies—of key biological importance, but currently absent from the literature67. We hope that open sharing of 3D models, test data and trained networks will decrease the need for case-specific refinement, and eventually lead to powerful “generalist” networks.

Ample opportunity for expansion of replicAnt exists. For example, the combination of depth passes and camera intrinsics and extrinsics can in principle be used to train networks to infer 3D locations directly from a single 2D image36,40,68; informing posture variation within replicAnt with 3D kinematics data from live animals30,49 may yield networks which can infer the location of occluded key points with reasonable accuracy; and further annotation, for example automatically labelling minute morphological species differences or body size, can readily be appended. Recent advancements in style- and domain transfer could be combined with data produced by replicAnt to produce even stronger generalist30,31,66, or application-specific networks47,48,49. The domain gap may be narrowed further by introducing novel pre-trained networks, such as Segment Anything69 and DINO(v2)33, as feature-extraction backbones.

Although in principle applicable to any subject model, the experiments and results presented in this study focused primarily on terrestrial arthropods. Further functionality, such as a quadruped armature and models46,49, animation blueprints, and use of Unreal Engine’s soft body and fur simulation capabilities for realistic renderings are planned, to further decrease the entry barrier, and emphasise accessibility and modularity.

replicAnt provides a fertile testing ground to further our fundamental understanding of supervised learning. As an illustrative example, replicAnt can in principle generate arbitrarily large datasets. Not all images are created equal, however, and the control over environmental variability, combined with the ability for “mixed training”, provides an excellent opportunity to probe which image elements are most effective for accelerated network fitting. Ultimately, it is our hope that replicAnt represents a significant step towards porting machine-learning based computer vision tools to the field.

Methods

The generation of synthetic datasets with replicAnt can be divided into three steps: (i) 3D Subject Model creation and preparation; (ii) set-up of the generator in Unreal Engine; and (iii) parsing generator outputs into common machine learning data formats to train deep neural networks of choice. In the following sections, we first outline the general structure of the data generation process; application specific details are provided at the end. A glossary is provided in Supplementary Note 1, and detailed documentation and interactive Jupyter notebooks for data parsing and benchmarking are available on GitHub (https://github.com/evo-biomech/replicAnt).

3D subject models

In principle, 3D models used in replicAnt can come from any source. However, model fidelity is a primary determinant of the simulation-reality gap, and thus influences the performance that can be achieved. Practically, the required model fidelity depends on the desired application. For example, in applications which typically involve low-resolution footage, such as multi-animal tracking (Figs. 2–4), lower fidelity 3D models may suffice; networks used for high-resolution semantic segmentation (Fig. 6) or pose-estimation (Fig. 5), in turn, require models with higher fidelity. We used the open-source photogrammetry platform scAnt51 to produce high fidelity models, and Blender v3.1 to sculpt lower fidelity models. Irrespective of model origin, the model mesh may need to be cleaned, retopologised, and rigged prior to import in the generator. For all models used in this study, this process was completed using Blender v2.92 & 3.1.

Clean-up

All unconnected vertices and floating artefacts of the model mesh were deleted, and surfaces cleaned, using Blender’s native editing and sculpting tools. Holes were closed by collapsing surrounding vertices to a single point, and/or rebuilding the surrounding topology. The re-connected mesh regions were made seamless by projecting adjacent texture information onto the collapsed or newly created area, respectively; any overlapping vertices and self-intersecting faces were removed (for further details, see ref. 55).

Retopologising

To accelerate the data generation process and to allow for larger simulated digital populations, we decreased the mesh resolution of each model to between 1000 and 10,000 vertices, using Blender's native decimate modifier. The number of vertices was chosen such that the overall shape was preserved, but fine topology information, such as hairs and other surface detail, was removed; this information was captured instead by albedo and normal maps, generated for the retopologised meshes through texture baking from the original high-resolution input55.
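A minimal sketch of this step in Blender's Python API (bpy) is given below, assuming the model mesh is the active object and targeting roughly 5000 remaining vertices:

```python
# Sketch of the retopology step in Blender's Python API (bpy), assuming the model
# mesh is the active object; the ratio targets roughly 5000 remaining vertices
# while the overall shape is preserved.
import bpy

obj = bpy.context.active_object
target_vertices = 5000
ratio = min(1.0, target_vertices / len(obj.data.vertices))  # fraction of geometry to keep

mod = obj.modifiers.new(name="Decimate", type='DECIMATE')   # collapse-type decimation
mod.ratio = ratio
bpy.ops.object.modifier_apply(modifier=mod.name)

print(f"Vertices remaining: {len(obj.data.vertices)}")
```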

Rigging

In order to enable posture variation, models were rigged—each model was assigned a set of rigid segments referred to as bones, with individual bones connected through joints. The collection of bones and joints defines the model’s armature. In principle, users can assign an arbitrary number of virtual bones and joints, each with a specific range of motion. We provide a base armature template that was used throughout this work; it can readily be adapted to animal-specific needs (see Supplementary Fig. 1). The segment deformation associated with joint movement was restricted to proximal parts of each mesh segment using weight painting, an appropriate simplification due to the effectively rigid arthropod exoskeleton.

Model porting to Unreal Engine

Curated models, including materials, textures and the armature, were transferred from Blender to Unreal Engine 5 via the send-to-unreal Add-on (https://github.com/EpicGames/BlenderTools). Rigged meshes can also be imported directly into Unreal Engine 5, but may then require additional manual editing. Regardless of model origin, shaders, scaling, collision properties, and animation blueprints need to be assigned to the subject model after import. Examples of configured 3D subject models as well as detailed documentation on the model preparation process are available on GitHub (https://github.com/evo-biomech/replicAnt).

replicAnt

The core of replicAnt is a pre-configured Unreal Engine 5 project that includes: (i) example subject models; (ii) project configurations, referred to as levels within Unreal Engine 5; (iii) an asset library—a curated collection of 3D meshes which may populate the generated scenes; (iv) a set of basic materials—image maps that are blended with procedurally generated textures to define the appearance of scattered assets and the ground; and (v) High Dynamic Range panorama images (HDRIs), used as the environment background and to provide additional scene lighting.

These elements form the backbone for the dataset generation process, which can be controlled through a simple user interface; more advanced configuration is possible through editable blueprints, a visual and structured node-based alternative to the use of text-based source code. User-control is facilitated through four main components: (1) The User Interface editor widget (UI); (2) the content browser; (3) the outliner; and (4) the viewport.

The UI is divided into four tabs: General—to define the randomisation schedule of all generator elements, to control the balance between generation speed and output variability, to configure the desired output types and location, and to set the number of unique output samples; Subjects—to select 3D models for inclusion in the digital population, and to specify subject population size and appearance randomisation; Environment—to select HDRIs and to specify the ground mesh resolution; and Debug—to control various stalling periods related to stability, and the randomisation seed. The seed value controls the randomisation routines of all elements and so enables repeatability. For example, identical scenes can be populated with different subjects, so producing datasets of different animals in equal environments.

The Content Browser is akin to a file explorer, and provides access to the full project data structure, all assets and blueprints. Additional content, such as textures or 3D models, can be imported, and key individual components of the data generator can be modified (for additional documentation, refer to https://docs.unrealengine.com/5.0/). The Outliner displays and provides access to all elements included in the current project configuration, for example the generated subject population, the scene camera, the ground plane, and lights. Users can add, edit, or remove randomisable elements such as asset scatterers, light blocking elements, and decal generators (for details, refer to the respective sections below).

Finally, the viewport provides a 3D preview of the current project state. By selecting “change preview environment” from within the Environment tab of the UI, or by selecting “Test” at the bottom of any of the UI tabs, users can directly assess how the current settings translate into the randomised scene generation process, through either randomising the lighting setup or executing a complete randomisation iteration respectively. The viewport also provides a window into active dataset generation processes: it displays snapshots of output passes, and indicates both the time elapsed and an estimated time to completion.

Dataset generation then proceeds through simulation of scenes, defined by a series of randomisable elements as detailed below. In order to generate datasets with maximal variability in minimal time, the frequency with which scene elements are updated is tied to the time it takes to execute each update. To rationally guide this process, we defined a hierarchy from computationally most to least demanding randomisation components: (i) ground plane, (ii) asset, (iii) subject placement and pose, (iv) material, lighting, camera intrinsics and extrinsics, and image post processing. Elements lower in this hierarchy influence elements upstream, and the frequency with which each element is updated can be user-controlled (Fig. 1). We now briefly explain the basic scene generation at each of these hierarchical element levels.
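Conceptually, this schedule resembles a nested loop in which expensive elements persist across many samples while cheap elements are re-randomised every time; the sketch below uses placeholder functions and is not the blueprint implementation:

```python
# Illustrative sketch of the hierarchical update schedule (placeholder functions;
# the actual logic is implemented as Unreal Engine blueprints). Expensive elements
# persist across many samples, cheap elements are re-randomised for every sample.
import random

def new_ground():
    return random.random()          # stands in for terrain regeneration

def new_assets(ground):
    return random.random()          # stands in for asset scattering

def new_subjects(ground):
    return random.random()          # stands in for subject placement and posing

def generate(n, ground_every=100, assets_every=20, subjects_every=5):
    ground = assets = subjects = None
    for i in range(n):
        if i % ground_every == 0:       # most expensive: regenerate the terrain
            ground = new_ground()
        if i % assets_every == 0:       # re-scatter 3D assets onto the ground
            assets = new_assets(ground)
        if i % subjects_every == 0:     # re-place and re-pose subjects
            subjects = new_subjects(ground)
        # materials, lighting, camera, and post-processing change on every sample
        yield dict(ground=ground, assets=assets, subjects=subjects,
                   lighting=random.random(), camera=random.random())

for i, scene in enumerate(generate(5)):
    print(i, scene)
```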

Environment

The environment, which we define as all scene elements aside from subjects, is procedurally built from a number of hierarchically linked modules of surface and material generators, asset scatterers, and a dynamic lighting system. In the first step, a ground plane is generated and tessellated to create a parameterisable surface of up to 5,120,000 triangles. All assets are later placed on this ground, initially created as a 2D surface plane. Height variation is then encoded through a three-channel RGB map. Each channel controls different aspects of the terrain generation process, a design choice which enables independent variation of terrain topology and terrain material. The RGB map is procedurally generated through a set of sub-modules which introduce mesh noise of variable granularity and geometric pattern. In order to maximise variation in terrain topology and material, the generated noise is blended with a curated library of monochromatic displacement maps. The red channel encodes the displacement map for the tessellated ground plane. High values (≥0.5) indicate a positive, and low values (≤0.5) a negative shift in height relative to the mean height, respectively; a value of 0.5 thus corresponds to mean ground plane height. The green channel encodes additional noise to blend different materials. The blue channel provides an opportunity for additional user-defined variation; it does not affect default randomisation routines.
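A minimal numpy illustration of how such a three-channel map could be composed is given below; the actual generator blends procedural noise of varying granularity with the curated displacement-map library:

```python
# Minimal numpy illustration of the three-channel terrain map; the actual generator
# blends procedural noise of varying granularity with a curated library of
# monochromatic displacement maps.
import numpy as np

rng = np.random.default_rng(seed=42)
size = 256

red = 0.5 + 0.25 * (rng.random((size, size)) - 0.5)   # height: 0.5 = mean ground level
green = rng.random((size, size))                       # noise used to blend materials
blue = np.zeros((size, size))                          # reserved for user-defined variation

terrain_map = np.stack([red, green, blue], axis=-1)    # H x W x 3, values in [0, 1]
print(terrain_map.shape, float(terrain_map[..., 0].mean()))
```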

Asset scatterers

In order to increase the variability of the environment, 3D assets—texture-less photogrammetry models of common objects—are placed on the terrain by a user-defined number of asset scatterers (all models used in this study are from Quixel AB, Epic Games). Each scatterer is assigned a number of assets, one of which is randomly selected per iteration (see “Methods—replicAnt”). Each scatterer is also assigned a material at random; assets spawned by the same scatterer thus share the same material, so that no separate texture information is required for each asset mesh. As a result, the project file size remains small, even though the number of meshes in the asset library is large. A number of scatterer presets are provided. The number of scatterers and their individual configuration can be set in the Outliner, which provides further control over key parameters of each scatterer: (i) the set of assets which may be scattered; (ii) the asset size range (all assets are by default of equal size, that is their largest length in X,Y,Z coordinate space is normalised); and (iii) the asset number range, defining the minimum and maximum number of instances of the drawn asset the scatterer may spawn per iteration, respectively.

Subject placement and posing

In each randomisation iteration, the generator places subjects drawn from a user-defined “population” at a randomised coordinate. Each subject then undergoes pose variation, which is randomised at two levels: mesh-interacting and mesh-independent posing. Key in this process is an animation blueprint (ABP_InsectBase), which specifies the list of body parts subject to mesh-interacting and mesh-independent posing, respectively, along with the joint types and the joint range of motion. Different subjects within a population can be assigned individual animation blueprints by generating “child instances” of the original “parent” blueprint.

During mesh-interacting posing, subjects are first scattered and then rotated around their geometric centre, once they are in proximity to the generated ground plane. If a subject intersects with the ground plane or a scattered asset, or its randomised location lies below the ground plane, the process is repeated until valid locations are determined for every subject. Each subject is moved along its down-vector until its mesh intersects with any scattered asset, previously spawned subject, or the ground plane. If a subject does not intersect with any mesh along its trace, a new centre-rotation is proposed, and the process is repeated until a valid rotation and resulting placement location are found. Possible end-point locations are determined via ray casts; the Unreal Engine 5 internal Full Body Inverse Kinematics solver (IKS) is then used to determine permissible joint angles for all body parts assigned in the respective animation blueprint. By default, mesh-interaction posing is performed for all leg segments, i.e. subjects generally stand on other meshes (see Supplementary Fig. 1). By restricting the number of solver iterations, however, some legs can also be left intentionally “mid-air”. In such instances, a random point is placed within reach of the most distal part of the IK chain of the respective leg, and the IKS is used to solve for an appropriate joint configuration.

Mesh-independent posing, in turn, controls the pose of other key “bones”, such as the head, antennae, or mandibles; the respective joints are each assigned a random angle within the permissible range.

Materials, lighting, camera and post-processing

Materials for the terrain, decal and asset layers are generated by independent material generators. Each generator combines randomly generated patterns with a curated library of image textures. Further textures can be added to the respective content directories. The selection of textures, pattern generation, and blending are controlled via the seed value defined in the debug tab of the UI controls.

The environment is then lit using a series of lighting elements: colour-filtered High Dynamic Range Images (HDRIs); a main directional light; randomly placed coloured spotlights, which cast multiple sharp and diffuse shadows, either from scattered assets onto the subjects, or from the subjects themselves onto the environment; light-blocking occlusion planes; and volumetric fog, which introduces diffuse lighting and reduces visibility. The array holding the HDRIs can be appended with external images, and the volumetric fog density can be adjusted within the environment tab of the UI; for example, users may wish to match the specific lighting conditions of an experimental setup, or simply increase lighting variation further. The number and colour of spawned spotlights are user-definable (see documentation on GitHub for details).

The combination of ground topology, assets, material and decal layers, subjects, and lighting fully defines a specific populated environment. In a next step, a simulated camera is randomly spawned and oriented such that it points towards a randomly selected subject, with a small random orientation offset; this offset ensures that the subject location relative to the image centre varies across images. Using this camera, an annotated sample image of the randomisation iteration can be captured; each unique camera perspective on an environment constitutes a scene. In addition to camera location and orientation (extrinsics), the generator can also randomise camera intrinsics, such as the simulated sensor size, focal length, aperture, and exposure. The camera randomisation step therefore presents a prime opportunity for computationally inexpensive variation, as many variable scenes can be extracted from a single populated environment. In a final step, an array of post-processing filters, providing control over colour temperature and tint, saturation and contrast, vignetting, and various types of grain, is applied. These filters only affect the image render pass (see “Generator outputs”). All camera and post-processing attributes are randomisable (see documentation on GitHub).
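
Because camera randomisation is where most of the cheap variation comes from, the sketch below shows one way such intrinsics and extrinsics could be drawn; all parameter names and ranges are illustrative assumptions rather than the generator's defaults.

```python
import random

def sample_camera(subject_xyz, rng):
    """Draw random extrinsics around a chosen subject, plus random intrinsics."""
    # Extrinsics: a position on a random sphere around the subject, aimed at it with a small offset.
    distance = rng.uniform(0.1, 2.0)                          # metres; illustrative range
    yaw, pitch = rng.uniform(0, 360), rng.uniform(-80, 80)    # degrees
    aim_offset = (rng.uniform(-5, 5), rng.uniform(-5, 5))     # degrees off-centre
    # Intrinsics: illustrative ranges only.
    return {
        "target": subject_xyz,
        "distance": distance,
        "yaw": yaw,
        "pitch": pitch,
        "aim_offset_deg": aim_offset,
        "sensor_width_mm": rng.uniform(4.0, 36.0),
        "focal_length_mm": rng.uniform(8.0, 105.0),
        "aperture_f": rng.choice([1.8, 2.8, 4.0, 5.6, 8.0]),
        "exposure_ev": rng.uniform(-2.0, 2.0),
    }

camera = sample_camera(subject_xyz=(0.0, 0.0, 0.0), rng=random.Random(1))
```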

Generator outputs

The generator produces two key outputs: annotation files and image passes. The annotation files comprise a single batch file, containing information relevant to the whole dataset, and one annotation file per scene, containing image-specific annotations. Image passes encode key scene information (see below); custom pass types can be added, and configured passes can be modified (see GitHub documentation). The desired image pass types, image resolution, compression, and output directory are specified in the general tab of the UI.

Annotation files

Each dataset is accompanied by a single human-readable batch file, which contains general information: the number of samples, the dataset name, the seed value, and the image pass dimensions. Additionally, it lists the IDs internally assigned to each subject within the population, along with each subject's class, which corresponds to the assigned subject model name, and its relative scale. This information is also used by the custom data parsers (see below).

For each iteration, the set of image passes is accompanied by a unique human-readable sample file. Sample files document information essential for 3D localisation and pose-estimation applications: the camera intrinsics and extrinsics, including the 3D camera location and rotation, the diagonal field of view, and the full view projection matrix. Additionally, sample files provide annotations for every simulated subject, such as the coordinates of subject bounding boxes, all 2D key point locations in pixel space, and all 3D key point locations relative to the camera coordinate system. Further custom annotations can be added.
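
A downstream parser therefore only needs to read the batch file and the per-scene sample files. The snippet below sketches such a reader; all key names are hypothetical placeholders rather than the actual schema (see the GitHub documentation for the real field names).

```python
import json
from pathlib import Path

def load_sample(path):
    """Read one per-scene annotation file; the key names below are hypothetical placeholders."""
    data = json.loads(Path(path).read_text())
    camera = data["camera"]            # intrinsics and extrinsics of the scene camera
    for s in data["subjects"]:         # one record per simulated subject
        bbox = s["bounding_box"]       # 2D bounding box in pixel space
        kp2d = s["keypoints_2d"]       # 2D key points in pixel space
        kp3d = s["keypoints_3d"]       # 3D key points relative to the camera
        yield s["id"], bbox, kp2d, kp3d, camera
```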

Image passes

By default, the generator is configured to write four image pass types per scene: image render, ID, depth, and normal passes. Each pass type encodes different key information.

The image render pass is an RGB colour rendering of the simulated camera view, including image noise, variation in exposure, and depth of field (see “Materials, lighting, camera and post-processing”). The level of compression can be fixed or randomised, and several image formats are available for selection in the general tab of the UI (JPG, PNG, EXR, BMP).

The ID pass encodes subject IDs by assigning all pixels of a subject a unique, subject-specific RGB colour value; all non-subject pixels are assigned a value of (0,0,0). Occlusion can thus be determined at pixel level for the full subject mesh, and for individual key points at the subsequent parser stage. Furthermore, the relative occupancy of the subject mesh within its 2D bounding box can be extracted. Information on occlusion, key point location, and bounding box occupancy is essential to exclude fully occluded subjects or individual key points from neural network fitting where desired. In our illustrative pose-estimation examples (see below), occluded key points were excluded from training; however, users can toggle this option on and off in the respective Jupyter notebook parsers (see GitHub documentation). The ID pass also forms the basis for segmentation maps (Fig. 6c, ID (segmentation) pass) used for run-length encoding, polygon encoding, or per-pixel encoding at the parser stage. ID pass images are always saved in uncompressed PNG format.
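
Because every subject maps to one unique RGB value, masks, visibility, and bounding-box occupancy can be recovered with plain array operations. The NumPy sketch below illustrates the principle used at the parser stage; the example colour value is an arbitrary assumption.

```python
import numpy as np

def subject_stats(id_pass, subject_rgb):
    """Return the binary mask, visible-pixel count, tight bounding box, and box occupancy
    for one subject, given the ID pass image (H, W, 3) and its unique RGB value."""
    mask = np.all(id_pass == np.asarray(subject_rgb), axis=-1)
    visible = int(mask.sum())
    if visible == 0:                       # fully occluded or outside the frame
        return mask, 0, None, 0.0
    ys, xs = np.nonzero(mask)
    x0, x1, y0, y1 = xs.min(), xs.max(), ys.min(), ys.max()
    occupancy = visible / float((x1 - x0 + 1) * (y1 - y0 + 1))
    return mask, visible, (x0, y0, x1, y1), occupancy

# Toy 4x4 ID pass in which subject (255, 0, 0) occupies two pixels.
id_pass = np.zeros((4, 4, 3), dtype=np.uint8)
id_pass[1, 1] = id_pass[1, 2] = (255, 0, 0)
mask, n_visible, bbox, occ = subject_stats(id_pass, (255, 0, 0))
```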

The depth pass is a monochromatic render which encodes the depth of each pixel in the scene, relative to the virtual camera plane. In combination with camera intrinsics/extrinsics and the ID pass, depth pass images can in principle be used to train networks for 3D semantic segmentation, size estimation, and depth inference. Depth pass images are always saved in uncompressed PNG format.

The normal pass encodes surface normals, required by some specialised pose-estimation applications40,68. Normal passes can be produced both in camera view space and in world space. We did not use normal passes in this work, but provide them to illustrate the modularity and extensibility of replicAnt. Normal pass images are always saved in uncompressed PNG format.

Data parsers

The combination of image passes and annotation files constitutes synthetic data which can be used to train a suite of machine learning-based computer vision models. In order to facilitate this process, five groups of data parsers were implemented in the form of documented Jupyter notebooks. These parsers translate the generated data into various data formats used by popular deep learning-based computer vision models: (i) YOLO compatible data4,57; (ii) three DeepLabCut20 and SLEAP16 compatible data parsers; (iii) COCO6,10 formatted data, including image masks for detection, single- and multi-animal pose-estimation, as well as semantic segmentation applications; (iv) MMSegmentation70 compatible data for segmentation training and benchmarking; and (v) a custom 3D pose-estimation data format which includes camera intrinsics and extrinsics.
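
As an example of the kind of conversion these parsers perform, YOLO labels store one line per object containing the class index and the box centre and size, normalised by the image dimensions. The function below is a minimal sketch of that conversion, not the provided notebook.

```python
def to_yolo_line(class_id, bbox_xyxy, img_w, img_h):
    """Convert a pixel-space (x_min, y_min, x_max, y_max) box into a YOLO label line:
    '<class> <x_centre> <y_centre> <width> <height>', all normalised to [0, 1]."""
    x0, y0, x1, y1 = bbox_xyxy
    xc = (x0 + x1) / 2.0 / img_w
    yc = (y0 + y1) / 2.0 / img_h
    w = (x1 - x0) / img_w
    h = (y1 - y0) / img_h
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

print(to_yolo_line(0, (100, 200, 180, 260), img_w=1024, img_h=1024))
```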

The data parsers use a number of common Python packages (including h5py, imutils, jupyter, json5, matplotlib, opencv-python, pandas, scikit-image, scikit-learn, and scipy) to interface with various deep learning frameworks, such as TensorFlow and Keras, Darknet, and OpenMMLab.

Applications

The synthetic data generated by replicAnt was used in conjunction with a set of machine learning tools to conduct multi-animal detection, tracking, pose-estimation, and semantic and instance segmentation. For each of these applications, we describe below (1) the curation of hand-annotated benchmark data; (2) the origin of the 3D models and the pertinent details of the synthetic data generation process; (3) the network training procedure; and (4) the performance characterisation. By necessity, these applications required choosing a specific network architecture for each use case. In choosing specific network architectures over others, we do not wish to imply their superiority; all choices merely serve as illustrative examples, and other options may be better or worse, depending on the specific use case. Our aim is not to determine the best network architecture, but to test the broad applicability and quality of the synthetic data generated by replicAnt.

Detection

Benchmark data

Two video datasets were curated to quantify detection performance: one in laboratory and one in field conditions. The laboratory dataset consists of top-down recordings of foraging trails of Atta vollenweideri (Forel 1893) leaf-cutter ants. The colony was collected in Uruguay in 2014, and housed in a climate chamber at 25 °C and 60% humidity. As Atta vollenweideri neither fall under European Directive 63/2010/EU, nor are considered a protected species under the Convention on International Trade in Endangered Species (CITES), no specific permits were required. All experiments were designed to minimise animal suffering. A recording box was built from clear acrylic, and placed between the colony nest and a box external to the climate chamber, which functioned as a feeding site. Bramble leaves were placed in the feeding area prior to each recording session, and ants had access to the recording box at will. The recorded area was 104 mm wide and 200 mm long. An OAK-D camera (OpenCV AI Kit: OAK-D, Luxonis Holding Corporation) was positioned centrally, 195 mm above the ground. While keeping the camera position constant, lighting, exposure, and background conditions were varied to create recordings with variable appearance: the “base” case is an evenly lit and well-exposed scene with scattered leaf fragments on an otherwise plain white backdrop (Fig. 2). Videos were captured from the OAK-D camera using the accompanying depthai Python package (v0.4.0.0). A “bright” and a “dark” case are characterised by systematic overexposure or underexposure, respectively, which introduces motion blur, colour-clipped appendages, and extensive flickering and compression artefacts. In a separate well-exposed recording, the clear acrylic backdrop was substituted with a printout of a highly textured forest ground to create a “noisy” case. Last, we decreased the camera distance to 100 mm at constant focal distance, effectively doubling the magnification and yielding a “close” case, distinguished by out-of-focus workers. All recordings were captured at 25 frames per second (fps).

The field dataset consists of video recordings of Gnathamitermes sp. desert termites, filmed close to the nest entrance in the desert of Maricopa County, Arizona, using a Nikon D850 and a Nikkor 18–105 mm lens on a tripod, at camera distances between 20 and 40 cm. All video recordings were well exposed, and captured at 23.976 fps.

Each video was trimmed to the first 1000 frames, and contains between 36 and 103 individuals. In total, 5000 and 1000 frames were hand-annotated for the laboratory (Fig. 2e) and field dataset (Fig. 3e), respectively: each visible individual was assigned a constant-size bounding box, with a centre coinciding approximately with the geometric centre of the thorax in top-down view. The size of the bounding boxes was chosen such that they completely enclosed the largest individuals, and was automatically adjusted near the image borders. A custom-written Blender Add-on aided hand-annotation: the Add-on is a semi-automated multi-animal tracker, which leverages Blender's internal contrast-based motion tracker, but also includes track refinement options and CSV export functionality55,56. Comprehensive documentation of this tool, and Jupyter notebooks for track visualisation and benchmarking, are provided on the replicAnt and BlenderMotionExport GitHub repositories55.

Synthetic data generation

Two synthetic datasets, each with a population size of 100, were generated from 3D models of Atta vollenweideri leaf-cutter ants. All 3D models were created with the scAnt photogrammetry workflow51. A “group” population was based on three distinct 3D models of an ant minor (1.1 mg), a media (9.8 mg), and a major (50.1 mg). To approximately simulate the size distribution of A. vollenweideri colonies, these models make up 20%, 60%, and 20% of the simulated population, respectively. A within-class scale variation of 33% was used, with default hue, contrast, and brightness variation of the subject materials (Fig. 2). A “single” population was generated using the major model only, with 90% scale variation but otherwise equal material variation settings.

A Gnathamitermes sp. synthetic dataset was generated from two hand-sculpted models: a worker and a soldier made up 80% and 20% of the simulated population of 100 individuals, respectively, with default hue, contrast, and brightness variation of the subject materials (Fig. 3). Both 3D models were created in Blender v3.1, using reference photographs.

Each of the three synthetic datasets contains 10,000 images, rendered at a resolution of 1024 by 1024 px, using the default generator settings as documented in the Generator_example level file (see documentation on GitHub). To assess how the training dataset size affects performance, we trained networks on 100 (“small”), 1000 (“medium”), and 10,000 (“large”) subsets of the “group” dataset (see Supplementary Table 2 for dataset sizes and splits). Generating 10,000 samples at the specified resolution took approximately 10 h per dataset on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super).

Additionally, five datasets containing both real and synthetic images were curated. These “mixed” datasets combine image samples from the synthetic “group” dataset with image samples from the real “base” case. The ratio of real to synthetic images across the five datasets varied from 10/1 to 1/100 (see Supplementary Table 3 for dataset sizes and splits).

Network training

We used the AlexeyAB darknet implementation of YOLOv44 as a detector, because it balances inference quality and speed, and is in widespread use. Each network was trained for 20,000 iterations to ensure convergence. The burn-in was set to 1000 iterations, the learning rate to \(10^{-4}\), and trained weights were saved every 1000 iterations. As the benchmark data contain recordings with variable subject magnification, we selected a YOLOv4 variant with adjusted anchors, enabling detections both small and large relative to the image size; this variant performed best in preliminary trials.

All training was performed on a computational cluster with compute nodes providing 16 CPU cores, 64 GB of RAM, and a single NVIDIA Quadro RTX 6000 GPU. A comprehensive list of trained networks is provided in the Supplementary Tables.

Evaluation

In order to evaluate detection performance, we retrieved the Average Precision (AP) over 13 confidence thresholds, equally spaced between 0.2 and 0.8. A detection was considered correct if the Euclidean distance between its centre and the centre of the corresponding ground truth detection was within 5% of the image width; multiple detections of the same object were removed by non-maximum suppression at run-time. To provide a single measure of overall detection performance, we report the mean of the individual APs retrieved for each unseen case (mAP).

We chose a centre-based precision definition over the traditional Intersection-over-Union (IoU), because different methods were used to assign bounding boxes across hand- and computer-annotated data: synthetically generated bounding boxes represent the smallest rectangle which includes all projected 2D key points in the rendered images; hand-annotated bounding boxes, in turn, are fixed-area, square bounding boxes, as the custom-written centre tracking tool55 does not report bounding box shape and instead extracts bounding box centres for ease of use. Thus, in our application, the IoU is secondary to the proximity of bounding box centres.
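
A simple greedy implementation of this centre-distance criterion, shown purely for illustration rather than as the evaluation script used here, might look as follows.

```python
import numpy as np

def match_detections(det_centres, gt_centres, img_width, tol=0.05):
    """Greedily match detection centres to ground truth centres; a pair counts as a true
    positive if their Euclidean distance is within tol * image width."""
    max_dist = tol * img_width
    unmatched_gt = list(range(len(gt_centres)))
    tp = 0
    for d in det_centres:
        if not unmatched_gt:
            break
        dists = [np.hypot(d[0] - gt_centres[g][0], d[1] - gt_centres[g][1]) for g in unmatched_gt]
        best = int(np.argmin(dists))
        if dists[best] <= max_dist:
            tp += 1
            unmatched_gt.pop(best)
    fp = len(det_centres) - tp        # detections without a matching ground truth centre
    fn = len(unmatched_gt)            # ground truth centres left unmatched
    return tp, fp, fn
```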

The AP was computed as defined in the official scikit-learn implementation71: it summarises the precision-recall curve as a weighted mean of the precision at each threshold, with the increase in recall from the previous threshold acting as weight:

$$\mathrm{AP}=\sum_{n}\left(R_{n}-R_{n-1}\right)P_{n}$$
(1)
$$\mathrm{mAP}=\frac{1}{M}\sum_{m=1}^{M}\mathrm{AP}_{m}$$
(2)

Here, \(P_n\) and \(R_n\) are the precision and recall at the \(n\)th threshold, and \(\mathrm{AP}_m\) is the AP of the \(m\)th of \(M\) unseen cases. Thresholds are traversed in decreasing order, and the recall \(R_{n-1}\) at the first (highest) threshold is set to 0; at the maximal threshold, no detections are returned, so that the precision \(P_0(R_0)\) is equal to unity by definition. This calculation differs from computing the area under the precision-recall curve with the trapezoidal rule, which relies on linear interpolation and can result in inflated performance estimates71.
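
A direct implementation of Eqs. (1) and (2), assuming precision and recall have already been computed at decreasing confidence thresholds, is sketched below; it mirrors the summation rather than reproducing the scikit-learn source.

```python
def average_precision(precisions, recalls):
    """Eq. (1): AP = sum_n (R_n - R_{n-1}) * P_n, with the initial recall set to 0.
    Inputs are ordered from the highest to the lowest threshold, so recall is non-decreasing."""
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

def mean_average_precision(aps):
    """Eq. (2): the mean of the per-case APs."""
    return sum(aps) / len(aps)

ap = average_precision([1.0, 0.9, 0.8, 0.7], [0.0, 0.3, 0.6, 0.8])   # 0.65
```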

Five-fold cross-validation was performed for all trained networks, with an 80/20 data split. Because neighbouring frames are highly similar in videos recorded at a frame rate that is high relative to the average movement speed of individuals, validation based on withheld frames alone can artificially inflate accuracy. To avoid this inflation, accuracy was instead assessed by computing the mAP of networks tested only on images outside the original recording domain.

Tracking

Benchmark data

To evaluate multi-animal tracking performance, we used the benchmark datasets curated for the detection experiments; because they were annotated with the BlenderMotionExport Add-on55, they also provide individual identities across frames. For the laboratory dataset, we used the annotated “base”, “dark”, “bright”, and “noisy” videos, which include between 63 and 103 ant workers. For the field dataset, we used the annotated recording of 49 Gnathamitermes sp. individuals.

Synthetic dataset generation and automated tracking

In principle, sufficiently precise detectors can be used to build simple yet robust and performant detection-based trackers. By retrieving detections for each processed frame, a simple buffer-and-recover tracker can automatically associate detections in adjacent frames to produce coherent tracks and preserve individual identities across frames. The best performing detector models for laboratory and field recordings, trained exclusively on synthetically generated images of A. vollenweideri and Gnathamitermes sp., respectively, were used for detection-based tracking (see 2). To facilitate tracking, we introduce an open-source Blender Add-on, OmniTrax56. The best performing YOLO detector models were imported into OmniTrax (v0.2.1). OmniTrax then uses the loaded network to retrieve detections for each frame, and assigns detections across frames to specific individuals; tracks are produced automatically and without further user intervention (for further information, refer to the OmniTrax implementation and ref. 56). For all tracking, we used the default tracker and Kalman filter settings, a detector input size of 1088 × 1088 px, a confidence threshold of 0.5, and a non-maximum suppression threshold of 0.45 for overlapping detections. Detections with bounding boxes smaller than 20 pixels were automatically excluded from track association.
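
Conceptually, such a buffer-and-recover tracker associates each new detection with the nearest active track, keeps unmatched tracks alive for a few frames, and spawns new tracks for unmatched detections. The simplified sketch below illustrates this logic; it is not the OmniTrax or Kalman filter implementation.

```python
import numpy as np

def associate(tracks, detections, max_dist=50.0, max_skipped=10):
    """Greedy nearest-neighbour association of detections (list of (x, y)) to active tracks.
    `tracks` is a dict: id -> {"pos": (x, y), "skipped": int}. Unmatched tracks are buffered
    for up to `max_skipped` frames; unmatched detections start new tracks."""
    unmatched = list(range(len(detections)))
    for tid, tr in list(tracks.items()):
        if unmatched:
            dists = [np.hypot(detections[i][0] - tr["pos"][0],
                              detections[i][1] - tr["pos"][1]) for i in unmatched]
            best = int(np.argmin(dists))
            if dists[best] <= max_dist:
                tr["pos"] = detections[unmatched.pop(best)]
                tr["skipped"] = 0
            else:
                tr["skipped"] += 1
        else:
            tr["skipped"] += 1
        if tr["skipped"] > max_skipped:
            del tracks[tid]                         # terminate stale tracks
    next_id = max(tracks, default=-1) + 1
    for i in unmatched:                             # recover: spawn new tracks
        tracks[next_id] = {"pos": detections[i], "skipped": 0}
        next_id += 1
    return tracks
```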

Evaluation

To assess tracker performance, the tracks produced by the detector networks were compared to the hand-annotated ground truth of each frame via the Multiple Object Tracking Accuracy (MOTA)59. Calculating the MOTA requires the definition of a maximum distance \(d_{max}\) between the ground truth detection centres and the inferred tracks, beyond which tracks are no longer considered correct. We chose a \(d_{max}\) of 50 px at 4k resolution, equivalent to a 2% deviation relative to the frame size. The MOTA then combines three distinct error types: False Negatives (\(\mathrm{FN}_t\))—no detection was registered at the location of a ground truth track; False Positives (\(\mathrm{FP}_t\))—a detection was registered in the absence of a ground truth track within \(d_{max}\); and ID switches (\(\mathrm{IDS}_t\))—the identity of an inferred track does not correspond to the identity of the ground truth track previously associated with that inferred track. Provided that the identity of an object in the ground truth track at some frame \(t\) is matched with the identity of an inferred track, we keep track of this correspondence. If, at a later frame \(t+i\), the ground truth track is associated with a different inferred track, this occurrence is counted as an ID switch. The new identity is then considered “correct” for all subsequent frames, and forms the basis for the identification of any further identity switches. ID switches thus capture “true” identity switches, but also changes in identity due to fragmented tracks that were terminated early and restarted with a new ID. The MOTA is then unity minus the sum of all \(\mathrm{FN}_t\), \(\mathrm{FP}_t\), and \(\mathrm{IDS}_t\) errors, divided by the sum of all detections \(\hat{y}_t\) in the ground truth tracks:

$$\mathrm{MOTA}=1-\frac{\sum_{t}\left(\mathrm{FN}_{t}+\mathrm{FP}_{t}+\mathrm{IDS}_{t}\right)}{\sum_{t}\hat{y}_{t}}$$
(3)

A maximum MOTA score of unity thus implies that all instances were detected in every frame, all were associated with the correct tracks, and no false positives were produced. Multiple-object tracking evaluation can also yield falsely identified ID switches, \(\mathrm{IDS}_t\), due to correspondence problems between IDs in close proximity. We therefore keep track of all relevant tracks before and after overlap events, defined as two or more ground truth tracks lying closer together than \(d_{max}\), to check whether apparent ID switches merely reflect a change in which ground truth detection is closest, while identities are in fact retained correctly after the overlap event. Accordingly, when assessing overlapping tracks, track identities are only evaluated before and after the event, so that incorrectly identified identity switches are suppressed. If, however, the number of detections is underestimated or overestimated during overlap events, those occurrences are still counted towards \(\mathrm{FN}_t\) or \(\mathrm{FP}_t\), respectively.
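
Given per-frame error and ground truth counts, Eq. (3) reduces to a few lines; the sketch below computes the metric itself and does not include the overlap handling described above.

```python
def mota(fn_per_frame, fp_per_frame, ids_per_frame, gt_per_frame):
    """Eq. (3): 1 - (sum of false negatives, false positives, and ID switches)
    divided by the total number of ground truth detections."""
    errors = sum(fn_per_frame) + sum(fp_per_frame) + sum(ids_per_frame)
    return 1.0 - errors / float(sum(gt_per_frame))

score = mota(fn_per_frame=[1, 0, 2], fp_per_frame=[0, 1, 0],
             ids_per_frame=[0, 0, 1], gt_per_frame=[50, 50, 50])
```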

Pose-estimation

Benchmark data

Two pose-estimation datasets were procured. Both datasets used first instar Sungaya inexpectata (Zompro 1996) stick insects as a model species. As Sungaya inexpectata do not fall under European Directive 63/2010/EU, nor are considered a protected species under the Convention on International Trade in Endangered Species (CITES), no specific permits were required. All experiments were designed to minimise animal suffering and are unlikely to cause harm. Recordings from an evenly lit platform served as representative of controlled laboratory conditions (Fig. 5e); recordings from a hand-held phone camera (Fig. 5h) served as an approximate example of serendipitous recordings in the field.

For the platform experiments, walking S. inexpectata were recorded using a calibrated array of five FLIR blackfly colour cameras (Blackfly S USB3, Teledyne FLIR LLC, Wilsonville, Oregon, U.S.), each equipped with 8 mm c-mount lenses (M0828-MPW3 8MM 6MP F2.8-16 C-MOUNT, CBC Co., Ltd., Tokyo, Japan). All videos were recorded via SpinView (v2.7.0.128) at 55 fps, and at the sensors’ native resolution of 2048 px by 1536 px. The cameras were synchronised for simultaneous capture from five perspectives (top, front right and left, back right and left), allowing for time-resolved, 3D reconstruction of animal pose via DeepLabCut20 (DLC) and Anipose72.

The handheld footage was recorded in landscape orientation with a Huawei P20 (Huawei Technologies Co., Ltd., Shenzhen, China) in stabilised video mode: S. inexpectata were recorded walking across cluttered environments (hands, lab benches, PhD desks, etc.), resulting in frequent partial occlusions, magnification changes, and uneven lighting, thus creating a more varied pose-estimation dataset. Representative frames were extracted from the videos using DLC-internal k-means clustering20. Forty-six key points were subsequently hand-annotated in 805 and 200 frames for the platform and handheld case, respectively, using the DLC annotation GUI (see Supplementary Tables 7–9 for additional details regarding dataset splits and composition).

Synthetic data

We generated a synthetic dataset of 10,000 images at a resolution of 1500 by 1500 px, based on a 3D model of a first instar S. inexpectata specimen, generated with the scAnt photogrammetry workflow51. Generating 10,000 samples took about three hours on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super). We applied 70% scale variation, and enforced hue, brightness, contrast, and saturation shifts, to generate 10 separate sub-datasets containing 1000 samples each, which were combined to form the full dataset (see Supplementary Table 6 for details).

Network training

DeepLabCut20 version 2.1, built with TensorFlow 2.0 and CUDA 11.2, was used for markerless pose-estimation by way of example. Other excellent choices, such as SLEAP16 or DeepPoseKit17, exist and are compatible with the provided data parsers. The best pose-estimation network may depend on the use case, and should thus be chosen by the end user.

For all experiments, we used a ResNet101 backbone pre-trained on ImageNet9, with the skeleton configuration and hierarchy as outlined in the base armature (see Supplementary Fig. 1). All networks were trained for 800,000 iterations with a batch size of two; trained weights were saved every 50,000 iterations. These parameters were chosen to prevent excessive overfitting; no further decrease in validation error was observed after 800,000 iterations in preliminary trials. Networks pre-trained on synthetic data were then refined with a small number of real frames for an additional 800,000 iterations (see Supplementary Table 7 for dataset sizes and splits). We used DeepLabCut's default training and augmentation parameters, with image mirroring disabled, to ensure comparability between the trained networks, rather than attempting to tailor network and parameter choices to any single dataset. All training was performed on a dedicated computational cluster with compute nodes providing 16 CPU cores, 32 GB of RAM, and a single NVIDIA Quadro RTX 6000 GPU. For a full list of trained networks, refer to Supplementary Tables 7–9.

Evaluation

Pose-estimation performance was quantified as the mean pixel error \(\Delta\bar{y}\) and the mean relative percentage error \(\delta\bar{y}\). The pixel error reported by DLC, \(\Delta y\), is the Euclidean distance between the ground truth annotated key point coordinate \(\hat{y}\) and the inferred key point coordinate \(y\), averaged across all inferred key points with a sufficient confidence value; we chose DLC's default confidence threshold of 0.6. In order to derive a relative error metric, we determined the body length \(l\) in pixels as the distance between the most distal points on the head and abdomen in three frames of each ground truth video. Dividing the mean pixel error \(\Delta\bar{y}\) by this length proxy yields a resolution-independent measure of pose-estimation performance, expressed as a percentage via:

$$\delta\bar{y}=\frac{100}{n}\sum_{n}\frac{\Delta\bar{y}_{n}}{l}$$
(4)
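
A sketch of Eq. (4), assuming the per-key-point pixel errors and the body-length proxy have already been extracted:

```python
import numpy as np

def relative_error_percent(pixel_errors, body_length_px):
    """Eq. (4): mean pixel error divided by the body length in pixels, as a percentage."""
    return 100.0 * np.mean(pixel_errors) / body_length_px

delta_y_bar = relative_error_percent(pixel_errors=[3.2, 4.1, 2.7], body_length_px=180.0)
```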

We performed five-fold cross-validation for all trained networks, with an 80/20 split between training and withheld data.

Semantic segmentation

Benchmark data

Semantic and instance segmentation are only rarely used in non-human animal research, partially due to the laborious process of curating sufficiently large annotated datasets. replicAnt can produce pixel-perfect segmentation maps with minimal manual effort. In order to assess the quality of the segmentations inferred by networks trained with these maps, semi-quantitative verification was conducted using a set of macro-photographs of Leptoglossus zonatus (Dallas, 1852) and Leptoglossus phyllopus (Linnaeus, 1767), provided by Prof. Christine Miller (University of Florida) and Royal Tyler (Bugwood.org; see Fig. 6d–f). For further qualitative assessment of instance segmentation, we used laboratory footage and field photographs of Atta vollenweideri provided by Prof. Flavio Roces (Fig. 6g). More extensive quantitative validation was infeasible, due to the considerable effort involved in hand-annotating larger datasets on a per-pixel basis.

Synthetic data

We generated two synthetic datasets from a single 3D scanned Leptoglossus zonatus (Dallas, 1852) specimen: one using the default pipeline, and one with additional plant assets, spawned by three dedicated scatterers. The plant assets were taken from the Quixel library and include 20 grass and 11 fern and shrub assets. Two dedicated grass scatterers were configured to spawn between 10,000 and 100,000 instances; the fern and shrub scatterer spawned between 500 and 10,000 instances. A total of 10,000 samples were generated for each sub-dataset, leading to a combined dataset comprising 20,000 image render and ID passes. The addition of plant assets was necessary because many of the macro-photographs contained truncated plant stems or similar fragments, which networks trained on the default data struggled to distinguish from insect body segments. The ability to simply supplement the asset library underlines one of the main strengths of replicAnt: training data can be tailored to specific use cases with minimal effort.

For an additional qualitative demonstration of instance and semantic segmentation, we used the image render and ID passes of the Atta vollenweideri “group” dataset also used for detection (see “Synthetic data generation” and Fig. 2a–c).

Network training

We trained three different semantic segmentation networks: Mask R-CNN6, UperNet with SWIN Transformers63, and PSPNet64. All networks use ResNet101 backbones, and were pre-trained on ImageNet9. We used the official Matterport implementation of Mask R-CNN6. For both PSPNet and UperNet training, we used the MMSegmentation70 implementation, favoured for its versatility and comprehensive architecture support. The generated synthetic data were converted to match the required file format and folder structure conventions, using separate data parsers for the COCO (Common Objects in COntext) and MMSegmentation annotations (see GitHub).

In order to leverage Mask R-CNN's ability to provide both semantic and instance segmentations, we used the COCO data parser to train a separate Mask R-CNN model with a ResNet101 backbone, using otherwise identical training parameters and the “group” dataset also used for detection and tracking (see “Results—Detection”). In all cases, the segmentation problem is treated as a binary task: class labels of zero or unity are assigned to pixels or polygon areas attributed to the background or to any subject, respectively. All networks were trained for a total of 160 epochs, and all training and evaluation was performed on a desktop workstation with a 14-core CPU, 64 GB of RAM, and an NVIDIA RTX 2080 Ti GPU with 11 GB of VRAM. The full configuration and training schedule are provided with the Semantic And Instance Segmentation Datasets and Trained Networks via Zenodo.

Evaluation

To provide an indicative quantitative performance metric, we hand-annotated binary masks for the images shown in Fig. 6d, e, and computed the Average Class-wise Recall (ACR):

$$\mathrm{ACR}=\frac{100}{c}\sum_{i}^{c}\sum_{j}^{n}\frac{\hat{y}_{j}-\left(\mathrm{FP}_{i,j}+\mathrm{FN}_{i,j}\right)}{\hat{y}_{j}}$$
(5)

Here, \(\mathrm{FP}_{i,j}\) is the number of false positives—pixels falsely attributed to class \(i\) in image \(j\); \(\mathrm{FN}_{i,j}\) is the number of false negatives—pixels of class \(i\) falsely attributed to other classes; and \(\hat{y}_j\) is the total number of pixels in the ground truth mask for each class. The factor of 100 converts the result into a percentage score, which quantifies the fraction of correctly identified pixels.
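
For the binary case used here (background versus subject), Eq. (5) can be computed directly from a predicted and a ground truth mask; the values in the toy example below are arbitrary.

```python
import numpy as np

def average_classwise_recall(pred_mask, gt_mask):
    """Eq. (5) for the binary case: per class, subtract false positives and false negatives
    from the ground truth pixel count, normalise, and average over both classes."""
    scores = []
    for cls in (0, 1):                              # background and subject
        gt = gt_mask == cls
        pred = pred_mask == cls
        fp = np.logical_and(pred, ~gt).sum()        # pixels falsely attributed to this class
        fn = np.logical_and(~pred, gt).sum()        # pixels of this class attributed elsewhere
        total = gt.sum()
        scores.append((total - (fp + fn)) / float(total))
    return 100.0 * np.mean(scores)

gt = np.zeros((64, 64), dtype=np.uint8); gt[20:40, 20:40] = 1
pred = np.zeros_like(gt); pred[22:40, 20:42] = 1
acr = average_classwise_recall(pred, gt)
```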

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.