Artificial neural networks (ANNs) were first developed to understand biological behaviour and mechanisms of cognition1. Turing believed that a crucial measure of artificial intelligence (AI) is the ability to mimic the way in which complex behaviour becomes organized in infants2. Given recent technical advancements in computing and AI as well as theoretical advancements in infant learning, it may be possible to use machine and deep learning techniques to study how infants transition from loosely structured exploratory movement to more organized, intentional action. Thus far, such methods have focused on analyzing spontaneous movements3 and distinguishing fidgety from non-fidgety movements4.

Though early infant movement is chaotic, meaningful patterns emerge as infants adapt to external perturbations and constraints through dynamic interaction between brain, body and environment5,6. However, little is known about the mechanisms by which infants begin to intentionally act on functional relationships with their environment. Laws governing bidirectional interaction between infant and environment are lacking, and the roots of conscious, coordinated, goal-directed action remain largely unexplored6,7.

A paradigm designed a half century ago to study infant memory and learning provides an experimental window into the formation of human agency, action towards an end6. In this so-called mobile conjugate reinforcement (MCR) paradigm, Rovee et al. connected a ribbon between an infant’s ankle and a mobile suspended over the infant’s crib8,9. Conjugate reinforcement refers to the sights,noises, and sensations due to mobile movement all being ostensibly dependent on and in proportion to the magnitude and rate of infant action. In short, the thinking was that the more the infant moved, the more ‘reward’ the mobile provided, stimulating further infant movement. Infants moved the connected leg at much higher rates compared to baseline, which Rovee and Rovee8 interpreted as reinforcement learning. However, mounting evidence suggests that rather than being rewarded by mobile stimulation per se, the increase in infant movement is driven by infant detection of the self~mobile relationship6,9,10,11,12,13,14,15,16,17. The key variable manipulated in MCR is the infant’s functional connection to the world - transforming the infant from a disconnected observer to a connected actor. Bidirectional, coordinated information exchange through coordination is thought to generate meaning and create the opportunity for infant discovery of agency6,10,18,19.

If infants do in fact discover that they can ‘make the world behave’ in MCR, dynamical analysis should expose the mechanisms of the discovery process. A necessary step in studying the development of sentient agency is being able to detect structural changes (in time, space and function) related to goal-directedness and to differentiate between exploratory and goal-directed action. One hypothesis is that the moment of agentive realization (‘Aha! moment’) constitutes a kind of phase transition marked by sudden changes in activity rate, coordination and variability10,18. Given recent development of dynamical tools to identify infant agentive discovery in the infant~mobile paradigm and related results which support the notion of infant agentive discovery as a phase transition dependent on tight organism~environment coordination20, it may be possible for AI systems to automatically detect these and/or other changes in infant movement patterns reflecting detection of baby~world causal relationships. Though the target measure in most infant contingency studies is movement rate, some studies have found that infants modify multiple features of movement including amplitude, timing21 and inter-joint coordination22,23 while exploring and exploiting their functional relationship with the mobile. AI tools may be particularly suited to deal with the complexity and subtleties of infant movement and, more generally, agent~object interaction in 3D space.

A variety of machine and deep learning methods can be implemented for pose recognition using video recordings of infant movement. For example, McCay et al.24 extracted posed-based features from video recordings to develop an automatic and independent method to diagnose cerebral palsy (CP) in infants. They used a freely available library called OpenPose25 to obtain skeletal joint coordination during sequences of movement in 12 infants up to seven months of age26. That study achieved nearly 92% accuracy (two classes) using machine learning techniques (k-Nearest Neighbor (kNN) and Linear Discriminant Analysis (LDA)) and 91.7% accuracy using a fully connected convolutional neural network (FCNet). Additionally, Tsuji et al.27 constructed a neural network with a stochastic structure that was able to distinguish between normal and abnormal infant movements with up to 92.2% accuracy. (See also3,4).

In general, deep learning structures that use convolutional neural networks (CNNs) have effectively achieved state-of-the-art accuracy for classification of adult action28,29,30. In order to complete CNN feature extraction and classification automatically, regions of an output feature map are pooled. Though typically employed to reduce the computational cost of the model, such poolings may also result in a loss of important information. Two common approaches are max or mean pooling. In max pooling, a filter (a 2 × 2 grid, for example) is slid over the output feature map and only the maximum value in the grid area is retained. (In the example of a 2x2 filter, output data are reduced from four values to one). Although a useful means to simplify the network, it is impossible to know from poolings alone where and how many times a filtered feature is encountered in the data. A new form of neural network, the capsule network (CapsNet) was developed to address this issue31,32. Using groups of artificial neurons (i.e., mathematical functions designed to model biological neurons that encode visual entities and to recognize the relationships between these entities), CapsNets are designed to model part-whole hierarchical relationships explicitly. CapsNet encapsulates artificial neurons in its vector structure to arrange the first layer (primary capsules) and uses a novel Dynamic Routing (DR) procedure to create a perfect route between this layer and subsequent layers (parent capsules). In an image classification task, a hierarchical relationship is built into an object in the image so that it is possible to interpret features and determine which part of any image belongs to a particular object. For example, DR might allow a CapsNet to not only assess whether elements of a face (e.g., eyes, lips, nose) are present but also whether these elements are realistically situated in relation to one another. Lastly, CapsNets can be trained more efficiently than traditional CNNs. Because CNNs cannot handle rotational invariance, they must be trained on large amounts of input data which have been augmented by many combinations of transformations (e.g., rotation, zooming, cropping, inversion) to be able to classify new data accurately. Critically, since CapsNets can handle rotational invariance, they can be trained with fewer samples and may be particularly suited for infant action recognition for which large datasets are difficult to acquire33.

While techniques in motion identification, reconstruction, and analysis for automatic recognition of specific human activities continue to improve34,35,36, classifying human action patterns is challenging because these patterns often involve both temporal and spatial characteristics37. Joints connect different segments in the human body as an articulated system, and human actions comprise the continuous development of the spatial structure of these segments34. Though the majority of AI action recognition research focuses on adult movement, researchers have begun developing systems to automatically analyze movement of paediatric populations, including infants3,4,38,39. However, automatic analysis of infant behaviour is complex since data are often captured in uncontrolled natural settings while the infant is freely moving and interacting with a variety of objects40. Handling information such as infants’ physical variations, lighting changes, and examiner involvement results in a lack of robustness in classification for markerless capture of movement data (i.e., estimating 3D limb position and movement from video recordings). Optical flow, frequency domain, and background removal are techniques commonly used to deal with these challenges41,42,43. While pose estimation from video recordings can deliver high accuracy, the extracted image sequences provide a large amount of data, but at a high computational cost.

Alternatively, reconstructed 3D skeleton data from marker-based motion capture (MoCap) systems have been shown to be dynamically robust and have anti-interference properties for automatic action classification44. Although consistent positioning of physical sensors (i.e., visual markers, accelerometers, gyroscopes) across infant participants is difficult, as joint landmarks are often obscured by fat, and the application of sensors may modify infant behaviour, skeletal joint information extracted from MoCap systems that use physical markers provides high temporal resolution and extremely high accuracy45,46. For example, Yu et al.47 used an adaptive skeleton definition to translate and rotate the virtual camera’s viewpoint to generate a new coordinate system. This produced a robust and adaptive neural network for automatic, optimal spatiotemporal representation. In many respects, action recognition is more straightforward using skeletal joint information than RGB video imagery, and hence is preferred here.

The data used in the present work were obtained from a baby~mobile experiment which quantitatively identified moments of agentive discovery and explored underlying mechanisms through analysis of infant motion, mobile motion and their coordination dynamics20,48. Movement data were collected using a Vicon 3D MoCap system. Given the nature of this experiment and the fact that the MoCap data provide exact infant joint locations, several machine and deep learning approaches for classifying pose-based features are proposed and evaluated here. The first main objective is to classify infant movement across different experimental stages, ranging from spontaneous activity to reactions to an externally driven moving stimulus (here a mobile) to make the mobile move to losing control over the mobile. Successful classification would indicate that the functional context and infant sensitivity to context drive changes in the structural features of infant movement. The classification accuracy of different experimental stages will highlight the pros and cons of applying different machine and deep learning approaches in infant studies and, at the same time, expose whether behaviour is more constrained and characteristic within certain contexts, leading to higher accuracy classification. The second main aim is to study temporal features of infant behaviour using sliding windows. To reiterate, the infant’s discovery that ‘I can make the mobile move’ emerges from a coordinative dance between organism and environment which unfolds in time20. Assessing classification accuracy using sliding windows will provide new insight to the dynamic topological evolution in infants exploring and discovering their relationship to the world. With our unique dataset and in the context of related evidence of infant agentive discovery, we examine and optimize machine and deep learning methods to characterize the flow of structural change in infant movement reflective of cognitive processes and behavioural adaptation within and across coordinative contexts. We demonstrate that approaches based on deep learning are well-suited for working with pose-based data. In particular, we show that CapsNet-based approaches, such as 1D-Capsule Network (1D-CapsNet) and 2D-CapsNet preserve the hierarchy of features and avoid information loss in the model’s architecture by substituting pooling with dynamic routing especially when fused features were employed. More to the point, we demonstrate that AI systems provide significant insight into the early ability of infants to actively detect and engage in a functional relationship with the environment.



Sixteen reflective markers were placed on 3-4-month-old babies (N = 16) to track infant movement. All experimental protocols were approved by Florida Atlantic University (FAU) and Florida Department of Health (DOH) Internal Review Boards. This experiment was performed in accordance with FAU and Florida DOH guidelines and regulations. Informed consent was obtained from a parent and/or legal guardian of infant for both study participation and publication of identifying information/images in an online open-access publication. The marker arrangement can be seen in Fig. 2a. In each trial, the infant was placed in a crib face-up, with a mobile suspended above. The mobile consisted of two colourful blocks attached to the ends of a wooden arm. Seven infrared cameras surrounding the crib tracked each marker’s position at a rate of 100 Hz using a Vicon MoCap system. Three video cameras placed at different angles were also used to record each session. The experiment consisted of four experimental stages.

In the spontaneous baseline (2 min. long) the mobile was stationary. In the uncoupled reactive baseline (2 min.), the experimenter triggered mobile movement independent of infant movement. Two strings were connected to the sock of one foot (the trigger foot, visible in Fig. 2b), commencing the tethered phase (5-6 min.). The side of the body selected as the trigger foot was randomized. When the strings connected to the trigger foot were pulled, the mobile rotated. Infant trigger foot movement was digitally translated into rate of mobile rotation. The strings were then detached, leaving the mobile stationary once again constituting an untethered phase (2 min.). These four stages respectively measure infant activity (1) when there was no mobile motion and infant activity was spontaneously generated, (2) when mobile motion was an externally driven stimulus that was not affected by infant motion, probing the infant’s reaction to the raw stimulation a moving mobile provides, (3) when mobile motion was a direct result of infant motion, and (4) when infant motion no longer resulted in mobile motion (Fig. 1). An illustration of the extracted 3D infant skeleton is shown in Fig. 2c.

Figure 1
figure 1

(a) Infant marker layout. (b) Infant in crib with mobile above. A 3D reconstruction of the markers is overlaid on top of the video recording. All coloured lines represent the virtual skeleton of the ‘baby-mobile object’ being tracked. The coloured dots are the reflective markers. (c) A 3D representation of another infant without the video feed.

Figure 2
figure 2

Schematic of the experimental paradigm. The experimental stages which manipulate the infant’s functional context proceed from left to right. See the Experiment description.

Pipeline overview

An overview of the feature extraction and classification is shown in Fig. 3. As will be described in the following section, we first performed preprocessing to ensure the accuracy of the timeseries for the keypoints¸ the set of 3D movement markers extracted by the MoCap system. Important information was then extracted from preprocessed sequences categorized into distinct experimental stages. In the next step, we used a spherical coordinate system to segment the 3D joints into an n-bin histogram. A histogram-based approach was employed to create pose-based feature sets called Joint Displacements (JDs) to feed as inputs to the networks for experimental stage classification. The following sections of the paper explain each of these steps in detail. Finally, the machine and deep learning approaches tested here are described and justified.

Figure 3
figure 3

Overview of the feature extraction and classification steps.

Dataset description and preprocessing

The input data for baby movement pattern recognition are a sequence of 3D skeletal joint positions (keypoints) collected during the experiment using a marker-based MoCap system collecting at 100 Hz. For the current analysis, five infants were selected from the larger pool of participants because these infants had full datasets for all necessary markers throughout the experiment. Due to challenges inherent in motion capture with infants, such as their small size and rapid, unpredictable, and sometimes contorted movements, some markers were occasionally obscured or not captured effectively by the Vicon system. Additionally, infant compliance is not guaranteed for the experiment’s entirety. An infant was excluded if a single marker was missing for the duration of any phase of the experiment. Consequently, only data from five infants met the stringent criteria for data completeness necessary for our analysis. The Vicon MoCap system used to collect these data provides a multi-view video dataset as well as a three-dimensional keypoint locations for baby movements that can compensate for RGB images’ disadvantages49,50. To visually introduce the data, using frames of video the position of the baby’s limbs changed constantly throughout the experiment (Fig. 4). Figure 5 presents a single frame from each of the four stages. Figure 5a, b, and d are examples where the ribbon is not connected to the trigger foot. In (a) and (d), the mobile does not move. In (b), the mobile moves when triggered by the experimenter. The baby can only activate the mobile once the ribbon has been attached to the foot visible in (c). To clarify, however, only 3D keypoint data (not video) was used for AI training and analysis.

Figure 4
figure 4

Different frames from one video camera viewpoint of subject 1. Reflective markers are easily visible on this infant’s right leg.

Figure 5
figure 5

3D reconstructed tracking during four different stages in the experiment: (a) spontaneous baseline-no mobile movement (b) uncoupled reactive baseline - experimenter moves mobile (c) tethered stage (d) decoupled stage.

In every frame of data, the keypoints are labelled based on the anatomical locations of the movement markers in the frame. These keypoints are as follows: Head, C Pelvis, L/R Pelvis, L/R Shoulder, L/R Hand, L/R Hip, L/R Thigh, L/R Knee, L/R Ankle, L/R Foot (see Fig. 6). However, a subset of these skeletal joints were included in the current study since complete data were available for only some of the markers. For the current project, we used marker data for 12 keypoints: Head, Center Pelvis, L/R Hip, L/R Shoulder, L/R Hand, L/R Knee, and L/R Foot. As a result, the dataset contains 12 3D sequences.

Figure 6
figure 6

x, y, and z coordinates of body landmarks.

Various hurdles to 3D motion capture exist with infants such as marker jitters either due to system errors, extraneous reflections, or the mode of affixing markers to the infant’s body. For example, socks are made from stretchable fabric, allowing markers attached to the sock to wobble more than markers which are taped directly to the skin (a more rigid connection, that babies don’t particularly like). Other challenges include insufficient coverage of the infant’s limb motion either due to occlusion of the markers by vicissitudes of the experimental setup (i.e., at times, the mattress occludes almost 50% of the infant’s body, the mobile feedback system and its support beams) or by other body parts (i.e., stacking one foot on top of the other, grabbing feet with hands). The foregoing issues along with the complexity of the infant’s posture (e.g., crossing legs) can prevent the Vicon system from correctly identifying and tracking the keypoints.

Missing or incorrect marker data were handled using interpolation and filtering. Following that, we extracted features from the keypoint coordinate sequences that were preprocessed. Although keypoint missing or estimation errors may exist in a specific sequence input {\(J_{it}(x,y)\) \(i=1,2, \ldots ,12\)}, there are a relatively small number of frames with errors in estimation. The three times standard deviation technique, where \(\sigma\) is the standard deviation of the sequence {\(J_{it}\)}, was utilized for error elimination (1). Here \(E_J\) is the average of {\(J_{it}\)}.

$$\begin{aligned} J_{it}(x,y)= {\left\{ \begin{array}{ll} J_{it}(x,y), \quad \text {if } E_J-3\sigma \,\, \le J_{it} \le \,\,E_J+ 3\sigma \\ nan,\quad \text {if } J_{it} > E_J+3\sigma \, or \, J_{it}< E_J - 3\sigma \end{array}\right. } \end{aligned}$$

The trajectory of a joint’s location is continuous, and its velocity fluctuates uniformly during movement. Cubic spline interpolation was employed to fill in missing points. Ignoring frames with missing data may lead the actual movement to deviate from the subsequent calculation of movement complexity.

Dataset annotation

A primary aim of this work is to classify infant behaviour across experimental contexts and to explore whether structural changes in infant movement detected by AI converge on coordinative and dynamical indicators of infant agency. Importantly, dynamic and coordinative indicators of agentive discovery suggest that most infants’ Aha!’ moments occurred after 90 seconds of infant~mobile coupling20. Therefore, we split the tethered phase across time to investigate whether the various classification architectures employed here could differentiate between infant initial exploratory movement during interaction (i.e, CR1 - the first minute of infant~mobile coupling) from possible intentional activity later in the tethered phase (i.e, CR2 - beginning at 120s into coupling) (see Figure 7). Movement sequences were divided into five stages and annotated as follows: B1 (spontaneous baseline, no mobile motion), B2 (uncoupled reactive baseline, experimenter-triggered mobile motion), CR1 (the first minute of the tethered stage), CR2 (~ 2 min. into the tethered stage), DC (decoupled, untethered stage). For each infant, the five stages (B1, B2, CR1, CR2, DC) are labelled as 0 to 4, respectively. In subjects 1, 3, and 5, the ribbon was connected to the left foot; in S2 and S4, the right foot was connected. Figure 8 depicts the available data for all five infants across all stages. Due to the nature of experiments with human babies, uneven stage lengths are inevitable (e.g., CR1 for S4 contains only 18.94s of data).

Figure 7
figure 7

The top panel shows an idealized progression of experimental stages depicted in different colours (red, yellow, green, and blue represent spontaneous baseline (B1), uncoupled reactive baseline (B2), tethered stage (CR) and untethered stage (DC), respectively). The entire procedure was planned to span 12 minutes. To investigate whether various AI classification architectures could differentiate between initial infant exploratory movement during tethered interaction from possible intentional activity later in the tethered phase, the current project split the tethered phase into two segments surrounding the most typical timing for infant discovery as indicated by coordination dynamics analysis (i.e., ~ 90s into coupling)20. These two segments were the first minute of infant~mobile coupling (CR1, shaded light green) and the third minute of coupling (CR2, shaded dark green). Outlined in black, we planned to classify movement using two minutes of data for each of B1, B2, CR and DC (with CR split into CR1 and CR2). The bottom panel illustrates the available data for each stage for each of the five infants (S1-S5) relative to the idealized plan. Sliding windows (window width = 5s; overlap = 1s) were used across all experimental stages to assess fluctuations in classification accuracy across time (indicated by small black windows drawn to scale and arrows).

Spherical coordinates of movement directions

Vicon data represent the position of markers relative to an arbitrary origin of the 3D capture space. A map of body-centric displacements can be created by designating a root joint marker and calculating displacement of other joints relative to the root. Here, we transferred the Center Pelvis to the centre of our spherical coordinates (the root) and then calculated all joint displacement between each joint and the centre (0, 0, 0). Thus, all analyses were run on joint displacements calculated from body-centered positional 3D motion capture data. A modified spherical coordinate system was then used to partition 3D space into n bins. Spherical coordinates align with the infants’ specific movement directions1. To elaborate, spherical coordinates are assigned to be view-invariant, meaning descriptors of the same type of infant position are similar even when collected from different perspectives (Fig. 8). To create a compact representation of infant positions, we chose six informative joints (L/R hand, L/R knee, and L/R foot). The 3D space is partitioned into n bins by reference vectors alpha and theta, horizontal from L Pelvis to R Pelvis and perpendicular to the coordinate centre; therefore, any 3D joint can be localised at a specific bin51,52.

Figure 8
figure 8

Keypoint locations are recalculated in reference to the pelvis to produce person-centred data (left). The partitioning of the 3D space into n bins is referenced by vectors alpha and theta (right).

Estimation of the poses using histogram-based features

Action recognition techniques based on histograms using a set of feature vectors extracted from the skeletal joint locations allow us to estimate 3D pose in lower dimensions relative to standard methods25 suited for pose estimation from 2D video data (e.g.,53,54). Histogram-based features derived from baby joint displacement have been successfully employed in a deep learning architecture to classify infant movements55. Using this approach, the displacement of each joint is extracted for a 5s snippet, and then a feature called Histogram of Joint Displacement (HJD) is obtained. HJDs track which particular joint occupies each spherical coordinate and for how much time. These HJDs are used as inputs for the machine/deep learning systems tested here.

Classic machine learning approaches

We used k-Nearest Neighbour (kNN) and Linear Discriminant Analysis (LDA) as our classic machine learning approaches to classify the different stages based on changes in Histogram Joint Displacement. kNN uses the distance between each test point and k training points to determine which class best describes the data being tested. The test point is classified as belonging to whichever class of training data is most predominant in the k training points surrounding the test point. LDA models the decision boundary between classes. It identifies a linear combination of features that best differentiates the classes by maximizing the ratio of between-class variance to within-class variance. When applied to new data, this linear combination of features serves as a linear classifier. LDA operates on the Gaussian Distribution of input variables and can be used for both binary and multi-class classification. Both approaches have been previously applied to infant pose-based features24, providing a reasonable baseline for our evaluation performances.

Deep learning approaches

Recent studies have demonstrated that deep learning frameworks can be successfully applied to human action recognition30. However, several obstacles exist, mainly the large amount of data required to enable deep learning and the difficulty of creating explainable AI when applying deep learning to infants’ activities and their healthcare domain56,57. Understanding how a framework makes decisions is critical in these domains. Yet, deep features in deep learning frameworks are often incomprehensible to humans. Hence, our deep learning frameworks employ hand-crafted pose-based feature sets to classify infant movements. The histogram vectors of the pose-based features were evaluated using different deep learning architectures, including a Fully connected network (FCNet), 1D-Convolutional Neural Network (1D-Conv), 1D-Capsule Network (1D-CapsNet), 2D-Conv, and 2D-CapsNet. A range of hyperparameters for each architecture was optimized using a cross-validation approach. The result section describes the model’s parameters and other statistics such as the size of the input and output layers, kernels, and the number of filters.

In addition to the comprehensive exploration of deep learning frameworks and the challenges faced, we have provided detailed explanations and specifications in the Supplementary Information, demonstrating the complexities of our model’s parameters, input and output layers, kernels, number of filters, and other relevant statistics.

Training and testing inputs: dealing with imbalanced data

Using five subjects with uniformly available joint data we employed a leave-one-out (i.e., one subject) validation approach. The leave-one-out approach involved training each model using four out of the five infants’ HJDs and testing model accuracy using the infant HJD dataset which was left out from training. This procedure was repeated five times so that each infant dataset was used for testing once. Given the uneven quantities of available data for each experimental stage across infants (see Fig. 7), the data were down sampled for each instance of training to the shortest stage length among the four infants used for training. For example, when subjects 1-4 were used for training, the shortest stage for any of these four infants was CR1 for S4 (18.94s). Therefore, all stages for subjects 1–4 were downsampled to 18s, producing an equally balanced training set across infants and across stages. In our training process, we selected 1794 samples for each classification stage, with the quantity restricted by the count observed in stage CR1, which exhibited the lowest sample number. The remaining stages were randomly downsampled to ensure balanced labels. We obtained accuracy percentages for each instance of testing and averaged them (Table 1). In addition, to better understand the temporal evolution of infant movement, an overlapped window sliding strategy was employed that used a window width of 5 seconds with 1 second overlapped across all experimental stages. This allowed us to assess fluctuations in classification accuracy across time throughout the experiment (illustrated with small black windows and arrows in Fig. 7). Average classification accuracy on sliding windows is reported as a measure of model performance. Analysis of variance (ANOVA) is used to determine the significance of differences between the results of each method (\(\textrm{p}<0.05\)).


Classification accuracies

Table 1 presents the average classification accuracy for the various machine and deep learning approaches tested. The approximate chance level for accurately classifying the five experimental stages used here is 20%. Table 1 also reports accuracy levels for four fused features. This set of fused features was created by concatenating the histogram features based on the number of joints available in the datasets, forming a long 1D vector. The four resultant fused joints were (1) Hands: fused features from L &R hand, (2) Knees: fused features from L &R knee, (3) Feet: fused features from L &R foot, and (4) Full-body: fused features from L &R hand, L &R knee, and L &R foot.

The classification accuracy achieved for L/R foot individually and fused Feet was superior to that of all other feature sets. In particular, 2D-CapsNet achieved the highest accuracy of 86.25% for the fused Feet feature set. In terms of mean joint type accuracy, all foot features (i.e., L, R and fused) scored significantly higher average accuracy across all evaluated classification methods compared to other joint types \((F(2,48) = 34.65, p < 0.0005)\).

LDA and kNN achieved highest accuracy for single joint features. For example, LDA achieved the highest accuracy of 59.63% and 75.63% for the left hand and left foot, respectively, whereas kNN achieved 58.00% and 61.63% for the right hand and left knee, respectively.

However, when we considered fused features, deep learning approaches outperformed LDA and kNN. All fused features improved in accuracy using deep techniques. When spatial information between different body parts was used during movement, 2D deep approaches performed better: 2D-Conv and 2D-Capsnet achieved an accuracy of 59.57% and 86.25% for Hands and Feet, respectively. Even though 2D-CapsNet achieved a lower accuracy for Hands (50.65%), a comparison of the mean accuracy per classifier demonstrated that 2D-CapsNet maintained the highest mean accuracy at 65.65. Also, the fused feature of Full-body that represent the most general representation of 3D skeletal joint information achieved the highest accuracy (65.51%) using the 2D-CapsNet approach. In sum, though all models tested performed well above chance (20%), 2D-CapsNet models using fused feature inputs made the most accurate classifications.

Table 1 Performance of all models: average sliding window accuracy (%) with standard deviation.

Using the 2D-CapsNet to assess stage transitions

A sliding window was used to analyze each infant’s movement in conjunction with the leave-one-out approach. For each infant, we shifted a window of 5s with 1s overlapping across one stage while keeping other windows stationary in the remaining stages. Figure 9a–e shows the moving average of the temporal analysis for each infant using the fused Feet feature. As can be observed, five ascending steps illustrate the correct labels for five distinct classes. Stage B2 received the most accurate labels, indicating that it is easier to detect Histograms of Joint Displacements (HJDs) accurately compared to other stages. This observation is echoed by the confusion matrix, where Stage B2 consistently received the most accurate labels (refer to Supplementary Information Figure 8). Additionally, the classifier did not perform well when it came to identifying spontaneous movement in stage B1, indicating that infants moved variably such that the system detected movements in B1 which were like movement patterns in multiple other stages. This becomes more evident when considering the moving average of classification accuracy for temporal analysis of fused features Knee, Hand, and Feet (Figure 10). B2 has a higher overall classification accuracy over the whole period than the preceding stage. Additionally, as we moved along after decoupling (in stage DC), we observed a decline in classification accuracy, indicating that movements were more irregular after the mobile stopped responding to infant movement.

Figure 9
figure 9

Temporal analysis for each subject. Five ascending steps illustrate the correct labels for five distinct classes: B1 (spontaneous baseline, no mobile motion), B2 (experimenter triggered mobile motion), CR1 (minute 1 of coupling), CR2 (2 minutes into coupling), and DC (decoupled, no mobile motion), respectively. Completely flat lines on each ascending step would indicate perfect labelling for each stage. 100 sliding steps are equivalent to 1s.

Figure 10
figure 10

The moving average of the classification accuracy across time (s) for fused features (Full-body, Knees, Hands, and Feet). B1 (spontaneous baseline), B2 (uncoupled reactive baseline), CR1 (the first minute of the tethered stage), CR2 (~ 2 min. into the tethered stage), DC (decoupled stage).

In addition, a higher rate of misclassification between each stage and the stage immediately following was observed (see Figure 8 of the Supplementary Information), which is a common occurrence in classification tasks involving closely related categories. This can be attributed to the inherent similarities in features characterizing neighbouring stages. In the first baseline (B1), babies moved spontaneously in an undifferentiated manner, causing misclassification with later stage activity, notably B2. Many infants froze, reducing their activity when the experimenter triggered the mobile during the B2 (cf. Table 2, Figs. 11, 12). Similarly, when infants were first tethered to the mobile (CR1), infants alternated between moving and freezing20, so confusion between B2 and CR1 is understandable. However, periods of freezing progressively shortened in length over the course of the first minute of tethering as infants acclimated to the mobile and their activity ramped up. Since CR1 and CR2 are samples derived from the same overarching experimental stage, it is reasonable to anticipate shared features in babies’ movements during these stages. Finally, infants generally remained quite active after the tether was disconnected (stage DC). Classification accuracy was lowest in the DC stage. Given the proximity of CR2 to DC in the classification hierarchy, it is reasonable to expect that misclassifications would predominantly involve labels transitioning to the DC stage. In sum, the directionality of misclassification likely reflects this flow of the experimental design and the fact that these data represent the behaviour of naïve 3-month-old infants exploring and adjusting their behaviour within a novel and changing functional context.

Feature analysis

We first computed the average displacement rates of both feet across the stages to further understand the significance of higher classification accuracy levels observed during B2 and across CR (see previous section). As can be seen in Table 2, the feet were significantly more active during DC compared to the other stages (F(4, 16) = 3.52, p = 0.03, sphericity assumed), and infants tended to move their feet less during B2 compared to the other stages. Based on average foot movement rate alone, accurately classifying B2 and DC foot behaviour would appear to be easier task compared to classifying stages B1 and CR which had similar movement rates.

Infant learning during coupling has classically been defined as 150% increase in movement rate relative to spontaneous baseline8. While this cut-off was not met by these five babies as a group, the infants do increase movement rate roughly 150% during tethering relative to the second baseline when the experimenter moved the mobile. Given that both the coupled phase and second baseline both involve mobile movement, the second baseline is a fair comparison point contextually speaking. Furthermore, averaging movement rate across babies does not address individual behaviour or discovery processes. Finally, it is possible that infants discover specific trajectories that are particularly suited to elicit mobile response. Upon discovering the relationship between foot movement and mobile motion, infants may adjust foot movement topology rather than (or in addition to) foot movement rate. These infants do in fact change their foot trajectories as is reflected by the high accuracy rates depicted in Fig. 11.

Table 2 Average Displacement Rate of Feet (m/min).

To explore topological changes in infant joint movement, we examined the actual trajectory of the connected feet for Subject 1 (left foot) and Subject 2 (right foot) (Figs. 11, 12). As can be observed, random movements in stage B1 became more constrained in both subjects after the examiner activated the mobile (B2). Topologically, infants produced almost the same movement trajectories for stages CR1 and CR2. S2 maintained similar activity in DC compared to CR, indicating lasting effects of baby-to-world connection even after that connection is severed.

To investigate how changes to features of infant movement directly inform deep learning systems, we quantified the HJDs of all joints for S2 over 10s in the middle of each stage \((bins = 64)\) (see Fig. 13). Note that whereas the x-axis of each histogram reflects the number of bins (or the range of 3D space) a joint occupies, the y-axis reflects the amount of time the joint occupied a particular bin. Therefore, a large HJD value constrained to a single bin reflects little joint motion or, at most, very topologically restricted motion. Although most joints during B1 are clearly separated to the left and right of the histogram graph space in accordance with their anatomical sides (e.g., left hand (blue) occupies bins on the left of the histogram), the right foot (black) is visible in many bins between bins 8–57, implying random movement of the right foot in stage B1. The right foot movement became more restricted throughout stage B2 (bins 48 and 56), supporting the observations made in the preceding section (Results, C). There was a noticeable increase in the range of this joint’s position across the CR stage (activity was between bins 46–57 in CR1 and expanded between bins 32–57 in CR2). Although the foot was active in a larger volume in CR than B2, activity became constrained again in the DC stage, like B2. Importantly, this evolution of activity was only observed in the foot linked to the mobile (the right foot) and not in any other limb segments.

Figure 11
figure 11

The trajectory of the left (connected) foot in S1 as a function of experimental stage. Stages: B1 (spontaneous baseline), B2 (uncoupled reactive baseline), CR1 (the first minute of the tethered stage), CR2 (~ 2 min. into the tethered stage), DC (decoupled stage).

Figure 12
figure 12

The trajectory of the right (connected) foot in S2 as a function of experimental stage. Stages: B1 (spontaneous baseline), B2 (uncoupled reactive baseline), CR1 (the first minute of the tethered stage), CR2 (~ 2 min. into the tethered stage), DC (decoupled stage).

Figure 13
figure 13

The HJD for Full-body joints over 10s in the middle of each stage for S2. The x-axis is the number of the 64 bins used in the modified spherical coordinate system. The y-axis reflects the percentage of time the joint occupied a particular bin. Stages: B1 (spontaneous baseline), B2 (uncoupled reactive baseline), CR1 (the first minute of the tethered stage), CR2 (~ 2 min. into the tethered stage), DC (decoupled stage).


Comparison of AI approaches: which AI approach works best and why?

We aimed to use machine and deep learning approaches to identify infant behavioural changes in response to changes in functional context. Both machine and deep learning techniques accurately classified infant Pose-Based features (HJDs) throughout the experiment. All techniques tested reached accuracies higher than chance (20%) for all joint types (Table 1), although deep techniques generally outperformed machine learning techniques for any given joint type (Table 1). In particular, 2D-CapsNet achieved the greatest accuracy level (accuracy for fused Feet= 86%). Thus, 2D-CapsNet maximised spatial information between different body parts during movement, demonstrating the architecture’s high-performance ability to generate feature hierarchy.

Comparison of joint classification across AI approaches: which joint has the most distinctive patterns and why?

For every architecture tested, whether of machine or deep learning, some measure of the feet achieved the highest classification accuracy rates (~ 20% higher than the fused hands, knees or whole body), indicating that the feet had the most distinctive pattern changes across the various experimental stages in response to the mobile’s changing behaviour and the infant’s functional connection to the mobile.

Do dynamics of classification accuracy reflect infant discovery?

The classification accuracies of several joints were assessed continuously over time to investigate how infants adapt to functional context and whether accuracy dynamics indicate that infants transition from exploring their relationship with the mobile to intentionally directing mobile movement. Again, classification accuracy was strongest by far for the feet, the end effector, compared to other body parts regardless of experimental stage (Fig. 10). Accuracy of the feet fluctuated around 80% during the spontaneous baseline, a lower level than when the mobile moved either at the hand of the experimenter during the second baseline or at the foot of the infant during coupling. These results taken together with the large spread seen in the first baseline histogram of one infant (each of Subject 2’s feet appear in 10 histogram bins, Fig. 13a) denote that infants’ movements are more widespread topologically and therefore harder to classify when the infant is moving spontaneously without any active environmental stimulation. On the other hand, once the experimenter began triggering the mobile, classification accuracy jumped up, hovering around 90% throughout the second baseline. This spike in classification accuracy is related to the concentration of foot activity within 3D space during the second baseline (e.g., each of Subject 2’s feet appear in only one or two histogram bins, Fig. 13b). Soon after infant~mobile coupling began, there was a momentary dip in accuracy (Fig. 10), possibly reflecting infants’ initial probing of a variety of trajectory orientations. However, accuracy levels spiked in the second minute of coupling (hovering again around 90%), likely reflecting infant discovery of trajectories particularly suited for triggering mobile motion (e.g., the switch to Z-orientation seen in individual infants as they raised and lowered their foot to trigger mobile motion, Figs. 11-12). These AI results converge on our dynamic and coordinative analyses which likewise indicate that the transition from initial spontaneous exploration to goal-directed action during infant mobile coupling can be characterized by initial expansion and later minimization of movement variability20. In sum, classification accuracy fluctuates within and across experimental stages and seems to reflect processes of exploration and discovery. However, what can we make of the fact that classification accuracy levels were similar when the experimenter triggered the mobile and during coupled interaction?

Since histograms of joint displacement were used as inputs to the AI architectures tested here, stark changes in overall quantity of limb activity from stage to stage are likely to be relevant to stage classifications. Accurately classifying foot behaviour based on average foot movement rate alone seems to be more straightforward during the second baseline (when infants were least active compared to the spontaneous baseline and the coupled stage as the latter two stages had similar movement rates (Table 2)). Therefore, although similar classification accuracy rates were obtained for the feet in the second baseline and during coupling, the high classification accuracy results during the second baseline are largely explained by the overall reduction in movement rate. Simply put, it is easy to identify data from the second baseline because the babies move much less. In contrast, the high classification accuracy during coupling indicates that foot movement changes topologically during coupling (i.e., from the first minute to the second minute) and across the stages. Our AI architectures use those qualitative changes (not merely quantity of movement) to classify behaviour. Though infant contingency studies traditionally infer learning when movement quantity is much greater during coupling compared to spontaneous baseline, our results also demonstrate that qualitative changes in infant movement (Figs. 9, 10, 11, 12, 13) can be detected even when average movement rates do not differ significantly. At the same time, it confirms that a qualitative change has taken place within coupling and across the experiment.

After the connection between infant and mobile was severed, activity spiked (Table 2) and accuracy quickly declined, plunging even lower than spontaneous baseline (Fig. 10). Infants appear to switch from directed movement during coupling to varied, exploratory movement after decoupling. As a group, they explore to a greater degree after being disconnected than before they were exposed to the possibility of controlling the mobile. It is as if the infants’ search for connection to the world has intensified because of the lost linkage. However, certain infants’ trajectories in the decoupled stage reflect a lasting imprint of coupling (e.g., S2, Fig. 12). It is possible that only some infants recognize their functional relationship to the mobile and so only these infants maintain the patterns which emerged during coupling after decoupling, expecting those unique patterns to elicit the mobile response. If the accuracy rate during decoupling persists at the same high level observed during coupling or dips below the lower accuracy rate seen during spontaneous movement, either pattern could indicate infant discovery during coupling. Using classification accuracy to identify infants who discover their ability to make the mobile move is complicated by the fact that different behaviours could indicate infant discovery for different reasons. Identifying infant discovery likely requires piecing together clues from various experimental stages. Therefore, temporal analysis of classification accuracy for each infant across the entire experiment (Fig.e 9) provides an important tool for understanding individualized processes of exploration and discovery. It is important to recognize that regarding the science of goal-directed action, the infant researcher is at a major disadvantage in comparison to the adult researcher for the simple reason that adults can speak and understand verbal instructions. One can ask an adult to execute a specific task or ask the adult participant what they intended to do. Given that machine and deep learning architectures detect qualitative changes in (prelinguistic) infant movement related to functional context and goal-directedness, AI analysis offers new levels of insight into behavioural~cognitive processes of preverbal infants.

Limitations and future directions

As in any formal comparison of machine and deep learning techniques, future research may explore the effects of other hyperparameter settings on classification accuracy rates. Although the histogram-based techniques used here condensed 3D movement data to a lower dimensionality by grouping data into bins, a comprehensive representation of the data was still preserved. Also, movement was represented using feature descriptors that are intended to simplify and extract useful information. As a result, rather than performing a frame-by-frame comparison, we were able to examine the distribution of the displacement of all joints across a time window. Future studies may incorporate the actual recorded video and skeletal joint information involving more infants and employ Recurrent Neural Networks (RNN) and more advanced Transformer and attention-based models to investigate the feasibility of predicting or decoding the temporal information of the input posture sequence. In addition, other advanced pose-based features can be used to evaluate the proposed network architectures58.

The results demonstrate accurate classification at the granularity of experimental stages but, as we have applied a windowing approach (5s wide and 1s overlap) we have demonstrated that we can classify these windows with an averaged accuracy of ~86% across infants (transferability of classifier to new unseen participants). This demonstrates that the proposed AI framework could be applied online, in real-time for labelling within experiments and using the knowledge provided by the classifier to adapt the experimental paradigm in real-time to investigate more complex neurodevelopmental processes.

A notable limitation of our study is that the current analyses were restricted to five infants. Increasing the sample size would enhance our present results. Nevertheless, it’s impressive that we can distinguish distinct changes in such a small sample. One possible path to recover the entire sample featured in the original experiment would be to use pose estimation on 2D videos to extract 3D positional data from the infants with incomplete motion capture datasets. The technology to accomplish this is in development. Also, a key goal for future work is directly connecting AI topological classifier dynamics (cf Figs. 9, 10) to signatures of emergent agency based on fine temporal resolution coordination dynamics20. It is an open question whether these AI techniques can provide further insight into the dynamics of agency emergence in individual infants.

Beyond unravelling the way infants discover to stabilize control over naturalistic interactions with the world, a basic issue of cognitive development (as well as artificial intelligence), applying AI to identify context-dependent changes in individual infant behaviour has obvious clinical relevance. Function~dysfunction is in many scenarios definitional to diagnosis. However, the General Movement Assessment, one of the most widely used tools for predicting and diagnosing motoric and cognitive disorders from spontaneous infant movement, does not address an infant’s functional movement abilities59,60. Given this disconnect, it is, therefore, no wonder that many recent AI efforts to classify and predict clinical dysfunction from infant behaviour have only considered spontaneous infant movement4,61,62,63,64,65,66,67,68,69 or that datasets made publicly available to train and test machine learning algorithms for classifying normative and atypical behaviour related to disease/disorder is predominantly composed of spontaneous behaviour data70,71,72. The baby~mobile paradigm offers an ethological means to assess individual infant motor and cognitive behaviour in both spontaneous and functional contexts73,74,75,76,77,78 and a potential means for personalized intervention79,80. We are developing a multi-pronged program of research built on synergizing theory, experiment, coordination dynamics modelling and analysis, active inference Bayesian modelling, and AI analysis aimed at understanding the emergence of sentient agency in lawful terms81 and advancing individualized, functional diagnosis and treatment capabilities for at-risk infants.


Both machine and deep learning techniques successfully classified randomly sampled five-second snippets of 3D infant movement from different experimental stages using histogram features of pose-based information. The deep learning 2D-CapsNet, however, achieved the highest accuracy classification rate. Critically, for every architecture tested, whether of machine or deep learning, some measure of the feet achieved the highest classification accuracy rates. Without informing these AI systems about the experimental design or the connectivity of the foot to the mobile, every architecture tested speaks the same message: the feet, the end effectors, are most uniquely affected by the baby~mobile interaction. Put another way, the functional connection to the world influences the baby most where it matters, namely at the point of infant~world connection. The information flow between agent and world is palpable; it is bodily felt and enacted. In contrast to traditional infant contingency studies which rely upon changes in movement quantity to infer learning, our AI classifiers offer a means to automatically detect characteristic topological changes across the experimental stages. Thus, AI systems can play a significant theoretical role, namely by providing insight into the early ability of infants to actively detect and engage in a functional relationship with the environment. Additionally, assessing dynamics of AI classification accuracy for each infant opens a new avenue for unravelling when and how individuals engage with and discover their relationship to the world. Whereas previous AI approaches have focused on classifying spontaneous infant movement in relation to clinical outcomes, pairing theory-driven experimentation with AI tools will ultimately allow us to develop more robust, context-dependent and functionally relevant assessments of infant behaviour for risk, diagnosis and treatment of disorders.