A hierarchical 3D-motion learning framework for animal spontaneous behavior mapping

Animal behavior usually has a hierarchical structure and dynamics. Therefore, to understand how the neural system coordinates with behaviors, neuroscientists need a quantitative description of the hierarchical dynamics of different behaviors. However, recent end-to-end machine-learning-based methods for behavior analysis mostly focus on recognizing behavioral identities on a static timescale or based on limited observations. These approaches usually lose rich dynamic information on cross-scale behaviors. Here, inspired by the natural structure of animal behaviors, we address this challenge by proposing a parallel and multi-layered framework to learn the hierarchical dynamics and generate an objective metric to map the behavior into the feature space. In addition, we characterize the animal's 3D kinematics with our low-cost and efficient multi-view 3D animal motion-capture system. Finally, we demonstrate that this framework can monitor spontaneous behavior and automatically identify the behavioral phenotypes of a transgenic animal disease model. Extensive experimental results suggest that our framework has a wide range of applications, including animal disease model phenotyping and modeling the relationships between neural circuits and behavior.

1) Use DLC to separately track the body features in the four videos; 2) Perform the 3D skeleton reconstruction with the 2D coordinates from the four views.
Decompose NMs with the two-stage algorithm and obtain the behavior segment kernel matrix.
Merge and dimensionality reduction: 1) calculate the overall segment kernel matrix of all involved behavior segments (black boxes mark the matrix of every single session on the diagonal); 2) perform the dimensionality reduction of the group segment kernel matrix and obtain the 2D feature space of NMs. Unsupervised clustering and downstream analysis: 1) include the locomotion dimension and expand the behavior feature space to three dimensions to construct the behavioral map; 2) perform the unsupervised hierarchical clustering to categorize the movement types; 3) extract the ethograms of each behavioral experimental session.
Fig. 1 | The workflow of the hierarchical 3D-motion learning framework. a Four main steps for a single experimental session: 1) Calibration (related to Supplementary Methods "The calibration of 3D motion capture system"): using the auto-calibration module to quickly prepare 70 groups of checkerboard images from various angles and positions, and using the MATLAB StereoCameraCalibrator GUI to calculate the calibration parameters of the three pairs of cameras. This step is necessary only when the calibration parameters are unknown or the cameras have been moved. 2) Data collection (related to Supplementary Methods "Animals, behavioral experiments and behavioral data collection"): setting up the behavioral apparatus, preparing the animal, and then using the multi-view video capture device to collect synchronous behavioral videos. 3) 3D reconstruction (related to Supplementary Methods "3D pose reconstruction"): using the DLC pre-trained model to predict the animal's 16 body-part 2D coordinates from the four separate videos, then performing the 3D skeleton reconstruction with the 2D coordinates from the four views to obtain the animal's postural time series. 4) Behavior decomposition (related to Supplementary Methods "Behavior decomposition"): performing the two-stage behavior decomposition on the pre-processed postural time series. This step discovers the behavioral modules based on the optimal movement segmentation. Finally, these behavioral segments are aligned using the DTAK metric to construct the segment kernel matrix representing their similarity. b Group analysis based on specific biological questions. 1) Merge and dimensionality reduction (related to Supplementary Methods "Group segment kernel matrix and low dimensional embedding"): according to the experimental grouping, single-session segment kernel matrices are merged into a group segment kernel matrix.
To visualize the informative structure of the behavioral modules involved, we used dimensionality reduction to transform the group segment kernel matrix into a low-dimensional feature space.

Supplementary Fig. 3 | Calibration of the 3D motion capture system. a-d All of the checkerboards are fully captured at four different camera placements under dim light. We use a 10-inch tablet to display the checkerboard to calibrate the four cameras, which allows the brightness to be adjusted freely to an appropriate level that keeps the checkerboard clear. e-h Results of grid detection with MATLAB's StereoCameraCalibrator GUI, corresponding to the a-d images. The corner recognition of the checkerboard is located on the corners of the black squares. i-k Mean reprojection error per image (MREI). MREI gives the quantization level of the calibration error between cameras and checkerboards, where the mean error in pixels (MEP) represents the pixel errors in each camera-checkerboard pair and the overall mean error (OME) gives a whole estimation of the pixel errors. l-n Extrinsic parameters visualization (EPV) of the cameras. EPV shows the relative positions of the two cameras and the checkerboards in 3D space. If the EPV differs from the real relative positions, the two cameras should be recalibrated until they coincide.

a The workflow of 3D reconstruction of a single body part: 1) estimate the two-dimensional coordinates of the animal's body part from the four cameras; 2) select the cameras to be used for reconstruction by thresholding the likelihood of the estimated body part; 3) determine whether the number of available cameras meets the reconstruction requirement (2 or more); and 4) if so, reconstruct the 3D coordinate of the body part; otherwise, the 3D reconstruction fails due to occlusion. P1, primary camera; S1, first secondary camera; S2, second secondary camera; S3, third secondary camera. b The errors in 2D body-part estimation versus ground truth. The error rates are shown for each body part separately in the box plot and averaged 0.534 ± 0.005%. In the box plots, the lower and upper edges of the box are the 25th and 75th percentiles of the error rates, the central marks indicate the median, the whiskers extend to the most extreme data points not considered outliers, and the outliers are plotted individually using the dot symbol.

Proportion of the number of available cameras for each body part. The color-coded dots indicate how many cameras are used for reconstruction at the indicated location. The positions of all the points are the x and y coordinates of the nose. For visualization purposes, the data are down-sampled to 10%. The proportion of the number of cameras used to reconstruct the nose in 3D is indicated in the key on the left.

Supplementary Fig. 7 | The procedure of data preprocessing. a The procedure of data quality control contains two main parts: a noise suppression sequence ordered by five filtering modules, and the SNR detection part. b The selected multi-dimensional time series of the mouse skeleton are preprocessed by the data quality control procedure. The red arrows indicate that the apparent noises are suppressed while the whole series does not lose too many rapidly changing details. c Eight skeleton frames are randomly selected for a demo of mouse alignment. The results of center and orientation alignment are demonstrated in the top view and side view. d The multi-dimensional time-series data quality of the 12 involved test mice is described; their SNRs all exceed 0.996, indicating that on average only four noise points per thousand data points need to be eliminated. Abbreviation: SNR, signal-to-noise ratio.

Calibration of the 3D motion capture system
We used MATLAB's StereoCameraCalibrator GUI to calibrate the cameras of the 3D motion capture system. The GUI uses the basic principles of Zhang's calibration method 1. Furthermore, we prepared checkerboard pictures to obtain the calibration parameters (Supplementary Fig. 3a-d). To improve the efficiency of the checkerboard capturing step, we updated the 3D motion capture system: the new version uses a 32-inch display to show the checkerboards. Compared with the old version, the new version integrates the code for checkerboard display, checkerboard moving, and four-camera checkerboard capturing, using the 3D motion capture system to capture the checkerboard images automatically without manual operation (see Supplementary Fig. 2).
After capturing the checkerboard images, we estimated the camera parameters using the following equation, where (X, Y, Z) are the 3D coordinates of a point in the stereo cameras' frame:

$$P\begin{bmatrix}u\\v\\1\end{bmatrix}=\begin{bmatrix}f_x&\gamma&u_0\\0&f_y&v_0\\0&0&1\end{bmatrix}\begin{bmatrix}R&t\end{bmatrix}\begin{bmatrix}X\\Y\\Z\\1\end{bmatrix}$$

where P is a scaling constant; u and v are the pixel coordinates of the checkerboard image; f_x is the focal length divided by the length of the picture; f_y is the focal length divided by the width of the picture; γ is the radial distortion parameter; u_0 and v_0 are the midpoints in pixel coordinates of the image; R is a rotation matrix; and t is the translation vector. Zhang's calibration method can obtain f_x, f_y, γ, R, t, and P. Each camera acquired 70 calibration photos, for a total of 280 calibration photos across the four cameras (Supplementary Fig. 3). Moreover, we measured and recorded the size of the checkerboard grid. We set one of the cameras as the primary camera and the remaining three cameras as secondary cameras. The cameras were divided into three groups, each containing one secondary camera and the primary camera (see Supplementary Fig. 4). In each group, we used the StereoCameraCalibrator GUI to calibrate the two cameras and obtain three calibration files.
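For illustration, the pinhole projection model above can be sketched numerically. The following Python/NumPy sketch is ours (the paper's pipeline is MATLAB); the intrinsic values, camera pose, and the function name `project_point` are hypothetical:

```python
import numpy as np

def project_point(K, R, t, X_world):
    """Project a 3D point into pixel coordinates with the pinhole model:
    P * [u, v, 1]^T = K [R | t] [X, Y, Z, 1]^T (Zhang's formulation)."""
    X_h = np.append(X_world, 1.0)                     # homogeneous 3D point
    x = K @ np.hstack([R, t.reshape(3, 1)]) @ X_h     # unnormalized pixel vector
    return x[:2] / x[2]                               # divide out the scaling constant P

# Toy example: identity rotation, camera 2 units behind the origin
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 2.0])
uv = project_point(K, R, t, np.array([0.0, 0.0, 0.0]))
# a point on the optical axis lands at the principal point (u_0, v_0) = (320, 240)
```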

3D pose reconstruction
We used the pose3d toolbox in MATLAB to perform the 3D reconstruction of the animal skeleton.
This toolbox uses a triangulation algorithm to obtain the 3D data according to the following equation:

$$x_i = P_i X,\qquad P_i = A_i\,[R_i\ \ t_i],\qquad X = [I,\ J,\ K,\ 1]^{\mathrm T}$$

where x_i is the coordinate of the matching point in the i-th camera image; X is the homogeneous vector of the 3D point, with I, J, K its 3D components; P_i is the projection matrix; A_i is the intrinsic matrix of the i-th camera; R_i is the rotation matrix of the i-th camera; and t_i is the translation vector of the i-th camera. To obtain X, we used a least-squares solution computed by singular value decomposition 3. Specifically, three calibration files and four feature files were used as input to pose3d, and the system produced one 3D feature file as output. After this step, we rotated and translated the 3D coordinates to a horizontal orientation for visualization and subsequent processing. To show the 3D reconstruction effect, we used Delaunay triangulation 4 to fill the skeleton.
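The least-squares triangulation by SVD can be sketched as follows. This is an illustrative Python/NumPy reimplementation of the standard direct linear transform, not the pose3d code itself, and the toy camera matrices are hypothetical:

```python
import numpy as np

def triangulate(proj_mats, pts_2d):
    """Linear triangulation: stack two rows u*P[2]-P[0], v*P[2]-P[1] per camera
    and solve the homogeneous least-squares system A X = 0 by SVD."""
    rows = []
    for P, (u, v) in zip(proj_mats, pts_2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                       # right singular vector of the smallest singular value
    return X[:3] / X[3]              # back from homogeneous coordinates

# Two toy cameras observing the point (1, 2, 5)
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])               # camera at the origin
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])  # shifted 1 unit along X
X_true = np.array([1.0, 2.0, 5.0])

def observe(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

X_hat = triangulate([P1, P2], [observe(P1, X_true), observe(P2, X_true)])
# X_hat recovers (1, 2, 5) up to numerical precision
```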

Data quality control
Data quality control ( Supplementary Fig. 7a) was performed to ensure the precision of follow-up algorithms. It included two steps: data noise suppression and data quality assessment.

Data noise suppression:
We built a sequence of data-noise suppression algorithms, including a likelihood method, morphological noise detection, a median filter 5, a local median filter, and an adaptive median filter. These algorithms were ordered as needed. The likelihood method involved a noise detection step: the likelihood of the raw data in DLC was assessed, and a threshold was set to distinguish noise-level from non-noise-level likelihoods. We set the threshold to 95%; when the likelihood was smaller than 95%, the data point was marked as noise. Morphological noise detection was used to identify leaps of body points at the physical morphological level. The most precise pose estimation frame, selected automatically as the maximum-likelihood frame, served as a template. The maximum body-point distance of the template multiplied by a coefficient was set as a threshold to detect morphological saltation of the animal skeleton, by comparing the current maximum body-point distance with the threshold. Any distance between two points higher than the threshold was marked as noise. The median filter is a widely used method for impulse noise: it replaces impulse noise points with the median values in a time window, which can effectively eliminate the impulse noise while keeping the high-frequency details of the data. Most noise points in pose estimation can be regarded as impulse noise; thus, the median filter was suitable to suppress them. The local median filter was used as an interpolation algorithm in our framework. We placed the median filter window across the noise points detected by the likelihood method and the morphological filter, and then replaced those points with the median of the data in the filter window. Since this filter only operates at noise locations, it does not influence the qualified data. The adaptive median filter is designed as a substitute for the median filter: whereas the median filter has a fixed window width and thus treats all data at the same time scale, the adaptive median filter adopts a variable window width, so impulse noise at different time scales can be handled more appropriately. We provide a default parameter set for the noise-suppression algorithm sequence for preprocessing our 3D animal pose estimation data (likelihood method, morphological noise detection, and a median filter with a 1-s time window).
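The likelihood-thresholding and local-median-filter steps above can be sketched as follows. This Python/NumPy sketch is illustrative (the pipeline itself is MATLAB-based); the function name `suppress_noise` and the window width are assumptions:

```python
import numpy as np

def suppress_noise(series, likelihood, thresh=0.95, win=5):
    """Mark frames whose DLC likelihood falls below the threshold as noise and
    replace each with the median of a window centred on it (local median filter).
    Qualified data points are left untouched."""
    out = series.astype(float).copy()
    half = win // 2
    for i in np.flatnonzero(likelihood < thresh):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        window = np.delete(series[lo:hi], i - lo)   # exclude the noisy point itself
        out[i] = np.median(window)
    return out

# One coordinate trace with an impulse-noise spike at index 2
x = np.array([1.0, 1.1, 9.0, 1.2, 1.0, 1.1])
p = np.array([0.99, 0.99, 0.10, 0.99, 0.99, 0.99])
clean = suppress_noise(x, p)
# the spike is replaced by the median of its neighbours; other points are unchanged
```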
Data quality assessment (Supplementary Fig. 7d): We calculated the total energy of the multi-dimensional time series for the raw data, the preprocessed data, and the noise, where the noise is defined as the absolute value of the difference between the raw and preprocessed data. We then used these energies to compute a signal-to-noise ratio (SNR): the SNR was calculated as the total energy of the preprocessed data divided by the total energy of the raw data.
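A minimal sketch of this SNR computation, assuming "total energy" means the summed absolute values of a series (our assumption; the text does not spell out the energy definition):

```python
import numpy as np

def total_energy(x):
    # Assumed energy definition: sum of absolute values over the series
    return np.sum(np.abs(x))

def snr(raw, pre):
    """SNR as described: total energy of the preprocessed series
    divided by the total energy of the raw series."""
    return total_energy(pre) / total_energy(raw)

raw = np.array([1.0, 1.0, 10.0, 1.0])
pre = np.array([1.0, 1.0, 1.0, 1.0])
noise = np.abs(raw - pre)        # noise = |raw - preprocessed| = [0, 0, 9, 0]
value = snr(raw, pre)            # 4 / 13, far below the ~0.996 of clean recordings
```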

Mouse alignment
Mouse alignment included four steps (Supplementary Fig. 7c), and all alignments were carried out in a 3D Cartesian coordinate system. Step 1, center alignment: we took the back point as the center point and subtracted its value from the value of each body part in the {X, Y} plane, so that each frame of the mouse skeleton has the same center coordinate (0, 0) after center alignment.
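The center-alignment step can be sketched as follows. This is an illustrative Python/NumPy sketch; the array layout (frames × body parts × (x, y, z)) and the function name `center_align` are assumptions:

```python
import numpy as np

def center_align(skeleton, back_idx):
    """Center alignment: subtract the back point's (x, y) from every body part,
    so each frame is centred at (0, 0) in the XY plane (z is unchanged)."""
    out = skeleton.copy()
    out[:, :, :2] -= skeleton[:, back_idx:back_idx + 1, :2]
    return out

# 2 frames x 3 body parts x (x, y, z); body part 1 plays the role of the back point
sk = np.arange(18, dtype=float).reshape(2, 3, 3)
aligned = center_align(sk, back_idx=1)
# the back point now sits at (0, 0) in the XY plane of every frame
```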

Behavioral decomposition
Behavioral decomposition was applied to the non-locomotor movements (NMs). First, we took the postural series U of each video and computed frame-wise similarities with the Gaussian kernel 8:

$$\kappa(u_i, u_j) = \exp\!\left(-\frac{\lVert u_i - u_j\rVert^2}{2\sigma^2}\right)$$

We fixed σ to 30 so that each U from different videos has the same Gaussian kernel and maps to the same high-dimensional space.
Fourth, the NMs were decomposed via aligned cluster analysis (ACA) 9,10 according to K. ACA involved two steps, initialization and optimization; we used spectral clustering (SC) 11 for initialization.
We took the data points of the NMs after alignment, spatial reduction, and temporal reduction, and calculated the optimal DTAK (Supplementary Fig. 7) distance matrix T as

$$T_{ij} = \mathrm{DTAK}(X_{s_i}, X_{s_j})$$

where T_{ij} is the DTAK distance between the dynamics of two NMs, X_{s_i} and X_{s_j}.
Next, T was reduced from P dimensions to 2 dimensions by UMAP, which enables large amounts of data to be more widely distributed.
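The DTAK alignment of two segments can be sketched with the standard dynamic-programming recursion of the dynamic time-alignment kernel. This Python/NumPy sketch is ours, not the authors' MATLAB implementation; it assumes a frame-wise Gaussian kernel with σ = 30 as in the text:

```python
import numpy as np

def dtak(X, Y, sigma=30.0):
    """Dynamic time-alignment kernel between segments X (n x d) and Y (m x d).
    Recursion: U[i,j] = max(U[i-1,j] + K[i,j], U[i-1,j-1] + 2*K[i,j], U[i,j-1] + K[i,j]);
    the normalized score U[n-1, m-1] / (n + m) lies in (0, 1]."""
    n, m = len(X), len(Y)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * sigma ** 2))        # frame-wise Gaussian kernel
    U = np.full((n, m), -np.inf)
    U[0, 0] = 2 * K[0, 0]
    for i in range(n):
        for j in range(m):
            if i == j == 0:
                continue
            cands = []
            if i > 0:
                cands.append(U[i - 1, j] + K[i, j])
            if j > 0:
                cands.append(U[i, j - 1] + K[i, j])
            if i > 0 and j > 0:
                cands.append(U[i - 1, j - 1] + 2 * K[i, j])
            U[i, j] = max(cands)
    return U[-1, -1] / (n + m)

A = np.random.RandomState(0).randn(10, 3)
s = dtak(A, A)    # a segment aligned with itself scores exactly 1
```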

Unsupervised clustering
Hierarchical clustering 13,14 was used to cluster the 3D embedding. The pairwise distance matrix D between the embedding points E_z was calculated using a standardized Euclidean distance 15. Next, we used the inner squared distance 16 to obtain the linkage of D. The number of clusters, used to cut the hierarchical cluster tree, was set to 11 and 41 for single- and multi-video embedding clustering, respectively. Hence, we obtained the 3D embedding labels and the movement space for the further analysis of mouse movements.
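This clustering step can be sketched with SciPy's hierarchical-clustering tools, which mirror the choices described (standardized Euclidean distance, inner-squared-distance i.e. Ward linkage, fixed cluster count). The toy embedding below is an assumption, not the paper's data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.RandomState(0)
# Toy 3D embedding: two well-separated groups standing in for movement types
E = np.vstack([rng.randn(20, 3), rng.randn(20, 3) + 8.0])

D = pdist(E, metric='seuclidean')                 # standardized Euclidean distance
Z = linkage(D, method='ward')                     # inner squared distance (Ward) linkage
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into a fixed number of clusters

# the two groups receive two distinct labels
```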

Clustering quality index (CQI)
Supplementary Information -A Hierarchical 3D-motion Learning Framework for Animal Spontaneous Behavior Mapping

To quantify the quality of clustering, we defined the CQI (Supplementary Fig. 10). Taking the DTAK distance matrix as the basis, we computed a raw quality score Φ. Lastly, we normalized Φ to [0, 1] with a sigmoid function to reduce the dynamic range of Φ and calculated the CQI as CQI = sigmoid(Φ).

Clustering quality correlation analysis
To compare the clustering similarity of inter- and intra-cluster movement phenotypes simultaneously, we applied a correlation analysis with linear regression 17. The feature vector of each behavioral module (an n × 1 vector) in the DTAK distance matrix (n × n) represented the distances from that module to every other behavioral module, so that all dimensions of the feature vectors share the same unit and can be compared. If two behavioral modules are more similar, their feature vectors should also be more strongly correlated. As such, we chose one behavioral module's feature vector as the reference and randomly selected the feature vectors of other behavioral modules as the targets for comparison.
Each target together with the reference formed an n × 2 list, which was plotted in the two-dimensional space for linear regression and correlation analysis. The higher the R value of the linear regression, the higher the correlation of the two behavioral modules.
For intra-cluster comparisons, we sequentially chose each behavioral module as the reference and randomly selected the feature vectors of three other behavioral modules from the same cluster as targets, plotting them in the two-dimensional space until all references were plotted with targets. Next, we performed linear regression.
For inter-cluster comparisons, we sequentially chose each behavioral module as the reference and randomly selected the feature vectors of three other behavioral modules outside the reference's cluster. We then plotted them in the two-dimensional space as gray points until all references were plotted with targets. Similarly, we performed linear regression. Notably, only the regression lines of the intra-cluster comparisons are plotted in Fig. 5b.
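The per-pair regression and R value can be sketched as follows. This Python/NumPy sketch is illustrative; the toy matrix `T` and the function name `module_correlation` are assumptions:

```python
import numpy as np

def module_correlation(T, ref, target):
    """Pearson R (and regression line) between two behavioral modules'
    feature vectors, i.e. their rows of the DTAK distance matrix T."""
    x, y = T[ref], T[target]
    slope, intercept = np.polyfit(x, y, 1)   # linear regression y ~ slope*x + intercept
    r = np.corrcoef(x, y)[0, 1]
    return r, slope, intercept

rng = np.random.RandomState(1)
base = rng.rand(50)
T = np.vstack([
    base + 0.01 * rng.randn(50),   # module 0: near-duplicate of base
    base + 0.01 * rng.randn(50),   # module 1: similar to module 0
    rng.rand(50),                  # module 2: unrelated
])
r_same, _, _ = module_correlation(T, 0, 1)
r_diff, _, _ = module_correlation(T, 0, 2)
# similar modules yield a higher R value than dissimilar ones
```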

Mapping of new behavioral data to UMAP template
To compare the behavioral patterns of Shank3B mutant mice under different conditions, the 84 sets of behavioral data were mapped into the same feature space. Processing such large data is often limited by computing resources. Therefore, we used the existing UMAP space constructed from the behavioral modules of 6 Shank3B+/+ and 6 Shank3B−/− mice (data collected in an earlier experiment) as a template, and then used the UMAP toolbox in MATLAB with the default parameters to map the 84 behavioral datasets to the template 7,18. Following the mapping, the categories of the new behavioral modules were assigned with K-nearest-neighbors in the low-dimensional embedding.
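The K-nearest-neighbors assignment step can be sketched as follows. This plain-NumPy sketch stands in for the MATLAB KNN step (the UMAP transform itself is omitted); the function name `knn_assign` and the toy embedding are assumptions:

```python
import numpy as np

def knn_assign(template_emb, template_labels, new_emb, k=5):
    """Assign each newly mapped behavioral module the majority label of its
    k nearest neighbours in the template low-dimensional embedding."""
    out = []
    for p in new_emb:
        d = np.linalg.norm(template_emb - p, axis=1)
        nearest = template_labels[np.argsort(d)[:k]]
        vals, counts = np.unique(nearest, return_counts=True)
        out.append(vals[np.argmax(counts)])
    return np.array(out)

# Toy template: two labelled clusters in a 2D embedding
rng = np.random.RandomState(2)
emb = np.vstack([rng.randn(30, 2), rng.randn(30, 2) + 6.0])
lab = np.array([0] * 30 + [1] * 30)
new = np.array([[0.1, -0.2], [6.2, 5.9]])
pred = knn_assign(emb, lab, new)
# each new module inherits the label of the cluster it lands in
```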

Moving intensity (MI)
To quantify the movement of each body part, we defined the MI. We considered each body part as a particle, whose moving intensity correlates with velocity and accumulates over time; the intensity of each movement segment is the summation across time, similar to the concept of energy. Therefore, we defined the MI analogously to kinetic energy in physics, the energy possessed by an object during its motion:

$$E_k = \tfrac{1}{2} m v^2$$

where E_k is the kinetic energy, m is the mass of the object, and v is the velocity of the object.
Considering that the body parts have unit mass, and to simplify the calculation, we defined m_p = ½m as the mass of a body part, with the value of m_p being 1. The kinetic energy was then transformed to MI, and the MI of a movement in different coordinates was calculated as

$$E_{xy}=\frac{1}{NT}\sum_{n=1}^{N}\sum_{t=1}^{T}V_{xy}(n,t)^2,\qquad E_{yz}=\frac{1}{NT}\sum_{n=1}^{N}\sum_{t=1}^{T}V_{yz}(n,t)^2,\qquad E_{all}=\frac{1}{NT}\sum_{n=1}^{N}\sum_{t=1}^{T}V_{xyz}(n,t)^2$$

where E_xy is the mean MI in the XY coordinate plane, E_yz is the mean MI in the YZ coordinate plane, E_all is the mean MI in the XYZ coordinate space, N is the dimension of the movements, V is the velocity matrix of the different movements, and T is the temporal length of V.
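A minimal sketch of the mean-MI computation for one plane (illustrative Python/NumPy; the array layout and units are assumptions):

```python
import numpy as np

def moving_intensity(V):
    """Mean MI of a movement: squared speed (kinetic energy with m_p = 1)
    summed over frames and averaged over dimensions and time.
    V has shape (T, N): T frames by N velocity components."""
    T, N = V.shape
    return np.sum(V ** 2) / (N * T)

# Velocities of one body part in the XY plane over 4 frames (assumed units)
V_xy = np.array([[1.0, 0.0],
                 [0.0, 2.0],
                 [1.0, 1.0],
                 [0.0, 0.0]])
E_xy = moving_intensity(V_xy)   # (1 + 4 + 1 + 1) / (2 * 4) = 0.875
```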
After the MI of each body part in the XY (horizontal) and YZ (vertical) plane coordinates was calculated, it was visualized as a heat map. This step aimed to assess the position-correlated MI of the body parts describable by the MI value. Finally, the weighted MIs were plotted as a heat map to observe the movement areas of each body part and depict the movement intensity at specific positions.

Traditional behavioral analysis
To quantify mouse behaviors in the circular open field test, we first plotted velocity-trajectory maps for behavioral visualization and then calculated the mean velocity, maximum velocity, locomotion velocity 19, and mean anxiety index 20 of Shank3B−/− and Shank3B+/+ mice (Supplementary Fig. 11). We set V = {V_1, V_2, ..., V_n} as the velocity vector.
The locomotion velocity V_l was calculated as

$$V_l = \frac{1}{M}\sum_{V_i > V_{thres}} V_i$$

where V_thres is the threshold on V_i used to select the locomotion velocities from V, set to 190, and M is the number of velocity samples above the threshold. The mean anxiety index, representing the anxiety level of the mice, was calculated following ref. 20.
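The thresholded locomotion-velocity computation can be sketched as follows (illustrative Python/NumPy; the sample values and units are assumptions, with V_thres = 190 as in the text):

```python
import numpy as np

def locomotion_velocity(v, v_thres=190.0):
    """Mean of the velocity samples above the locomotion threshold;
    returns 0 if no sample exceeds the threshold."""
    loco = v[v > v_thres]
    return loco.mean() if loco.size else 0.0

v = np.array([50.0, 210.0, 300.0, 120.0, 250.0])
v_l = locomotion_velocity(v)   # (210 + 300 + 250) / 3
```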

State transitions analysis
For the behavioral analysis of continuous long-term monitoring, we applied a state transition analysis to the behavioral modules (Supplementary Fig. 13).
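A minimal sketch of a state-transition computation over a behavioral-module label sequence (illustrative Python/NumPy; the row-normalized transition matrix is a common choice for this kind of analysis, not necessarily the authors' exact formulation):

```python
import numpy as np

def transition_matrix(labels, n_states):
    """Row-normalized transition probabilities between consecutive
    behavioral-module labels (integers in [0, n_states))."""
    M = np.zeros((n_states, n_states))
    for a, b in zip(labels[:-1], labels[1:]):
        M[a, b] += 1
    row = M.sum(axis=1, keepdims=True)
    row[row == 0] = 1.0            # avoid division by zero for unvisited states
    return M / row

seq = [0, 0, 1, 2, 1, 0, 1]
P = transition_matrix(seq, 3)
# each visited state's row of P sums to 1
```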

Supplementary Tables
Supplementary Table 1 | Definition of the behaviors (ethogram) for manual labeling

Behavior | Definition
Running | The mouse locomotes at relatively high speed.
Trotting | The mouse locomotes at a slow and intermittent pace.
Stepping | The mouse takes a step forward with a short-distance locomotion.
Diving | The mouse bends down its body from a standing state, then stretches forward.
Sniffing | The mouse investigates the environment with the nose held in the air or closely contacts the environment with the nose.
Rising | The mouse rises from four legs on the ground to stand steadily on its hind legs.
Right turning | The mouse bends its body to the right or turns to the right while walking.
Up stretching | The mouse stands on its hind legs while the front part of the body stretches back and forth.
Falling | The mouse uses its croup as a pivot and gets down with a straight back from standing on its hind legs.
Left turning | The mouse bends its body to the left or turns to the left while walking.
Walking | The mouse locomotes at relatively low speed.
Rearing | The mouse stands on its hind legs, and the back is straight.
Hunching | The mouse stands on its hind legs while the back is bent.
Self-grooming | The mouse licks its fur, grooms with the forepaws, or scratches with any limb.

In all 100 pairwise comparisons, m is the number of times that KO individuals showed higher fractions of behavior module X than WT individuals.