Multi-level modeling with nonlinear movement metrics to classify self-injurious behaviors in autism spectrum disorder

Self-injurious behavior (SIB) is among the most dangerous concerns in autism spectrum disorder (ASD), often requiring detailed and tedious management methods. Sensor-based behavioral monitoring could address the limitations of these methods, though the complex problem of classifying variable behavior should be addressed first. We aimed to address this need by developing a group-level model accounting for individual variability and potential nonlinear trends in SIB, as a secondary analysis of existing data. Ten participants with ASD and SIB engaged in free play while wearing accelerometers. Movement data were collected from > 200 episodes and 18 different types of SIB. Frequency domain and linear movement variability measures of acceleration signals were extracted to capture differences in behaviors, and metrics of nonlinear movement variability were used to quantify the complexity of SIB. The multi-level logistic regression model, comprising of 12 principal components, explained > 65% of the variance, and classified SIB with > 75% accuracy. Our findings imply that frequency-domain and movement variability metrics can effectively predict SIB. Our modeling approach yielded superior accuracy than commonly used classifiers (~ 75 vs. ~ 64% accuracy) and had superior performance compared to prior reports (~ 75 vs. ~ 69% accuracy) This work provides an approach to generating an accurate and interpretable group-level model for SIB identification, and further supports the feasibility of developing a real-time SIB monitoring system.

Dynamical systems analyses separate variability in a movement process into chaotic vs. deterministic variability components 36,38 . On the other hand, non-linear measures based in chaos theory and dynamical systems analysis consider the evolution of changes in a system over time.
Prior work classifying SMM in ASD has used time-and frequency-domain features 25 , or has focused on relatively simple measures of variability such as the standard deviation and variance, with the latter yielding frequent false positives 27 . Nonlinear movement variability features could improve classification model performance by capturing the underlying variability in SIB movements 39 , even if the SIB changes between episodes. Dynamical systems theory may be relevant to SIB, since nonlinear components were found in the temporal patterns of SIB, though differing within and between individuals 40 . More complex temporal patterns also emerged in the presence of SIB 40 , and recent work suggests that movements become increasingly complex as a child with ASD transitions to an episode of SIB 41 . This complexity can be captured by measuring nonlinear variability in the movements of individuals with ASD and SIB, as further explained below in the description of nonlinear motor variability metrics.
Prior work also has found that nonlinear measures, such as entropy, are indicative of diagnosis when applied to motor control in ASD. For example, children with ASD had decreased dynamical complexity during quiet stance compared to typically-developing children 42,43 , although people with stereotypy showed greater linear variability (standard deviation) during postural sway 43 . Given that they can distinguish between neurotypical and pathological movements, these methods could capture changes in pathology within an individual (i.e., detecting health changes such as early signs of aggression). Variability has also been associated with other pathological behaviors 35,36,44 and the progression of health conditions 45,46 , and thus could reflect changing risk of SIB in ASD as well. To our knowledge, though, only one study employed a nonlinear approach (recurrence quantification analysis, described below) to classify motion in ASD, and found that the additional nonlinear features of movement variability improved classification accuracy by 5-9% 39 . Although the analysis in Großekathöfer et al. 39 was performed on SMM, their results could generalize to SIB, which is similarly repetitive and rhythmic.
In summary, SIB is one of the most dangerous behaviors in ASD, and a monitoring system could address the limitations of traditional tracking methods. Predictive modeling with features capturing nonlinear motor variability has the potential to provide superior performance vs. more traditional methods used in related ASD research. However, sensor-based behavioral monitoring for SIB has not yet been explored. A long-term goal of our research is to develop a real-time SIB monitoring system that can collect continuous movement data, alert the caregiver before SIB onset, and assist in management methods (e.g., redirecting the individual with SIB towards a different task). To this end, we aimed in the present study to develop an interpretable and generalizable model to classify a variety of behaviors among a range of participants, specifically by: 1. Utilizing dynamical systems theory to extract measures of nonlinear motor variability as features in an SIB prediction model 2. Building a multilevel logistic regression model with variable intercepts and slopes to account for interindividual variability Materials and methods participants. Data used here were obtained in Cantin-Garside et al. 34 and are briefly summarized below, with additional information in Supplementary Material. Children with SIB and ASD were recruited through the university-affiliated psychology clinic and through the authors' networks. Caregivers were pre-screened to confirm inclusion criteria: (1) children aged 5-14 years, reflecting heightened aggression in childhood 47,48 ; (2) diagnosis of ASD; (3) SIB episodes > 3/hour, to ensure multiple episodes during the 1-3 h sessions; (4) fluency in English; and (5) home within driving distance of the noted Center. Note that the last inclusion criterion required non-representative convenience sampling. All adult participants provided informed consent, and qualifying children (> 7 years of age and of developmental level) provided assent before any data collection. Caregivers provided informed consent for their children who were younger than 18 years of age. The Virginia Tech Institutional Review Board (IRB) reviewed and approved all experimental procedures. All experimental methods were completed in accordance with relevant guidelines and regulations. Eleven participants (5-14 years, M = 9.5, SD = 3.0) and their caregivers completed the study. Sessions lasted 35-147 min, providing more than 1000 min of data and > 200 episodes of SIB. Ten of the 11 participants exhibited SIB (participants 1-4 and 6-11, denoted as "P#") with 18 different types (Table 1). All participants wore the wrist sensor, and the limited sensor configurations of P1, P8, and P9 precluded the use of other sensors in the group-level model. To include all participants in one group-level model, only the wrist sensor was considered. In contrast, data from 2 to 6 tri-axial accelerometers (Table 1) were used in individual-level models.
Study overview. After obtaining consent, the lead author, or the caregiver guided by the author, secured sensors on the child where tolerated. Demographic information was obtained, including potential SIB triggers identified by the caregiver. A trained clinical psychology doctoral candidate confirmed ASD diagnosis using standard tools (i.e., ADOS-2) 49 . The examiner was research reliable in administration and scoring of the ADOS-2, and was supervised by a licensed clinical psychologist who was also reliable (SWW). Subsequently, movement sensors (see "Instrumentation"), video cameras, and 2-3 observing researchers monitored each child during free-play. Researchers instructed the caregivers to respond to SIB as if at home. If sufficient SIB episodes failed to occur during free play, caregivers had the option to prompt SIB in a controlled fashion with a commonly-used procedure (Standardized Observation Analogue Procedure, SOAP) 50 . The session ended when either: (a) > 3 episodes of SIB were observed, or (b) participants or researchers stopped the session to prevent escalating behav- instrumentation. Tri-axial accelerometers (ActiGraph GT9X Link, www.actig raphc orp.com) were used to track participant movement (sampling frequency = 60 Hz) throughout the session. Earlier work found that these particular sensors were both reliable and accurate when used with children and adolescents 51 for tracking movement among both pathological and healthy populations 52 . A maximum of six sensors were placed on/in the wrists, waist, pockets, and ankles as accepted by the participant. Sensor choice and placement reflected prior research that found high reliability and high accuracy when classifying activities with movement sensors on either the wrist or torso 25,52,53 . Ankle sensors were also included as potentially necessary to capture lower-body injurious behaviors 18 . Three Go-Pro cameras and an overhead camera recorded videos for each child as "ground truth". Wrist, waist, pockets, ankles Self-hitting (10) 80 Pulling teeth (11)  33 Eye-gouging (jabbing eye with hand) (12) 79 Jabbing pelvic region (13)  16 Jabbing throat -location of prior tracheotomy (14)  46 Hitting chin/jaw with heel of hand (15)  Hands to surface (2) 9 Wrist, waist, pockets, ankles Finger Picking (4) 9 Scratching (5) 2 Head to wall (8) 20 Self-biting (9) 39 Self-hitting (10) 4 Eye-gauging (12) 58 Pulling ear (22)  www.nature.com/scientificreports/ Data processing and analysis. Sensor data were exported into MatLab (R2018a, MathWorks), which was used for data analysis and modeling (using an Intel, dual-core, 2.9 GHz CPU). Accelerometer data were labeled as non-SIB events (0) or SIB (1) using the ground truth video data and annotations from in-session observations. Before the session began, members of our research team discussed the SIB that parents described during pre-screening. Behaviors were defined by watching the children individually and captured by terms provided by their caregivers (e.g., eye-gouging). Members of our team also discussed behaviors observed in sessions, both during and after the session, for consensus building. Behavioral definitions were further clarified prior to data labeling. Multiple researchers annotated and discussed the video data before labeling raw accelerometry files (see Fig. 1 for an overview of the modeling process, and Cantin-Garside et al. 34 for details on consensus-building for SIB labels) 34 . Raw sensor data were filtered using a 4th order, low-pass, recursive Butterworth filter, with a cutoff frequency of 20 Hz. Filtered data were used to obtain time-and frequency-domain features. Based on prior work 54,55 , raw data were used for extracting features of nonlinear motor variability (see "Derivation of nonlinear metrics of motor variability"). For continuous analysis of discrete data, all data were segmented into 2-s sliding windows with a 1-s overlap 7,27 . This short time window was used to minimize delays, which was considered important for real-time monitoring and reflects the potential for relatively short "bursts" of SIB. feature extraction. Three sets of features were extracted: (1) features in the time-domain; (2) features in the frequency-domain; and (3) nonlinear metrics based on chaos theory and dynamical systems. As discussed above, the use of time-and frequency-domain features is supported by prior findings on classifying SMM 7,18,22,27 , with nonlinear motor variability features included to capture the dynamical complexity of motion and to improve classifier performance 56,57 . Table 2 lists the features extracted for each channel. The presence (1) or absence (0) of a prompt to instigate SIB (from SOAP) was also included initially during feature selection (see "Feature selection"); all caregivers except for Participant 4 opted to use SOAP at least one time during their session.
Derivation of nonlinear metrics of motor variability. Nonlinear metrics of motor variability were extracted by first reconstructing the phase space of the raw sensor data 36,58 . Phase space represents the states ("state space") of dynamical system behavior in a plot, and this reconstruction involves creating M copies of the original time series x , where M is the embedding dimension, using a time delay (τ ) 36,59 . Time delay was determined using two methods: (1) the first minimum of the average mutual information function 60,61 ; (2) the delay when the time series autocorrelation was less than e -156, 62 . Time delay was determined using both methods separately for SIB and non-SIB events (see 36,56,59 for additional details). The selected τ was the value for which both methods converged, and was similar for both SIB and non-SIB ( τ = 5) . The resulting embedding dimension (M) was 4, derived from the global false nearest neighbor analysis method 61,63 . Both τ and M values here are similar to prior work on the nonlinear variability of human motion 61 . State space was reconstructed as embedding vectors X(t) in the form:  www.nature.com/scientificreports/ such that t = 1, . . . , N − (M − 1) . Delay reconstruction was used to create the phase space, which produced consistent results in other work 61 . The parameters described below were then calculated using the reconstructed phase space.
Entropy. Sample entropy (SaEn) was used to quantify the complexity of the acceleration signals, with low values indicating low complexity 56,64,65 . SaEn was calculated using the following steps: Cross-sample entropy. Cross-sample entropy has not been explored in models classifying ASD motion, though was determined here between each sensor channel and used in our feature set. Cross-sample entropy parallels SaEn in its estimation, but examines the difference between one data stream and another data stream 56,66 . Lower values imply similarity and synchronicity between the two data streams 64 .
Recurrence quantification analysis. Recurrence quantification analysis (RQA) was performed with a MatLab toolbox 67,68 to evaluate phase space predictability and intermittency 56,69 . An RQA map is first constructed through a distance matrix comparison. A distance matrix (DM) consists of elements ( DM ij ) that are Euclidian distances ( DM ij = d[X(i), X j ] ) between embedding vectors X(i) and X j 56 . DM ij elements are then compared against a threshold determined by recurring dynamical trajectories, with elements = 1 for DM ij < threshold, indicating recurrent points returned to a previous location, and = 0 otherwise 39,56 . The selected threshold guarantees that the percentage of recurrent points remains within 0.1-2% of the total recurrent elements 56 . RQA can be evaluated using several measures 56,69,70 , with the following selected as reliable for human subject research 56,66,69-71 : 1. Recurrence-regularity of the time series as the percentage of recurrent points 2. Determinism-percentage of consecutive, diagonally-aligned recurrent points indicating signal periodicity and predictability; this relates to the inverse of the largest positive Lyapunov exponent, because longer diagonals imply deterministic versus chaotic movements 3. Laminarity-percentage of vertically aligned recurrent points, indicating signal stability (similar to determinism) 4. Divergence-inverse of the maximum diagonal line segment, related to the maximal Lyapunov exponent 5. Maximum diagonal length-proportional to the inverse of the maximal Lyapunov exponent, indicating the longest duration of periodicity 6. Trapping time-mean vertical line length, indicating the duration of the trapped state, reflecting signal constancy Detrended fluctuation analysis. Detrended fluctuation analysis (DFA) was used to quantify the persistence of SIB movements. DFA exponents (α) were calculated for every time segment to assess long-range correlations 36,72 , with persistence indicated by α > 0.5 for time series deviations that continue in the same direction, and antipersistence by α < 0.5 for deviations that continue in the opposite direction 72 . DFA has been used in analyses of motor control for ASD, with evidence of long-range correlations (persistence) during a drawing task 73 . . DFA has also been applied to capture the predictability of a movement, specifically when walking 72,74 . Persistence typically degrades in pathology. The underlying long-range correlations are altered in disordered movement, compared to consistent correlations in healthy movement (Goldberger et al. 75 ). This finding remains evident irrespective of whether the outward appearance of pathological behavior appears more restricted or more chaotic 75 . feature selection. The least absolute shrinkage and selection operator (lasso) method was used to address multicollinearity, to remove redundant features, and to determine the sparsest model when considering all 96 features and all sensors 76 . This method directly selects variables that most contribute to the model. Principal component analysis (PCA) was then used for dimensional reduction, by finding the optimal combination of the 59 selected variables (see Supplementary Materials, Table S2 for further information on variable selection and loadings) 77 . This feature selection approach is capable of characterizing data despite high variability between participants. PCA output was subsequently used as input to a multilevel logistic regression model (MLR).

Regression modeling. A multilevel logistic regression (MLR) model was created with variable slopes and
intercepts. The latter were used to account for high inter-subject variability 32 , which could be particularly relevant for ASD. Further, this model relied upon data from other individuals when classifying an episode of SIB to improve the overall performance of the entire group. This aspect of the model is particularly relevant, as SIB episodes can be sparse, depending on the individual and type of SIB. Specifically, the model can be written as: where U j and V j,k are potential features, with corresponding linear coefficients γ ′ j specific to the individual level, and modeling error variances σ 2 .
evaluation. Data were balanced and randomly selected following a 8:2 training/validation:testing ratio, so as to build a robust model and to test the built model 78 . SIB events are relatively rare compared to non-SIB events, which leads to a skewed distribution as found in prior work with ASD-related behaviors 7,39,79 . SIBs here lasted for about two seconds at minimum, though more subtle movements, such as picking, lasted longer, which lasted ~ 10 to 90 s. SIB and non-SIB data were balanced as in other work to address skewness 28,39,[79][80][81] . Balanced data were used for training, and tenfold cross-validation was used 18,26,30 . This validation method was implemented to reflect the likely use cases in SIB interventions, including training and validating a model using data from each unique individual. SIB management is highly specific to the individual and the demonstrated behaviors, thus requiring representative data. Movement classification for SIB must therefore reflect the need for highly customizable tracking methods and account for heterogeneity. Two datasets were used for testing model generalization: (1) balanced data; and (2) natural, unbalanced data reflecting the ratio of SIB:non-SIB in the complete dataset. These testing methods were used to examine the potential use of a model pre-trained on controlled data for application to natural, unbalanced datasets. All data were randomly selected from across the duration of a given session, and observations were assumed to reflect the entire dataset 81 .
Outcome measures were calculated for each model (MLR, and the models described below), with classification performance (accuracy, specificity, precision, recall, and F-score) calculated for each classification method 82 . Training and testing time were also computed for all developed models to assess the potential for application to real-time monitoring.
Model comparisons. Additional group-level models were trained, validated, and tested, for the purpose of comparison with the MLR model ("MLR -variable intercepts and slopes"). These additional models were of five different types: 1. Logistic regression (LR) with variable intercept only ("LR-variable intercept"). This was used to compare the MLR with a less complex model, while still accounting for participant variability. 2. LR without variable slopes or intercepts ("LR-no variable terms"). This model was included to compare the MLR with a model that does not consider participant-level variation. 3. Two-way interaction model (stepwise LR), with included terms determined by BIC ("LR-stepwise"). This model was included to compare the MLR with higher-order, nonlinear models, and to evaluate the effect on accuracy when including terms with lower interpretability but potentially higher predictive power. 4. Participant-level LR models ("LR-ind"), one for each of the 10 who exhibited SIB. These models were included to compare the group-level MLR with highly specific modeling that may have low generalizability, yet high accuracy. 5. Several models using machine learning methods: k-nearest neighbors ("kNN"), with k = 11 selected through optimization; support vector machines ("SVM"), and decision trees ("DT"). These three models were included to compare the MLR with previously-employed methods demonstrating strong individual (though not group-level) performance in other ASD applications and our previous featureless work (Cantin-Garside et al. 34 ), and to compare the MLR with "black-box" models with lower interpretability but typically high predictive power.

Results
Dimensional reduction. Fifty-nine variables were selected from lasso and were then input in PCA for the (2) ,k X k , i = 0, 1, . . . I; j = 1, 2, . . . , J; k = 1, 2, . . . , K www.nature.com/scientificreports/ Table 4 summarizes results using the MLR. PC1 and PC12 were both significant features when considered across participants, though not when varying with participant level. The intercept, along with PCs 2, 5, 6, 7 and 9, were only significant features when considering participant levels, and not when fixed. PCs 3, 8, 10, and 11 were significant features both when fixed and when randomly varying with participant level. PC4 was not significant in the model, either when fixed or when varying with participant level. Tables 5 and 6 respectively summarize results regarding training time, accuracy, specificity, precision, recall, F-scores, and adjusted R 2 for validation and testing of group-level models. Training times for MLR, stepwise LR, and cubic SVM were 2-4 times longer than for other classifiers (10 -2 vs. 10 -6 s/ observation), though this same difference was not reflected in testing times (all times within 10 -6 -10 -5 s/prediction). MLR had high accuracy (74.7%) and F-score (0.752) in validation, which decreased minimally when testing with balanced data (73.2% and 0.733). Accuracy and F-score decreased with unbalanced test data for MLR (69.1% and 0.184). Specificity, precision, and recall were all highest for MLR in validation (~ 0.73 to 0.77) and testing (~ 0.73 for all three measures). Adjusted R 2 was the highest for MLR (0.502) and the lowest for LR without variable intercepts/slopes (0.106). When considering participant levels with only a variable intercept versus both variable intercept and slopes, most performance metrics decreased by ~ 2 to 5%. LR without variable intercepts/slopes had the lowest accuracy (64.0%) in validation, dropping to 47.0 and 56.1% for balanced and unbalanced data, respectively. Linear SVM had the lowest specificity and precision (0.599 and 0.631), while LR without variable intercept/slopes had the lowest recall (0.663), though this trend did not extend to testing results.

Classifier performance.
Stepwise LR had the lowest specificity for both balanced and unbalanced test data (0.526 and 0.552, respectively). LR without variable intercepts/slopes had the lowest precision, recall, and F-score for balanced test data (0.455,  www.nature.com/scientificreports/ 0.304, and 0.365, respectively). The kNN classifier had the lowest precision, recall, and F-score for unbalanced test data (0.064, 0.591, and 0.116, respectively), while LR without variable intercept/slopes had the lowest F-score for validation (0.648) and balanced test data (0.365). Tables 7 and 8 show the classifier performance for validating and testing individual models, respectively. Time was on the order of 10 -5 -10 -3 s/observation for training and 10 -4 -10 -3 s/prediction for testing. Validation accuracy was higher overall (70.7-97.9%) when compared to group-level models (64.0-74.7%). Testing accuracy ranged widely, from 50.0-83.3% for balanced data to 27.2-95.7% for unbalanced data. Note that the unbalanced dataset, relative to the balanced dataset, did not lead to a substantial decrement in accuracy from validation to test set because there are substantially more behaviors labeled as non-SIB than SIB, and hence the label of non-SIB is easier to predict. Specificity ranged from 0.648-1 for validation datasets, and from 0.500-1 for balanced and unbalanced test data. Precision was between 0.692-1 for validation data, 0-0.818 for balanced test data, and 0-0.5 for unbalanced test data. The ranges of recall values were 0.746-1 for validation data and 0-1 for both balanced and unbalanced test data. F-scores ranged from 0.718-0.979 for validation data, 0-0.857 for balanced test data, and 0-0.667 for unbalanced data.

Discussion
We developed an interpretable model to identify diverse types of SIB among a range of participants. Traditional time-and frequency-domain features were used, along with features capturing nonlinear motor variability, as input to a multi-level logistic regression model capable of detecting SIB at the group level, with selected components from dimension reduction explaining > 65% of the data variance. The lasso method did not select the prompt variable for this group-level model (recall that this prompt represented the presence or absence of caregiver actions that were targeted at instigating SIB), indicating that this model explains the presence of SIB www.nature.com/scientificreports/ beyond an identified SIB trigger. This finding is consistent with a prior report 40 that temporal patterns in SIB occur independent of behavioral or environmental influences (or "triggers"). Nonlinear motor variability features (e.g., from RQA and entropy) loaded on PCs that accounted for > 10% of the explained variance in the dataset. DFA features had moderate loadings on PCs, and these features were a novel addition to modeling for ASD. Descriptive statistics of metrics of nonlinear motor variability from pooled SIB versus non-SIB events across participants showed little difference between the behavioral classes (Supplementary Materials, Table S1), which may explain the poor performance of a general group-level model without participant levels. However, upon examining one of the most severe behaviors (head banging) in one participant, nonlinear motor variability of SIB differed from non-SIB events (Supplementary Materials, Table S3). For example, DFA exponents for both non-SIB and SIB events in the Y and Z axes were slightly anti-persistent (< 0.5), indicating changes evolving in different directions over time. Though exponents remained < 0.5, there was a slight increase in DFA exponents for the Y and Z axes for SIB events compared to non-SIB, indicating more persistence in SIB events. Differences among other nonlinear metrics were evident for head banging in P9. Recurrence rate, for example, decreased for head banging in this individual compared to non-SIB events, indicating less regularity in SIB data. This finding opposes the common perception that repetitive behaviors are "regular". There was a slight decrease in Table 7. Validation results for individual participants. Note that 1a = first part of P1 session with only upper body sensors, and 1b = second part of P1 session with additional lower body sensors. www.nature.com/scientificreports/ sample entropy for SIB in the Y and Z axes compared to non-SIB events, suggesting lower levels of complexity; however, cross-sample entropy increased during SIB, indicating higher levels of complexity between two channels of data. These findings may indicate that SIB occurs due to over/understimulation to seek system stability 83 ("less" or "more" complexity) 84 (see Mazefsky et al. 83 for a review of emotional regulation in ASD, and Stergiou and Decker 84 for a review of nonlinear dynamics and pathology). Together, these results suggest that underlying nonlinear trends exist in the movements occurring during SIB. However, classic time-and frequency-domain features had the highest loadings on the first PC. These loadings indicate that jerk and FFT peaks are the strongest features of SIB, although nonlinear trends appear to differ between SIB and non-SIB. Consistent outcomes were found in prior work that used time-and frequency-domain features to accurately classify SMM 7,18,22,27 , and suggest that such features should be the first ones considered when creating a model to predict SIB. Further considerations for SIB modeling include variable intercepts and slopes. LR without variable intercepts or slopes performed inferiorly to LR with a variable intercept and inferiorly to MLR with both variable intercept and slopes. MLR with both variable intercept and slopes performed superior to all other classifiers, including commonly used machine learning algorithms implemented in other work 7,26 . These results imply that inter-individual variability also contributed to dataset variance.
This study is the first, to our knowledge, to incorporate nonlinear variability in addition to traditional time-frequency metrics to explain the variance in SIB movement data. Our findings suggest that movements in SIB can be described as a dynamical system with long-term deviations, which is consistent with prior evidence that stereotypical motor movements in ASD can be accurately detected using nonlinear features from RQA 39 . Similarly, our feature extraction revealed higher loadings for RQA features when compared to other nonlinear factors, and the associated principal components were significant in our MLR model. Our findings, along with those of 39 , indicate that RQA metrics could be critical in detecting repetitive and rhythmic motor movements (such as stereotypical motor movements, and SIB) in ASD.
Further, PCs with nonlinear variable loadings were significant in the MLR model. PC6 and PC9, with loadings primarily from RQA metrics from the Z and Y axes, respectively, were significant in the model only when randomly varying with participant level; these metrics were not sufficient to classify SIB unless including variable slopes, which indicates that nonlinear movement aspects are specific to each individual. Time-and frequencydomain features that loaded on PC1 (frequency components, jerk, and peak/minimum) were significant features in MLR when independent from participant levels, indicating that these variables could be predictive of SIB without considering individual variability; the only nonlinear metric of motor variability to which this finding applied was PC12 (cross-sample entropy). Other PCs with loadings from metrics of nonlinear motor variability (RQA for X on PC8, sample entropy on PC10) were significant only when considered as either a fixed or a variable effect, implying that features such as sample entropy vary consistently between participants while still explaining individual-level variability. Nonlinear measures were significant features of SIB when they varied with participants, which supports prior evidence that nonlinear components of movement are specific to individuals 85 .
We believe the current group-level model is the first to achieve accuracy of ~ 75% when identifying SIB among a diverse group of behaviors and participants. Previous research on classifying other repetitive motor movements 7,22,[24][25][26][27][28][29]39 and aggression 18,30 has evaluated specific models trained and/or tested only on individual participants, and performance dropped from 80% with individual models to 69% when applied to the group of participants 30 . Though a similar decrease was also evident here, percent accuracy was ~ 6% higher than earlier group-level results. The increased performance we found may be due, at least in part, to the use of feature selection and dimensional reduction methods, along with the multi-level properties of our model that accounted for inter-individual variability. Also, a larger sample of participants was included here, compared to earlier modeling reports of ASD behaviors, with samples ranging from one to six 7,17,22,[25][26][27]29 . Further, our participant pool encompassed 18 different behaviors across children 5-14 years of age, suggesting potential generality to a wider sample of children with ASD and SIB. Note that the 18 types of SIB mentioned in the study are a reflection of our study cohort and were an attempt to subtype SIB. We were not seeking to identify underlying subclasses of SIB, but instead to classify behaviors (as demonstrated naturally by our study cohort) automatically using technology (sensors and machine learning), rather than manually with traditional observation methods.
As in the work of 30 , regression showed promising results here compared to other classifiers; however, these earlier authors focused on aggression towards others, whereas our study applied regression on SIB data. MLR here had higher accuracy, specificity, precision, and recall compared to several commonly-used machine learning algorithms. These machine learning algorithms also detected SIB with high accuracy in our previous study using featureless data, though accuracy greatly decreased at the group-level (see Cantin-Garside et al. for further details) 34 . Multi-level regression with both variable slopes and intercept may be preferred for group data with variable behaviors that could be specific to an individual, and it may also be more accessible to interpretation than other machine learning algorithms. MLR had classifier performance superior to LR without varying intercept/ slope, further emphasizing the potential importance of individualizing models for participants with ASD and SIB. MLR, though, was only inferior to several highly-specific participant models, which may not generalize beyond the participant.
The widely varying performance metrics across participants, however, could account for the unexplained variance in MLR. Several participants had either near perfect detection (e.g., P1) or quite poor detection (P8 or P9) in testing. This wide range could have resulted from the inconsistent amount of SIB data included (P9 had the shortest session of all participants), the different types of SIB, or the variable sensor configurations between participants. Of note, MLR only included the one sensor worn in common by all participants: the wrist sensor. This single wrist sensor may not be sufficient for all SIB types, such as head banging or kicking, and therefore might have led to decreased performance measures in the group model. Yet, despite having only one sensor to incorporate in the group model, MLR still showed superior performance to all other tested classifiers. Group-level www.nature.com/scientificreports/ classifiers may be more practical (i.e., efficient and generalizable) to implement in real-world applications, and the current results are promising for automatically identifying diverse SIB with minimally-invasive technology.
Limitations and future work. Outcomes here provide initial groundwork toward creating a group-level classifier for SIB classification, yet further research is needed in several respects. Such work should include expanded data to improve classifier performance and to extend the model for more individuals with ASD, given the highly heterogeneous ASD diagnoses and the lifelong pervasiveness of ASD and SIB. SIB presentation can be extremely variable, including in its duration, and thus the current study is limited in size and scale (though more extensive than comparable existing studies). SIB data here may not have been sufficient for some participants, such as P8, when SIB episodes are few and/or short (< 100 s). There would thus be value in monitoring SIB across several days (longitudinal recordings) to capture additional episodes across different contexts, as well as expanding the study with a larger sample. Data from other episodes may also help increase explanatory power for the MLR, though accuracy is perhaps more critical for online detection of SIB. Although recall, sensitivity, and accuracy remained relatively high when testing MLR with unbalanced, natural data, precision decreased. This decrease in precision indicates that a "quasi-balancing" may be required when implementing classifiers in a real-world settings. At present, this technology would be most useful in settings where individuals frequently exhibit SIB (e.g., school), as there would be a greater need for support in these settings and a more balanced SIB:non-SIB ratio. A classifier could be deployed when caregivers cannot maintain both tracking and behavior management due to the high frequency and/or intensity of behaviors. It may be possible to improve MLR by weighting terms based on the frequency of the behaviors, as well as based on caregiver perception of imminent danger. Other methods of improvement could include nonlinear terms with variable slopes, though this could decrease interpretability of the model. Additional levels could also be incorporated into the model, such as age, SIB type, frequency or intensity. The definition of frequency or the rate of SIB can vary, depending on the type of behavior and the observer (e.g., counting each hit or counting each set of hits); thus, frequency was not included to describe behaviors. Behavior was classified by the presence/absence of SIB in the time window versus using a defined frequency. Similarly, intensity was not quantified objectively, though doing so would be a valuable contribution for future iterations of sensing technology for ASD. Operational definitions of the frequency of SIB and associated intensity, though, would be necessary to establish ground truth for inter-rater reliability. Further, additional analysis of effects of observation time and duration on classifier performance could support decisions about data requirements for training and testing data. Our work supports the presence of nonlinear motor variability within SIB. However, several features of nonlinear motor variability (DFA) loaded only modestly (< ~ 0.1) on PCs, which may be due to the young ages of the participant pool. The long-range correlations quantified by DFA only develop in gait during late childhood 86 , so such correlations may not yet be evident in SIB movements when the participants are young. Age could be incorporated as a covariate in future work, which may show differences in feature importance, such as in DFA, among age groups. Older participants could show more explicit anti/persistence in pathological movement, which could lead to additional evidence of nonlinear motor variability in SIB. Dynamic movement signatures of individuals with ASD could provide information to detect pathology, such as movements involved in SIB, before typical diagnostic measures 85 , and could explain the pattern of SIB onset. These individual movement signatures might also reveal trends about intentions that underlie SIB movements, such as whether the motion is goal-directed or spontaneous 85 , or the etiology of ASD, through mapping movement characteristics to underlying mechanisms of movement 87 . With additional information about ASD movement signatures, variability components (quantified with metrics such as RQA and entropy) could be the basis for an intervention to promote self-awareness and intentional movements in ASD 88 . Specifically, if SIB is a deterministic process, nonlinear motor variability metrics could capture the convergence or divergence of the repeated motions that may indicate the onset of SIB from sensor data. If values from such metrics surpass a certain threshold, the child would be considered at risk for starting an SIB. Stimuli (e.g., visual or auditory signals) could provide feedback to alert the child and divert the child's attention to alternative coping mechanisms (e.g., a breathing app, squeezing a sensory object, or feeling a certain texture).
In future work, we plan to use the current findings to build more sophisticated hierarchical models. One useful addition might be including Bayesian priors for individual-specific information. If such models improve accuracy while retaining interpretability, they could be used to determine the necessity of intervention at the earliest indicator of an event. Specifically, the predicted probability score from logistic regression can serve as a criterion for caregiver interjection, by setting a pre-determined threshold (e.g., caregivers should interject if the probability of SIB > 0.8). A monitoring system could also include real-time estimation of variable parameters (intercept and slope) for each individual with ASD and SIB. Parameter coefficients can be estimated by the empirical Bayes approach, which allows the mean value of the prior distribution to equal the mean of coefficients from the training data. Using new data in real-time, the posterior distribution can then be recomputed and updated for that participant. Continuously adapting features could both improve current models and address evolving behavior when tracking SIB. Alternatively, autoregressive models could be employed to account for the temporal dynamics of SIB. As in Rad et al. 28 , such an approach could address the common challenge of class imbalances in detecting SIB. However, applying more advanced models, such as an autoregressive model, would likely detract from the interpretability of the SIB prediction, and such a tradeoff would need to be considered carefully when developing a monitoring system.

conclusions
This work provides a framework for, and initial results obtained from, interpretable SIB classification at the group-level, particularly through introducing new features with variable slopes and intercepts in a multi-level classifier. A new application of nonlinear metrics to movement in SIB was employed, specifically to develop a group-level classification model. We found that both linear and nonlinear measures of motor variability and time/frequency-domain features, paired with feature selection and dimensional reduction, explained > 65% of the variance found in SIB movement data, and classified diverse SIBs among a group of 10 participants with ~ 75% accuracy. Our results are promising in terms of the feasibility of developing a continuous monitoring system for SIB that can be applied to different types of behaviors and a range of individuals. This work serves as a proof of concept for the utility of technology to track SIB in ASD, which is necessary to apply this work to future Phase 1 prevention efforts. Future work should continue to build on these results, with added consideration of prior distributions for adaptive modeling.