Artificial intelligence to improve back pain outcomes and lessons learnt from clinical classification approaches: three systematic reviews

Artificial intelligence and machine learning (AI/ML) could enhance the ability to detect patterns of clinical characteristics in low-back pain (LBP) and guide treatment. We conducted three systematic reviews to address the following aims: (a) review the status of AI/ML research in LBP, (b) compare its status to that of two established LBP classification systems (STarT Back, McKenzie). AI/ML in LBP is in its infancy: 45 of 48 studies assessed sample sizes <1000 people, 19 of 48 studies used ≤5 parameters in models, 13 of 48 studies applied multiple models and attained high accuracy, 25 of 48 studies assessed the binary classification of LBP versus no-LBP only. Beyond the 48 studies using AI/ML for LBP classification, no studies examined use of AI/ML in prognosis prediction of specific sub-groups, and AI/ML techniques are yet to be implemented in guiding LBP treatment. In contrast, the STarT Back tool has been assessed for internal consistency, test−retest reliability, validity, pain and disability prognosis, and influence on pain and disability treatment outcomes. McKenzie has been assessed for inter- and intra-tester reliability, prognosis, and impact on pain and disability outcomes relative to other treatments. For AI/ML methods to contribute to the refinement of LBP (sub-)classification and guide treatment allocation, large data sets containing known and exploratory clinical features should be examined. There is also a need to establish reliability, validity, and prognostic capacity of AI/ML techniques in LBP as well as its ability to inform treatment allocation for improved patient outcomes and/or reduced healthcare costs.


INTRODUCTION
Low-back pain (LBP) is the leading cause of disability worldwide 1 and is associated with annual economic costs up to AU $9.2 billion 2 and US $102 billion 3 in Australia and the United States of America, respectively. In addition to economic burden, multiple individual factors (e.g. loss of social identity 4 , distress 5 and physical deconditioning 6 ) contribute to pain intensity and disability in this population group 7 . Approximately 90% of people with LBP are classified as having 'non-specific' LBP, where no clear tissue cause of pain can be found 8 . However, we anticipate that people with non-specific LBP are not a homogeneous group, yet the challenge remains to identify potential sub-groups that could benefit from specific treatments to assist in reducing the burden of the condition 9 .
Artificial intelligence and machine learning (AI/ML) techniques have been used to improve the understanding, diagnosis and management of acute and chronic diseases 10 . Technological advancements, such as machine-learning algorithms, have increased the capacity to recognise patterns in data sets; these algorithms have been used successfully to classify individuals with liver disease and heart failure 10,11 and have found some application more widely in pain research 12 . However, the utilisation of such techniques in LBP, to date, is limited. The primary aim of this work was to conduct a systematic review examining how machine-learning tools have been used in LBP.
A classification approach or assessment tool that is implemented in clinical practice should have utility: be it for the patient (e.g. improved outcomes) and/or for the healthcare system (e.g. reduced costs). Any classification tool should ideally (a) be reliable, (b) be valid, (c) detect people who are likely to have a different outcome or prognosis and (d) when implemented in clinical practice, improve patient outcomes, reduce healthcare costs and reduce the burden of disease [13][14][15] . To illustrate the current status, and potential future direction, of AI/ML approaches to LBP, we contrasted them with two commonly implemented clinical classification approaches (McKenzie 16 and STarT Back 13 ). The McKenzie method has been extensively studied in randomised clinical trials (RCTs) and subsequent meta-analyses of LBP treatment 17 , while the STarT Back tool is currently recommended in national guidelines 18 . McKenzie is a classification method of diagnosing movement preferences (e.g. spinal extension versus flexion) based on symptom response (e.g. centralisation versus peripheralization of symptoms) 16 , while the STarT Back classifies people into low-, medium- and high-risk of developing persistent disabling symptoms based on physical and psychosocial factors 13 . A comparison of AI/ML utilisation to these existing clinical classification approaches can guide future work in subclassification of LBP using AI/ML, specifically allowing for the development of a more robust tool that has the potential to impact the burden of disease of LBP. Therefore, the primary aim was to systematically review the literature on AI/ML in LBP research, while a secondary aim was to systematically review two common LBP classification approaches that are in active use in clinical practice (McKenzie and STarT Back) and contrast them with how AI/ML tools have been used to date.
To do this, we considered the reliability, validity, and prognostic capacity of these classification systems, as well as their impact on patient outcomes (e.g. pain intensity and disability) and healthcare costs, as determined in RCTs.

Machine learning
Despite broad search terms, only 185 articles were identified after duplicate removal, with 64 assessed at the full-text stage (Fig. 1). The reasons for exclusion of AI/ML studies at the full-text stage are presented in Supplementary Table 1. A total of 48 studies were included in data extraction and qualitative synthesis (Fig. 1).
No studies have used AI/ML techniques to assess LBP prognosis of pre-defined sub-groups on pain and disability outcomes. However, nine studies assessed the prognosis of LBP based on input parameters 21,22,27,30,31,46,51,52,59 . Studies examined prognosis prediction using AI/ML techniques of: satisfaction after lumbar stenosis surgery 21 , recurrent lumbar disc herniation 22 , recovery from acute LBP 27,30 , recovery from CLBP 31 , poor outcomes following lumbar surgery 46,51 , successful outcomes from cognitive behavioural therapy 52 and recovery based on pain chart   30 , while the other study reported a sensitivity and specificity of 88% and 86%, respectively 46 . Four studies 38,48,65,66 assessed the ability of AI/ML approaches to, using existing data sets, diagnose nerve root compression, 'simple' LBP, spinal pathology and abnormal illness behaviour in LBP. These models achieved an accuracy of 82% and 90%, respectively 38,48,65,66 . Two studies aimed to predict vertebral pathologies with an accuracy of 90−92% 58,61 . Lastly, one study used a decision support system for LBP diagnosis with an accuracy of 73% 60 .
No prospective clinical trials have been performed using AI/ML tools for LBP treatment allocation. However, two studies 26,43 looked at treatment allocation pathways. One study looked at computer-assisted prediction of LBP treatment, but did not report any accuracy values or clearly state the number of treatment pathways 26 . The other study used 1288 fictional cases to train the model and a test sample of 45 real cases 43 . The highest accuracy reported for predicting appropriate treatment allocation was 72% 43 .
Five studies 35,36,39,45,56 did not clearly fit the classification, diagnosis, prognosis or treatment allocation categories. Two studies assessed the prediction of pain intensity in LBP based on pain intensity and skin resistance 45 and spinal motion data 56 . One study used an ANN to assess whether sleep actigraphy could determine daytime pain 36 . Another study used multivariate pattern analysis to predict neural adaptations based on psychosocial constructs 39 . Lastly, one study used an ANN with self-report and objective activity data to categorise acute and chronic LBP 35 .
An overview of risk of bias from the NOS is shown in Table 2.
Table 2 (columns: Selection, Comparability, Outcome, Total): Hallner et al. 27 , item scores 1, 1, 1, 1, 0, 0, 1, 1, 0, total 6/9. Jarvik et al. 30 , Jiang et al. 31 and Shamim et al. 46 are also listed. Higher scores indicate better quality. a Other: neither case−control nor cohort study design.
for disability 74,[83][84][85][86]89,93,94,96,97,102,105,108 , while two showed significant prognostic benefits on mixed pain intensity and disability analyses 80,81 . Of the multivariate models, two studies showed the STarT Back to predict prognosis for pain intensity adjusted for baseline pain 90,91 , while four showed no significant association 71,72,78,93 . Eight studies assessed prognosis for disability in multivariate models adjusted for baseline levels of disability, with six studies in favour 71,72,83,90,93,102 and two against 78,91 a significant association. Four clinical trials assessed the STarT Back for classification and treatment allocation and compared outcomes to standard care (Supplementary Table 5) 15,76,95,110 . Of these, two were non-randomised trials, one of which showed significant benefits of stratified care for pain and disability outcomes 95 , while the other only showed significant benefits for disability 110 . The two RCTs showed no significant effects of stratified care on pain intensity 15,76 , while one showed a significant effect for disability 15 . One RCT 15 and one non-randomised trial 110 assessed the cost effectiveness of stratified care when compared with standard care, with no significant differences observed.

McKenzie method
Overall, 29 studies were included within the McKenzie review (Supplementary Fig. 2). The reasons for exclusion of McKenzie studies at the full-text stage are presented in Supplementary Table 6.
Eight studies looked at the inter-tester reliability and classification ability of the McKenzie method (Supplementary Table 7) 113,115,121,122,[131][132][133]136 . Overall, seven studies assessed the reliability with a Kappa value range of 0.02−1.00 113,121,122,[131][132][133]136 . Only two of these studies had Kappa ranges >0.6; thus, five studies had poor to moderate agreement 140 . One study also showed that 31% of individuals were not able to be classified with the McKenzie method 115 . Validity of the McKenzie method as a classification system cannot be tested, as there is no gold standard comparator 141 .
Prognosis on pain intensity or disability based on McKenzie principles, such as directional preference, centralisation versus peripheralization and pain pattern classification, was assessed in 11 studies (Supplementary Table 8) 114,117,120,124,128,130,134,135,[137][138][139] . The duration of follow-up of these studies ranged from 2 weeks to 1 year. Four studies reported the follow-up as when the patient was discharged; however, they did not provide a timeframe 114,130,138,139 . Three studies showed that classification was a significant predictor of pain intensity in univariate models 114,135,139 , while one did not 117 . No studies aimed to assess the classification on pain intensity in a multivariate model when adjusted for baseline values. For disability, five studies showed no significant benefit of classification on prognosis 117,128,130,134,137 , while five showed a significant effect 114,120,124,138,139 . Only two studies assessed disability prognosis within multivariate models, with one showing significant 138 and one non-significant results 137 .
The search identified 11 clinical trials that used the McKenzie assessment and then provided treatment based on the individual's classification compared to another intervention or treatment (Supplementary Table 9) 111,112,116,118,119,123,[125][126][127]129,130 . The comparators in the trials consisted of standard physiotherapy 111 , chiropractic treatment 112 , back-care booklet 112 , back school 116 , motor control exercise 118,126 , endurance exercises 119 , first-line care 125 , manual therapy 127 , general advice 127 , intensive strengthening 129 and spinal manipulation therapy 130 . Five of 11 trials showed significant benefits for pain intensity, which favoured McKenzie treatment at the end of intervention 111,112,119,123,125 . For disability, four of 11 studies showed significant benefits favouring McKenzie treatment at the end of intervention 111,116,119,123 . Three studies 111,123,125 assessed McKenzie compared to standard care, with all studies showing significant results favouring McKenzie for pain intensity and two for disability 111,123 . Three studies 112,119,127 assessed McKenzie compared to advice or education, with two showing significant improvements in pain intensity 112,119 and one in disability 119 , favouring McKenzie. Compared to passive treatments, such as manual therapy or mobilisations, three studies showed no significant differences for pain intensity and disability 112,127,130 . Three studies compared McKenzie to active treatments, with no significant results for pain intensity or disability observed 118,126,129 . One study compared McKenzie to Back School, with significant results favouring McKenzie for disability but not pain intensity 116 . One study assessed costs with no differences observed between McKenzie therapy and standard chiropractic treatment 112 .

DISCUSSION
AI/ML are becoming more widely used in disease management and have the potential to impact LBP treatment 12 . This systematic review assessed the current status of these approaches in the management of LBP. In comparison to other classification approaches, the application of AI/ML methods to LBP is currently in its infancy. The results of our review show that machine-learning tools, such as ANNs and support vector machines, have been used to attempt binary classification (presence of LBP or not), recovery prediction and treatment allocation in LBP. The accuracy of models included in this study ranged from 61 to 100%. However, there are several important limitations in existing AI/ML research.
Study sample sizes used for AI/ML-based LBP classification or prognosis were typically small for machine-learning approaches, with 23 of 48 studies having a sample size <100, 22 of 48 studies with a sample size between 100 and 1000 and only 3 of 48 studies with a sample size >1000. Additionally, 19 of 48 studies used a small range of parameters (≤5 factors). This may be a limitation, given most AI/ML studies of non-specific LBP aimed to classify individuals using only physical factors, such as trunk range of motion, electromyography and sitting posture 20,23,24,28,29,32,37,[40][41][42]54,57 , omitting important psychosocial parameters that are known to be involved in patients with LBP. Only Darvishi et al. 25 and Parsaeian et al. 44 utilised a range of physical, psychological and social factors for the classification of LBP; however, they did not attempt sub-classification that delineates sub-groups that could benefit from specific treatments. LBP sub-classification is important as LBP, especially chronic (>12 weeks) LBP, is characterised by changes to a series of systems: biological, psychosocial and the central nervous systems and there are likely sub-groups within this population 142 . Notably, some studies applied many models to small CLBP data sets (n < 100) to yield highly accurate results; however, these focused only on binary classification, determining only the presence of CLBP 20,24,28,29,42 . In machine learning, normally, the sample size should be no less than 2^k cases (where k is the number of features), with a preference of 5 × 2^k 143 . Therefore, these studies may be prone to overfitting of data and the best fit model is likely not applicable to other LBP samples 144 . Overall, 25 studies within this review assessed the role of machine learning on classification of individuals with LBP.
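The sample-size rule of thumb cited above can be illustrated with a short sketch. It assumes the rule reads as a minimum of 2^k cases and a preferred 5 × 2^k cases for k features; the function names and the example values of n and k are hypothetical and are not taken from any included study.

```python
def minimum_cases(k: int) -> int:
    """Minimum sample size under the assumed 2^k rule for k features."""
    if k < 0:
        raise ValueError("number of features must be non-negative")
    return 2 ** k


def preferred_cases(k: int) -> int:
    """Preferred sample size under the assumed 5 x 2^k rule."""
    return 5 * minimum_cases(k)


def sample_size_check(n_cases: int, k: int) -> str:
    """Compare an available sample size against both thresholds."""
    if n_cases >= preferred_cases(k):
        return "meets preferred size"
    if n_cases >= minimum_cases(k):
        return "meets minimum only"
    return "below minimum"


if __name__ == "__main__":
    # Illustrative (n, k) pairs only.
    for n, k in [(90, 5), (90, 10), (1000, 5)]:
        print(n, k, minimum_cases(k), preferred_cases(k), sample_size_check(n, k))
```

Under this reading, a data set of fewer than 100 cases supports at most six features at the minimum threshold (2^6 = 64), which is consistent with the concern about overfitting when many models are fitted to small samples.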
To develop a robust subclassification tool, various conditions such as reliability, validity, accuracy, ease of implementation, treatment allocation yielding clinically meaningful benefits and reductions in healthcare costs should be met 145 . The current evidence for the use of AI/ML highlights that the utility of these approaches is yet to be realised in a clinically meaningful way.
For comparison, we also conducted systematic reviews of two other classification systems for back pain: the STarT Back tool (classifies people into low-, medium- and high-risk of developing chronic pain based on physical and psychosocial factors) 13 and the McKenzie method (diagnosing movement preferences; e.g. spinal extension versus flexion) 16 . The reliability (i.e. the consistency of the classification system over repeated attempts with the same patient) 146 of the McKenzie method was poor to moderate 113,115,121,122,[131][132][133]136 and moderate to excellent for the STarT Back tool 74,75,82,87,98,99,101,103,109 . The poor to moderate reliability limits the usefulness of the McKenzie method as a classification system for people with LBP, as it impairs the ability to identify a movement or structure that benefits from a specific treatment 141 . Construct validity (i.e. degree to which the measure reflects what it is trying to attain) 146 of the STarT Back tool ranged from weak to strong 68,71,74,75,79,82,87,98,103,109 and discriminative validity (i.e. the ability to discriminate between various groups of individuals or sub-groups) 146 was poor to excellent 13,14,68,69,73,82,88,100 . Three studies achieved poor discriminative validity for a singular subscale 14,88,100 , while all other values were above acceptable. Validity of the McKenzie method as a classification system has not and cannot be assessed, as there is no gold standard comparator 141 . Based on our findings from these two systematic reviews, if AI/ML is to make an impact on LBP management, it will likely need to develop greater reliability and validity compared to current approaches and advance sub-groups to improve clinical and societal outcomes through appropriate treatment allocation (Table 3).
In assessing the ability of a classification system to predict prognosis (i.e. the trajectory of a condition based on certain subgroup factors) of people with LBP, it is critical to account for the patients' pain and disability when they are first assessed, as these factors are the strongest and most consistent predictors of pain and disability in the months after LBP incidence [147][148][149][150] . The STarT Back tool was typically (in six 71,72,83,90,93,102 of eight 78,91 studies and 2080 of 2634 patients) able to predict future disability, but this was less consistent for pain intensity (two 90,91 of six 71,72,78,93 studies and 348 of 1899 patients). For the McKenzie method, no studies assessed the effectiveness of the classification method on future pain intensity while accounting for baseline values. For disability, two studies of McKenzie assessed prognosis within multivariate models, with mixed results (significant in one of two studies and 109 of 832 patients) 137,138 . The utility of the tool to effect overall improvements in patient outcomes has not been tested extensively for the STarT Back tool. One non-randomised trial showed significant benefits for pain intensity and disability when implementing the STarT Back compared to usual care (n = 582) 95 . Of the two RCTs, neither showed benefits of stratification on pain intensity (1324 patients); however, one showed significant improvement for disability compared to usual care (one of two studies and 568 of 1324 patients) 15,76 . The McKenzie method has been tested in 11 RCTs 111,112,116,118,119,123,[125][126][127]129,130 , but was not more effective than other active and passive treatment approaches.
To build on current machine-learning approaches, research should investigate the ability to create sub-groups of individuals with LBP that consider a broader range of biopsychosocial factors, similar to that of the STarT Back tool. The use of a broader range of clinical factors incorporated within an AI/ML approach using a large training data set may enable greater reliability, validity and prognostic capacity, and improved stratification of treatment for patients with LBP 9 . Such an approach may therefore lead to improved clinical outcomes for patients and reduced healthcare expenditure; however, this is yet to be determined. To date, only one study has aimed to employ this approach in LBP, and it used a narrow set of physical factors 43 . Oude et al. 43 used 1288 fictional cases to develop a model of self-referral in LBP, which was then applied to 45 real cases with a modest accuracy of 72%. Furthermore, the study did not assess if the model could lead to improved clinical outcomes and reduced healthcare costs 43 . A limitation of such approaches is that they fail to consider psychosocial and central nervous system factors that are associated with the condition, such as kinesiophobia 151 , pain catastrophizing 152 , pain beliefs 153 , pain self-efficacy 154 , depression 5 , anxiety 5 , occupational factors 155 , sensory changes 156 and structural and functional changes to the brain 157,158 . Including these factors may allow for specific subgroups to be identified that could benefit from targeted treatments to maximise clinical benefits. Future models that aim to classify treatment approaches need to consider these broader psychosocial and behavioural factors to enhance accuracy and clinical utility of the model.
The strengths of the current study include the use of broad search terms to identify all the relevant literature pertaining to the use of artificial intelligence in LBP. Even with these terms, we were only able to identify 185 articles for title/abstract screening. Furthermore, we completed two additional systematic reviews to contrast how machine learning could build on current classification approaches in LBP. As a limitation, meta-analysis of the clinical trials could not be performed owing to the low number of studies and the heterogeneity between them. Furthermore, we considered the overall interaction of the STarT Back classification tool (e.g. combination of all groups) when assessing the effectiveness of the intervention on pain, disability and costs. Some groups may have had significant effects, while others did not 15 .
Machine learning has the potential to improve the management of LBP via sub-classification of an otherwise homogenous diagnosis such as non-specific LBP. Identifying relevant subgroups among patients with LBP would permit the determination of diagnostic categories that inform clinical decision-making and treatment choice. This systematic review found that current machine-learning approaches are reported to have high accuracy; however, they are often applied to small data sets with multiple models. To determine the utility of such approaches in future research, studies implementing machine learning in LBP need to examine larger sample sizes, examine a variety of known risk factors across multiple domains (e.g. spinal tissue, psychosocial and central nervous system) in each model and attempt subclassification through data clustering within the model. The classification approaches need to be reliable, robust, evaluated, detect sub-groups with different prognosis and inform allocation of patients to treatment such that patient outcomes and/or healthcare costs are, overall, improved.
Ultimately, this kind of approach to sub-classification has the potential to drive improvements in the global health-related burden of disease.

Search strategy
These systematic reviews were prospectively registered with PROSPERO prior to beginning data extraction (as registration numbers are still pending, protocols were uploaded to the Open  Table 12
Inclusion and exclusion criteria
For inclusion, studies must have examined LBP and the utilisation of AI/ML techniques, the STarT Back or McKenzie method in humans. LBP was defined as pain localised below the costal margin and above the inferior gluteal folds 159 . No restrictions were included based on race, sex or age. Studies were required to be a full peer-reviewed journal or full conference publication (i.e. grey literature excluded). For AI/ML approaches in LBP, there was no restriction on study design, to ensure all research on this approach to date was identified. For STarT Back or McKenzie there was the inclusion criterion that the study must have examined: (a) reliability, (b) validity, (c) prognosis and/or (d) treatment effects (such as in a clinical trial). There was no restriction on study design as long as those topics were addressed. Exclusion criteria were: not peer reviewed or full conference abstract, not English language, not low-back pain, not AI/ML or STarT Back or McKenzie classification (e.g. if not clear individuals were assessed and treated via their profile) and not original research. AI/ML studies that did not evaluate the role of AI/ML in patient classification, prognosis or treatment (e.g. automated radiographic image analysis, automated pain diagram analysis) were excluded.

Data extraction
Data extracted included relevant publication information (i.e. author, title, year, journal), study design (e.g. cross sectional), study overview (free text), number of participants, type of LBP (e.g. acute, subacute, chronic, unclear) and summary of authors' conclusions (free text). For AI/ML articles further extraction acquired the AI/ML techniques implemented, parameters used as inputs, whether data were split into training and testing data sets and the main results (e.g. the highest sensitivity, specificity, accuracy and area under the curve that are available). For both the STarT Back and McKenzie reviews, additional data were extracted for reliability, validity, prognosis and treatment effects from subclassification (e.g. significant improvements to pain intensity, disability and healthcare costs). When it was not possible to extract the required data, this information was requested from the authors a minimum of three times over a 4-week period. Any discrepancies were discussed by the two independent assessors with disagreements addressed via an adjudicator (P.J.O.).
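The sensitivity, specificity and accuracy values extracted from AI/ML studies all derive from a binary confusion matrix on the testing data set. The sketch below shows one way to compute them; the class name and the example counts are hypothetical and do not come from any included study. The area under the curve is omitted because it requires per-case model scores rather than counts.

```python
from dataclasses import dataclass


def _ratio(numerator: int, denominator: int) -> float:
    """Divide two counts, rejecting an empty denominator."""
    if denominator == 0:
        raise ValueError("metric undefined: no cases in denominator")
    return numerator / denominator


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # LBP cases correctly classified as LBP
    fp: int  # non-LBP cases classified as LBP
    tn: int  # non-LBP cases correctly classified as non-LBP
    fn: int  # LBP cases classified as non-LBP

    def sensitivity(self) -> float:
        """Proportion of true LBP cases that were detected."""
        return _ratio(self.tp, self.tp + self.fn)

    def specificity(self) -> float:
        """Proportion of non-LBP cases correctly ruled out."""
        return _ratio(self.tn, self.tn + self.fp)

    def accuracy(self) -> float:
        """Proportion of all cases classified correctly."""
        total = self.tp + self.fp + self.tn + self.fn
        return _ratio(self.tp + self.tn, total)


if __name__ == "__main__":
    # Illustrative counts for a hypothetical test set of 100 people.
    cm = ConfusionMatrix(tp=40, fp=5, tn=45, fn=10)
    print(cm.sensitivity(), cm.specificity(), cm.accuracy())
```

Reporting all three values matters because accuracy alone can look high on an imbalanced test set even when sensitivity or specificity is poor.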
Definitions used in the systematic review
For studies of AI/ML in LBP, we considered the following categories: classification, sub-classification, prognosis, diagnosis and treatment allocation. Classification was considered as the ability to discriminate individuals with LBP from healthy populations, while sub-classification was defined as the ability to subgroup individuals with LBP based on different clinical characteristics (e.g. anatomical, psychological and nervous system alterations) 145 . Prognosis was considered the ability of clinical variables or an assessed sub-group to predict recovery or non-recovery (i.e. clinical course) of pain intensity or disability from LBP 160 . Diagnosis was defined as the ability to determine the cause of LBP, which could be based on anatomical, psychological and nervous system factors 161 . Treatment allocation was determined to be the prediction of a type of treatment that could benefit a certain individual with LBP 162 . Studies that did not clearly fit in these definitions were classed as 'other' studies.
S.D. Tagliaferri et al.
Prognosis prediction was considered 'adequate' when the classification approach resulted in statistically significant prediction of outcome after adjusting for baseline pain or disability in multivariate models [147][148][149][150] . Treatment effect was considered 'adequate' when the classification approach resulted in a statistically significant improvement in patient outcomes for pain or disability or healthcare costs in randomised or non-randomised clinical trials.
Cut-offs for reliability and validity
Internal consistency (i.e. the degree to which components of a measure are related) was considered acceptable if Cronbach's α values ranged from 0.7 to 0.9, while values ≥0.9 were considered strong 146 . Test−retest reliability (i.e. the consistency of the classification system over repeated attempts with the same patient) was considered acceptable at an intraclass correlation coefficient (ICC) of ≥0.7, whereas values ≥0.9 are considered acceptable for individuals; therefore, we considered these values as strong 146,163 . When Kappa scores for intra-rater (i.e. agreement of repeated measurements on the same patient) or inter-tester (i.e. the agreement of measurements between different clinicians) reliability were available, values were considered as poor agreement (0−0.2), slight agreement (0.21−0.40), moderate agreement (0.41−0.6), good agreement (0.61−0.8) and excellent agreement (0.81−1) 122 . As recommended for disability research, construct validity correlations (i.e. degree to which the measure reflects what it is trying to attain) 146 above 0.6 were considered as strong, 0.3−0.6 as moderate, and below 0.3 as weak 146,164 . Discriminative validity (i.e. the ability to discriminate between various groups of individuals or sub-groups) 146 followed principles set by Hill et al. 13 for the STarT Back with an area under the curve of 0.7−<0.8 indicating acceptable discrimination, 0.8−<0.9 indicating excellent discrimination and ≥0.9 indicating outstanding discrimination.
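The cut-offs in this section can be expressed as simple lookup functions, which may help when applying them consistently across extracted studies. This is a sketch only: the function names are hypothetical, values falling between stated bands (e.g. a Kappa of 0.205) are assigned to the lower band, Kappa values outside 0−1 are rejected, and the label "below acceptable" for values under 0.7 is an assumption because the text does not name that band.

```python
def kappa_agreement(kappa: float) -> str:
    """Agreement band for a Kappa score, using the bands stated in the text."""
    if not 0.0 <= kappa <= 1.0:
        # Assumption: the text only defines bands between 0 and 1.
        raise ValueError("kappa outside the 0-1 range covered by the cut-offs")
    bands = [(0.2, "poor"), (0.4, "slight"), (0.6, "moderate"), (0.8, "good")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "excellent"


def reliability_coefficient(value: float) -> str:
    """Band for Cronbach's alpha or an ICC (>=0.7 acceptable, >=0.9 strong)."""
    if value >= 0.9:
        return "strong"
    if value >= 0.7:
        return "acceptable"
    return "below acceptable"  # assumed label, not named in the text


def construct_validity(correlation: float) -> str:
    """Band for a construct validity correlation."""
    if correlation > 0.6:
        return "strong"
    if correlation >= 0.3:
        return "moderate"
    return "weak"


def auc_discrimination(auc: float) -> str:
    """Band for discriminative validity from the area under the curve."""
    if auc >= 0.9:
        return "outstanding"
    if auc >= 0.8:
        return "excellent"
    if auc >= 0.7:
        return "acceptable"
    return "below acceptable"  # assumed label, not named in the text


if __name__ == "__main__":
    print(kappa_agreement(0.02), kappa_agreement(1.0))
    print(reliability_coefficient(0.75), construct_validity(0.45))
    print(auc_discrimination(0.85))
```

Applied to the Kappa range of 0.02−1.00 reported for the McKenzie method, the two extremes fall in the poor and excellent bands, respectively.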
Risk of bias
Risk of bias was assessed by the Newcastle−Ottawa Scale (NOS: http://www.ohri.ca/programs/clinical_epidemiology/oxford.asp), which is recommended for quality assessment of case−control and cohort studies by the Cochrane Collaboration group 165 . The NOS is split into selection, comparability and ascertainment of exposure/outcome categories, with a maximum score of nine points awarded. Based on this, studies were determined to be good, fair or poor quality as previously determined 165 . The methodological quality was determined by two independent reviewers (S.D.T. and D.L.B.). Results were compared with disagreements discussed to reach a verdict, with adjudication by P.J.O. if necessary.

DATA AVAILABILITY
All data are available upon request.