Deep learning for early detection of pathological changes in X-ray bone microstructures: case of osteoarthritis

Texture features are designed to quantitatively evaluate patterns of spatial distribution of image pixels for purposes of image analysis and interpretation. Unexplained variations in the texture patterns often lead to misinterpretation and undesirable consequences in medical image analysis. In this paper we explore the ability of machine learning (ML) methods to design a radiology test of Osteoarthritis (OA) at early stage when the number of patients’ cases is small. In our experiments we use high-resolution X-ray images of knees in patients which were identified with Kellgren–Lawrence scores progressing from 1. The existing ML methods have provided a limited diagnostic accuracy, whilst the proposed Group Method of Data Handling strategy of Deep Learning has significantly extended the diagnostic test. The comparative experiments demonstrate that the proposed framework using the Zernike-based texture features has significantly improved the diagnostic accuracy on average by 11%. This allows us to conclude that the designed model for early diagnostic of OA will provide more accurate radiology tests, although new study is required when a large number of patients’ cases will be available.

Texture features are designed to quantitatively evaluate patterns of spatial distribution of image pixels for purposes of image analysis and interpretation. Unexplained variations in the texture patterns often lead to misinterpretation and undesirable consequences in medical image analysis. In this paper we explore the ability of machine learning (ML) methods to design a radiology test of Osteoarthritis (OA) at early stage when the number of patients' cases is small. In our experiments we use high-resolution X-ray images of knees in patients which were identified with Kellgren-Lawrence scores progressing from 1. The existing ML methods have provided a limited diagnostic accuracy, whilst the proposed Group Method of Data Handling strategy of Deep Learning has significantly extended the diagnostic test. The comparative experiments demonstrate that the proposed framework using the Zernike-based texture features has significantly improved the diagnostic accuracy on average by 11%. This allows us to conclude that the designed model for early diagnostic of OA will provide more accurate radiology tests, although new study is required when a large number of patients' cases will be available.
Osteoarthritis (OA) is the most common musculoskeletal condition and a major cause of disability in older adults. It is the fifth most important cause of disability in high income countries. It is estimated that in the UK OA causes loss of 200 disability-adjusted life years per 100,000 people. New methods for early diagnostics of OA are urgently needed in order to improve patient's treatment outcomes. Without reliable OA diagnostics new treatments cannot be developed and evaluated 1 .
Radiologists can identify the pathological changes associated with OA by analysing high-resolution knee X-ray images. Typically these changes cause the narrowing of joint space and development of bone spurs, leading to pain and impaired movement in patients. These pathologies are diagnosed in patients with developed symptoms such as joint pain [2][3][4] . However at early stages small pathological changes in bone microstructure can be evaluated by using a high-resolution technology such as MRI which is costly and not widely accessible. The use of advanced image analysis is expected to provide cost-efficient diagnostic solutions capable of delivering reliable estimates of pathological changes in bone microstructure [5][6][7][8][9][10] .
In many cases textures of patterns existing in images can be efficiently defined by spatial distributions of pixels 11,12 . Estimates of the pixel distribution enable patterns of interest to be represented quantitatively. In practice however the designed features can be incomplete, irrelevant, or have unexplained variations which affect the interpretation accuracy and reliability and so can cause undesirable consequences 13,14 . The desired texture features have to be capable of representing the main variations of the pixel distribution so as to explain structural changes in patterns of interest 12,15 . Specifically X-ray texture descriptors have been used for OA diagnostics [16][17][18][19] . However the texture-based features are not yet capable of delivering accurate, reliable, and reproducible diagnoses, because the results are influenced by X-ray technological conditions such as modality, exposure, blur, magnification, and projection angle. The diagnostic results are also influenced by the natural variations in the bone textures between patients of one group 20 . www.nature.com/scientificreports/ Image processing techniques such as Fourier and wavelet transforms have been also used for detecting OA in X-ray images 21 . In particular the combinations of radiological features have improved the detection accuracy, although the new features are difficult to interpret as markers. The diagnostic values of the combined features were empirically tested by using Fisher-score statistics 3 .
Another approach has combined Gray-level matrix with the 2-dimensional Gabor filter to find new texture features capable of improving the detection accuracy. The diagnostic values of the newly generated features have been also empirically estimated 15 .
The designed features are expected to provide a high diagnostic value when images are rotated and have different scales. The texture features which are invariant to the rotation can be designed by using Zernike orthogonal moments. The required invariance to the image scale has been obtained by using regular geometrical moments. The high-order Zernike moments have been used for solving image classification problems 22,23 .
The use of Zernike moments for early diagnostics of OA within a Machine Learning framework has been proposed in our previous work 24 . This framework has efficiently learnt new texture features from high-resolution knee X-ray images. Although the Zernike moments were computationally efficient to provide the required invariance, additional efforts are still required in order to find new radiological markers capable of improving the detection accuracy. However the existing Machine Learning methods are still limited in delivering reliable solutions 25,26 .
The accuracy of diagnostic methods is particularly dependent on sample size of patients' cases which have been collected and clinically verified for purposes of designing a diagnostic model. It is important to note that the sample size of patients' cases verified at the early stage is typically small, whilst the collection of a large amount of cases is expensive and resource demanding. The most accurate results are achieved when diagnostic models are designed with representation learning which extends experts' knowledge 27,28 . However in practice experts often cannot define an optimal framework within which a diagnostic model could be designed so as to provide a high accuracy. Representation learning allows experts to extend their knowledge by designing new model structures and features capable of increasing the diagnostic accuracy.
An early Deep Learning strategy is known as Group Method of Data Handling (GMDH) which has been proposed to learn models of "optimal" complexity required for achieving the maximal accuracy in prediction and classification 29,30 . Within the GMDH framework a multilayered model is designed from reference functions which have a small number of arguments. GMDH generates a new layer while the model's performance increases. Such a strategy iteratively grows model connectivity and makes the GMDH strategy particularly efficient for designing models on a small data set. The GMDH-type neural networks have been efficiently used for finding solutions to the medical problems [31][32][33] . The GMDH framework has been successfully used in our previous work on EEG classification 34 as well as on learning EEG features for biometric identification 35 .
In this study we aim to extend the experimental evidences that the proposed GMDH framework significantly outperforms the Machine Learning techniques in terms of the accuracy of detecting the pathological changes in the bone microstructure which are related to OA at early stages. The evidences are provided by the experiments with the texture features and the techniques such as Random Forest (RF), Support Vector Machines (SVM), and Artificial Neural Networks (ANN). The comparison is made with the techniques whose parameters were experimentally optimised. The comparative experiments were run with the Haralick features and Zernike moments. As the number of patients' cases was small in our study, the performances were compared within the leave-oneout cross-validation. We also discuss limitations of our study and finally draw a conclusion that the proposed GMDH framework will improve the accuracy of radiology opinions in similar cases of early diagnostics of OA.
The novelty in our study is outlined as follows. The new radiological markers based on the high-order Zernike moments used within the proposed GMDH framework provide a high sensitivity to the pathological changes related to OA at the early stage when the number of patients' cases is 40, including 20 healthy and 20 patients identified at risk of OA.
The rest of the paper is structured as follows. The proposed methodology of image representation and learning of GMDH-type neural networks is described in the section Methods. This section also outlines the secondary data used in our study for experimental validation of the proposed method. The sections Results and Discussion present the outcomes obtained on the data and finally outline the main conclusions that can be drawn from the study.

Results
In this section we describe the main results obtained with different texture features and Machine Learning methods on the high-resolution X-ray images which are outlined in the section Data. The experiments were run with Haralick features and Zernike moments described in the section Methods.
The experiments on the X-ray data were conducted with Machine Learning techniques such as RF, SVM, ANN, and the proposed GMDH-type network. RF is a bootstrap-based method which can significantly increase the accuracy by using e.g. classification trees 36 . The performances of the above techniques were optimised, as described in the section Methods.
The main results are shown in Table 1. We can see that the Zernike moments provide a better accuracy than Haralick features for all techniques used in our study. Note that the confidence cannot be calculated within the leave-one-out cross-validation used in our study on the small data. The GMDH-type network has outperformed the RF, SVM, and ANN for both types of texture features. In comparison with the two-sample Kolmogorov-Smirnov (KS) statistic test used in the section Data, the diagnostic accuracies have been improved from 72.5 to 85% for the Lateral images and from 67.5 to 77.5% for the Medial images.
In practice features used for designing a diagnostic model often make unequal contributions to the diagnostic problem. The analysis of feature importance provides new insights into the diagnostic solution. For example a new diagnostic model could be designed without a feature making a weak contribution. For models such as RF and GMDH, the importance of a feature could be defined as a frequency of its use in the designed model. Figure 1 shows the importance of the Zernike moments used in the RF (in Grey) and proposed GMDH-type network (in Black) over the 81 moments calculated for order 16, which are arranged by the increasing order. We can seen that the GMDH network has employed the moments of a higher order more frequently than the RF. Thus a conclusion can be drawn that the Zernike moments of a higher order are more informative for detection of OA at early stages.

Discussion
Texture features are used for representation of image patterns observed as spacial distributions of pixels. Unexplained variations in the texture features lead to misinterpretation and undesirable consequences. In practice, texture features can largely vary, which makes the design of informative radiological markers problematic.
The above problems significantly affect the accuracy of early diagnostics when pathological changes in patients cannot be reliably detected within standard examinations and the amount of data which is required for designing a diagnostic solution is limited. In our previous work 24 a solution to this problem has been suggested within a GMDH framework using texture features. The proposed solution is compared with the other Machine Learning techniques on on the patient's cases which have been retrospectively verified at risk of OA in the high-resolution X-ray images of knees made available in the national study 37,38 . Early diagnostic cases of OA are difficult to clinically verify because the related pathological changes cannot be reliably detected by the standard radiology examination based on the Kellgren Lawrence Scoring 39 .
Our previous work has been extended in part of comparison with the Machine Learning techniques optimised during the experiments. The comparative experiments were run on the X-ray images which represent the patients' cases including new cases of OA at the early stage. The number of the patient's cases has been increased to 40 which is still small. The leave-one-out cross-validation used in the comparative experiments has been used for evaluating the diagnostic accuracies. The estimation of confidence intervals requires larger data sets and so was out of our research scope. Thus the main findings and insights into the early diagnostic problem are limited and cannot be directly extended to similar cases.
The first observation in our experiments was that the KS statistic test applied to the images in the Control and Case groups cannot provide a high diagnostic accuracy. Having undertaken a GMDH-based approach we have explored the texture features based on Zernike orthogonal moments which are computationally efficient and invariant to the image rotation and scaling.
The second observation was that the texture features based on the Zernike moments do not make equally important contributions to the diagnostic problem. The new radiologic features which were learnt from the small amount of patients' data within the proposed GMDH framework have improved the diagnostic accuracy at the early stages. This is because the GMDH can iteratively increase the complexity of network connectivity at each new layer while the network performance increases. In each layer the new features are generated and those which provide best fit are selected. www.nature.com/scientificreports/ In our experiments with the texture features we used different Machine Learning techniques. The experimental results show that the new radiologic markers based on the Zernike moments, which were learnt within the proposed GMDH framework, have significantly improved the diagnostic accuracy. In comparison with the baseline KS-based discrimination the diagnostic accuracy has been improved on average by 11%. Thus we conclude that the proposed method can efficiently learn new texture features from high-resolution X-ray images, which are capable of improving the diagnostic accuracy of OA at early stages.

Methods
Data. In our experiments we used the data collected in the national study 37,38 which has been conducted according to the approved ethical regulations and study protocols. The data include the high resolution X-Ray images of knees in subjects. The subjects involved in the study were healthy or diagnosed at risk of OA. The samples of X-ray images as well as details of X-ray imaging are available from the study coordinators 37,38 . The referenced study particularly provides the instrumental information on X-ray imaging, such as: DX modality, Dexela Detector, 75 µm , 3072 × 1944 pixel resolution, 14 Gray-level bits.
The patients' cases were retrospectively analysed by experts to identify the cases when the pathological changes had not yet developed and their radiology examinations based on Kellgren-Lawrence (KL) scoring 39 were estimated at grade 1. Such cases have been further investigated in order to identify those with developed OA confirmed by the KL examination at a grade higher than or equal to 2. As the radiology study has been conducted within a given period and a defined population, the number of the patients identified to be with the progressing KL grades was small.
The Case (pathology) group in our study includes 20 subjects at risk of OA at the early stage. The Control group includes 20 subjects randomly selected from the healthy population. The images of both groups were taken from the Lateral and Medial compartments of the tibia bone. The Region of Interests (ROIs) required for our experiments have been automatically selected by using an entropy-based method 40 . Other image segmentation methods [41][42][43] are capable of providing ROIs selected from high-resolution knee X-ray images. The average size of the ROI images was 150 by 150 pixels. The ROIs which were used in our experiments are available at https ://doi. org/10.6084/m9.figsh are.83039 96. Figure 2 shows the distributions of the brightness values in the Lateral (left) and Medial (right) images of the Case and Control groups shown in Red and Blue, respectively. Table 2 shows the mean value and standard deviation σ of the pixel brightness over the Control and Case groups. We can see that the mean brightness for the Case image group is higher than that of the control group in the both Lateral and Medial regions. This observation can be explained by an increasing bone density in patients of the Case group 16 .
Taking into account the above finding, we could hypothesise that the density distributions of pixel brightness in the images of Control and Case groups are significantly different. To test this hypothesis let us define the brightness distributions of the images of Control and Case groups, C 0 and C 1 , respectively. Then having a distribution of an image, C x , we can use the two-sample KS statistic to test the hypothesis H 0 , that the samples i x are drawn from either the population C 0 or population C 1 . The p-values, p 0 and p 1 , are calculated for the Control and Case groups to verify the hypothesis H 0 at a given significance level α = 0.05 . The hypothesis H 0 is rejected  www.nature.com/scientificreports/ if both p 0 and p 1 are smaller than α . In such a case the given image cannot be recognised between the groups at the given significance level. Let the KS statistics D 0 and D 1 be the maximal distances between the empirical distribution functions estimated for an image I x and the Control and Case populations C 0 and C 1 , respectively. Then we can assign the image I x to a group with the minimal distance min(D 0 , D 1 ).
The above algorithm is used in our study as the baseline diagnostic rule. Table 3 shows the accuracy obtained with this baseline KS-discrimination on the Lateral and Medial types of images. The rejection rate shown in this table is calculated as a percentage of the images for which the hypothesis H 0 was rejected. At the given significance level the diagnostic results were rejected for 62.5% of the Lateral and 70.0% of the Medial images.
Texture features. Haralick features 11 are based on Grey tone spatial dependencies (the Grey level co-occurrence matrix) which allow for extraction of statistical characteristics between pairs of pixels in the image. These characteristics are computed along a direction and a distance between pixel pairs. The co-occurrence matrix includes the contrast and entropy calculated for the image. The contrast estimates the difference between Greylevel values of pixels, while the entropy measures the randomness or homogeneity of the pixel distribution over the image coordinates.
Following the work 23  where m pq is the moment of ( p + q)th order.
The above moments m pq are not orthogonal and so cannot be calculated efficiently. Zernike polynomials which are orthogonal provide efficient computation of the features. Zernike moments are calculated as the orthogonal complex polynomials V nm (x, y) which can be rewritten within the unit circle x 2 + y 2 ≤ 1 as: where n >= 0 are positive integers, m are positive or negative integers subject to the constraints: |m| ≤ n and n − |m| are even, ρ is the length of a vector to a (x, y) pixel, and θ is the angle between the x-axis and the vector ρ.
The above R xy are the radial polynomials which are defined as: Zernike moments A nm are the projections of an image f(x, y) onto the orthogonal polynomials R nm . The Zernike moment of order n with repetition m is Additional details of the Haralick features and Zernike moments used in our study are provided in the section Experimental Settings. GMDH-type neural networks. The GMDH-type Deep Learning adopted in our study is based on heuristic-based optimisation of multilayer neural networks with a polynomial activation function. GMDH generates new features and grows up a network connectivity on given training data 29 . The heuristic optimisation is based on so-called "external" criteria aimed to select networks which are fitted best to the training data, This allows GMDH to efficiently prevent networks from overfitting, which affects the network ability to generalise and predict unseen data. New features are generated in each new layer by combining the input variables and the outputs of neurons from the previous layers. The number of layers grows while the network performance increases. When the network connectivity becomes "optimal", the generalisation ability reaches the maximum value which can be achieved within the defined activation function and selection criteria.
In GMDH-type networks activation functions are defined as short-term polynomials which can be linear or non-linear, y = g(x; w) , where x is the input vector, and w is the coefficient vector. Using a given activation function g(x; w), the GMDH generates a new feature vector x = (x i 1 , x i 2 ) for K neurons at the layer r = 1 , with indexes i 1 = i 2 , i 1 = 1, . . . , m, and where K = m(m − 1)/2 is the number of pairwise combinations for the given m input variables. The outputs of neurons at the rth layer, y i , are written as (1) m pq = x y x p y q f (x, y), (2) V nm (x, y) = V nm (ρ, θ) = R nm (ρ)e (−jmθ) ,  Having the outputs y i on the entire data X, the regularisation errors can be calculated as follows: The errors are sorted in ascending order, and then the first F neurons with the lowest errors are selected for the next layer. The neurons with correlated outputs are excluded from the selection y i 1 , . . . , y i F . and so the number of selected neurons at the rth layer can be F r : 1 ≤ F r ≤ F . The neurons at the next layers r + 1 are generated by applying the activation function g to extended data Z (r) = [Y (r) | X] , which include the outputs of the selected neurons Y (r) = [y i 1 , . . . , y i Fr ] and the input data X. Thus the matrix Z (r) contains m r = F r + m columns.
Similarly, the coefficient vectors ŵ (r+1) are estimated for the i 1 , i 2 columns of the matrix Z (r) . The outputs of neurons y (r+1) are then calculated as follows: where i 1 = i 2 ; i 1 = 1, . . . , m r , i 2 = 1, . . . , m r , and K r = m r (m r − 1)/2 are the number of pairwise combinations for the m r columns.
GMDH generates the new layers and the network grows while the number F r exceeds a given threshold. The pseudocode of the described GMDH algorithm is outlined below.
Learning of GMDH-type neural networks. The main steps of learning a GMDH-type network are represented by Algorithm 1. The Algorithm defines the training data [X |ẙ] and the number of neurons F (or "freedom" of choice) to be selected for new layers. The first layer r = 1 is generated and then neurons which are best fitted to the data are selected for the next layer. The algorithm stops if the number of the selected neurons, F r , becomes less than the given threshold F 0 .
The procedure InitiateNet assigns the network parameters Rn, Ln, In, Wn, and Zn. Here Rn are the neuron indexes r, Ln are the neurons errors, In are the inputs i 1 , i 2 of neurons, Wn are the coefficient vectors of neurons, and Zn are the outputs of the selected neurons.
In GenerateFirstLayer the algorithm generates the first layer of neurons with the inputs x i 1 , x i 2 . The neurons are adjusted to the data and the network parameters are updated in part of , ŵ , and y.
The procedure GenerateNewLayer defines the indexes of neurons at the previous layer. Then the number F r and the indexes A 1 of the selected neurons are defined. For the given inputs z i 1 , z i 2 the new neurons are generated according to the procedure GenerateFirstLayer.
In SelectBest the algorithm generates the neurons at the given layer r. The generated neurons are sorted out by the regularisation errors stored in Ln in the ascending order. The first F r neurons with the lowest errors are selected for the new layers. The selection does not include the neurons with the correlated outputs. The procedure returns false if the number of the selected neurons is below the threshold F 0 . Otherwise the procedure returns true to further grow the GMDH-type network.
The procedure UpdateNet fits the coefficient vector ŵ to the given input U. The regularisation error is evaluated for the fitted neurons and then the network parameters Net are saved for the next layers.
The GMDH algorithm generates the new layers whilst the network performance is increased by a defined value. Otherwise the algorithm stops and the grown network is represented by the matrix Net. on their parameters which have to be optimised in a "try-and-see" way during experiments on given data 44 . The diagnostic performance was estimated using the leave-one-out cross-validation method which is typically used when sample size is small 45 .
It is important to note that standard Principal Component Analysis cannot be efficiently used for data dimensionality reduction when the data sets are small. This is because the principal components calculated on a small data set are subject to large variability 46 .
In our comparative experiments the maximal performances of the RF, ANN, and SVM have been achieved with the following settings.
1. The best RF was with the number of classification trees = 200 , the minimal number of samples at terminals = 3 , attribute rate = 0.8 , and sample rate = 0.7 2. The best ANN was with the number of hidden neurons = {7, 12} , the training metod=Levenberg-Marquardt, the learning rate = {0.2, 0.6} , early stopping rule, activation = logsig 3. The best SVM was with a Radial Basis Function kernel and a gamma optimised on the cross-validation.
The maximal performance of the GMDH has been experimentally found with the following parameters.
1. The freedom of choice F = m , where m is the maximal number of texture features 2. The minimal number of neurons F 0 = 2 3. The activation function y = α 0 + α 1 x 1 + α 2 x 2 + α 3 x 1 x 2 4. The size of training data was 31, and 8 data samples were used for the external estimation in each of the 40 rounds of the leave-one-out cross-validation.
The Haralick features were used with the three textural parameters extracted from the grey level co-occurrence matrices 11,47 . Table 4 shows the Haralick texture features used in our experiments.
The Zernike moments were calculated for order 16 so as to generate 81 features. In our experiments the real part of the moments were taken as the texture features.

Regulation statements
The authors make the following statements: The knee X-Ray images selected for our experiments have been collected in the national study 37,38 conducted according to the approved ethical regulations and study protocols. The study was reviewed and approved by the National Committee for Data Protection (Comissão Nacional de Proteção de Dados) and by the NOVA Medical School Ethics Committee. Ethical Committees of Regional Health Authorities (ARS) also reviewed and approved the study.
All methods used in our study were performed in accordance with the relevant guidelines and regulations. The experiments described in our paper have been conducted on knee X-Ray images of participants provided informed consent within the above mentioned national study.