Introduction

The k-nearest neighbour (KNN) algorithm is a supervised machine learning algorithm used predominantly for classification and has been applied widely to disease prediction1. Given an unlabelled query, KNN predicts its class from the features and labels of the training data2. It does so by identifying the k training data points (neighbours) closest to the query and then applying a majority voting rule over their classes to decide the final classification. Among machine learning algorithms, KNN is one of the simplest and is widely used in classification tasks because of its adaptable and easy-to-understand design3. The algorithm is known for handling regression and classification problems over data of different sizes, label numbers, noise levels, ranges, and contexts4. This paper therefore studies the algorithm in the context of classifying medical datasets, since disease prediction is a real-world challenge, and it is compelling to examine how well KNN and its variants can support it.

The algorithm is simple in its workings and calculations, which leaves room for modification in various aspects to reduce its limitations, increase its accuracy, and broaden its applicability to a wider variety of datasets. The classic KNN algorithm suffers from several limitations that weaken its classification performance, such as treating all k neighbours equally regardless of distance, offering limited options for calculating distances between data points, and taking unnecessary dataset features into account5. However, because KNN is open to numerous modifications, different KNN forms or variants have emerged. These variants differ in various algorithmic aspects, such as optimising the k parameter, improving distance calculations, weighting data points, and truncating training datasets, in order to resolve the challenges mentioned earlier6.

Among the wide variety of research papers proposing different variants, the lion's share focuses on selecting optimal k values. Wettchereck et al.7 and Sun and Huang8 proposed an algorithm called the adaptive KNN. In their approach, the training dataset is first processed to find a k value for each training data point within a limited range. A testing data point then obtains the k value of its closest training data point and is classified with the classic procedure using this inherited parameter. While they proposed a broadly applicable formula for finding the k value, Pan et al.9 proposed a further variation of the adaptive KNN, the locally adaptive KNN based on the discrimination class. This locally adaptive approach considers multiple ranking methodologies, and the authors state that it reduces the limitation of only considering the majority classes by also taking the minority classes into account and calculating an optimal k value through multiple probabilistic ranking formulae. There are also non-parametric KNN variants that address the k value problem by removing the need to choose it, such as the algorithm proposed by Cherif et al.10. Their algorithm combines two approaches: the k-means algorithm first truncates the dataset into cluster points, and the classic KNN algorithm then finds the one nearest neighbour for the final classification. Another variant that combines multiple runs to remove the need for an optimal k value is the one proposed by Hassanat et al.11, in which the authors detail an ensemble approach that eliminates the k parameter by performing iterative classifications using k values over a limited range.

Beyond the variants that focus on finding optimal k values, others focus on different internal aspects to improve accuracy. The KNN variants introduced by Han et al.12 and Yigit13 focus on weight attribution: their algorithms weight each nearest neighbour according to its distance and class frequency. This weighting reduces the limitation of treating all k nearest neighbours equally and increases the chances of the algorithm predicting the correct final classification. Another weight-oriented variant is the weighted mutual KNN algorithm proposed by Dhar et al.14. It truncates the training dataset to mutual sets only and classifies testing data from the nearest mutual neighbours, which helps remove noise from the training dataset and adds weight attributes for the final classification. Keller et al.15, on the other hand, introduce an additional mathematical construct into the classic KNN algorithm, known as fuzzy sets. Their proposed algorithm, fuzzy KNN, focuses on membership assignment; the membership assignments are a different form of weight attribution that calculates the probabilistic chance of each neighbour class becoming the final classification. Alkasassbeh et al.16 described another type of KNN variant that centres on the distance metric itself. Their paper uses a new distance metric called the Hassanat distance, which proves more effective at classifying datasets than the traditional Euclidean and Manhattan distances. Another distance-focused variant is the one proposed by Gou et al.17. Unlike the previous paper, they do not propose a new distance metric but an algorithm that enhances the output of whichever distance formula is used. The paper proposes generalised mean distance calculations and local mean vector creation for the nearest neighbours of each class; the algorithm is said to remove the limitation of ignoring weight attribution and to enhance accuracy through its local mean vector calculations.

As discussed above, while most KNN variants focus on finding the optimal k values, other variants emphasise the overall classification accuracy. Each variant has its unique design and rationale, and each revealed the best performance in the study that first introduced it to the literature. However, no previous research has compared the performance of different KNN variants. To fill this gap, a comparative performance analysis of these variants using the same datasets and experimental setup is required. By considering the parameter values leading to the best performance for each variant, this study used eight different disease datasets to compare the performance of 10 KNN variants.

K-nearest neighbour algorithm and its different variants

The classic KNN algorithm

The classic KNN algorithm is a supervised machine learning algorithm that is predominantly used for classification purposes18. The algorithm has a single parameter, k, which specifies the number of 'nearest neighbours'. For a given query, the algorithm finds the k training data points closest to the query according to a distance measure. It then applies a majority voting rule over these neighbours, and the class that appears most frequently is taken as the final classification for the query.
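A minimal sketch of this procedure in Python, assuming NumPy feature arrays, integer class labels and Euclidean distance (the function and variable names are illustrative only, not part of the original studies):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k nearest neighbours
    nearest = np.argsort(distances)[:k]
    # Majority voting rule over the neighbours' labels
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy example
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([3.9, 4.1]), k=3))  # -> 1
```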

Figure 1 illustrates an example. For Query B, k is 3, so the algorithm searches for the 3 nearest neighbours and finds that two are of class 1 and one is of class 0; the majority voting rule therefore classifies the query as class 1. Similarly, for Query A, k is 5, and since a greater number of its neighbours belong to class 0, the query is classified as class 0.

Figure 1. Visual illustration of the KNN algorithm.

KNN variants considered in this study

Adaptive KNN (A-KNN)

The adaptive KNN algorithm is a variant that focuses on selecting the optimal k value for a testing data point7,8. A separate procedure first determines the optimal k value for each data point of the training dataset. For a given testing data point, the main algorithm then finds its nearest neighbour in the training dataset and inherits that neighbour's k value. The variant then proceeds as the classic KNN algorithm, predicting the output using this inherited k value.
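The sketch below illustrates one plausible reading of this idea; the rule used here (the smallest candidate k that classifies a training point correctly) and all names are assumptions for illustration, not the exact formulation of refs. 7 and 8:

```python
import numpy as np
from collections import Counter

def assign_training_k(X, y, k_candidates=(1, 3, 5, 7, 9)):
    # Assign each training point its own k; default to the largest candidate
    ks = np.full(len(X), k_candidates[-1])
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                       # exclude the point itself
        order = np.argsort(dists)
        for k in k_candidates:                  # smallest k that classifies i correctly
            if Counter(y[order[:k]]).most_common(1)[0][0] == y[i]:
                ks[i] = k
                break
    return ks

def aknn_predict(X, y, ks, query):
    dists = np.linalg.norm(X - query, axis=1)
    k = ks[np.argmin(dists)]                    # inherit k from the nearest training point
    nearest = np.argsort(dists)[:k]
    return Counter(y[nearest]).most_common(1)[0][0]
```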

Locally adaptive KNN with Discrimination class (LA-KNN)

This variant uses information from discrimination classes to determine the optimal k value. The discrimination class concept considers the quantity and distribution of the majority class neighbours and the second majority class neighbours in the k-neighbourhood of a given testing data point9. The algorithm follows several steps to define the discrimination classes. After selecting one of them, it forms a ranking table containing different k values, distances from centroids and their ratio, and then follows a ranking process over the table to output the optimal k value.

Fuzzy KNN (F-KNN)

The fuzzy KNN algorithm revolves around the principle of membership assignment15. Like the classic KNN algorithm, this variant finds the k nearest neighbours of a testing data point in the training dataset. It then assigns a "membership" value to each class found among the k nearest neighbours. The membership values are calculated using a fuzzy-set formulation that weights each class, and the class with the highest membership is selected as the classification result.
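A hedged sketch of the membership computation, assuming crisp class memberships for the neighbours and the common inverse-distance weighting with a fuzzifier parameter m; the details may differ from the exact formulation in ref. 15:

```python
import numpy as np

def fuzzy_knn_predict(X_train, y_train, query, k=5, m=2.0, eps=1e-12):
    dists = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(dists)[:k]
    # Inverse-distance weights controlled by the fuzzifier m
    weights = 1.0 / (dists[nearest] ** (2.0 / (m - 1.0)) + eps)
    memberships = {}
    for cls in np.unique(y_train):
        in_class = (y_train[nearest] == cls).astype(float)
        memberships[cls] = np.sum(in_class * weights) / np.sum(weights)
    # The class with the highest membership value wins
    return max(memberships, key=memberships.get)
```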

K-means clustering-based KNN (KM-KNN)

The clustering-based KNN variant involves the combination of two popular algorithms: k-means and 1NN10. This variant uses the k-means algorithm to cluster the training dataset according to a preset variable (number of clusters). It then calculates the centroids of each cluster, thus making a new training dataset that contains the centroids of all the clusters. The 1NN algorithm is performed on this new training dataset, where the single nearest neighbour is taken for classification.
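A brief sketch of this combination, using scikit-learn's KMeans and assuming that each centroid is labelled by the majority class of its cluster (that labelling rule is an assumption of this illustration, not stated in ref. 10):

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def kmknn_fit(X_train, y_train, n_clusters=10, random_state=0):
    # Cluster the training data and label each centroid by its cluster's majority class
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X_train)
    centroid_labels = np.array([
        Counter(y_train[km.labels_ == c]).most_common(1)[0][0]
        for c in range(n_clusters)
    ])
    return km.cluster_centers_, centroid_labels

def kmknn_predict(centroids, centroid_labels, query):
    # 1NN over the cluster centroids
    return centroid_labels[np.argmin(np.linalg.norm(centroids - query, axis=1))]
```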

Weight adjusted KNN (W-KNN)

This version of the KNN algorithm focuses on applying attribute weighting. The algorithm first assigns a weight to each of the training data points by using a function known as the kernel function12. This weight assignment gives more weight to nearer points and less weight to faraway points; any function that decreases in value as the distance increases can be used as a kernel function. The weighted frequency of the nearest neighbours is then used to predict the output class of a given testing data point. For a multiattribute dataset, this KNN variant considers the classification importance of different attributes when defining the kernel function.
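A simplified sketch of the distance-weighted voting step, using a Gaussian kernel as one example of a decreasing function of distance; the attribute-weighting component of ref. 12 is omitted here, so this is an illustration of the weighting idea rather than the full algorithm:

```python
import numpy as np

def wknn_predict(X_train, y_train, query, k=5, bandwidth=1.0):
    dists = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(dists)[:k]
    # Gaussian kernel: nearer neighbours receive larger weights
    weights = np.exp(-(dists[nearest] ** 2) / (2.0 * bandwidth ** 2))
    scores = {}
    for cls in np.unique(y_train[nearest]):
        scores[cls] = weights[y_train[nearest] == cls].sum()
    # Class with the largest accumulated weight wins
    return max(scores, key=scores.get)
```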

Hassanat distance KNN (H-KNN)

The Hassanat KNN algorithm is a variant whose focal point is the distance measurement formula. It follows the simple design of the KNN algorithm but proposes an advanced way to compute the distance between two data points16. The new formula, called the Hassanat distance, is built on per-dimension maximum and minimum values, playing a role similar to weight attribution in other variants. The variant uses this metric to find the nearest neighbours of a testing query and then performs the majority voting rule, as in the classic KNN algorithm.
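A sketch using the commonly cited per-dimension form of the Hassanat distance (one minus a ratio of shifted minima and maxima, with a correction when the minimum is negative); the exact definition should be verified against ref. 16 before relying on it:

```python
import numpy as np
from collections import Counter

def hassanat_distance(a, b):
    lo, hi = np.minimum(a, b), np.maximum(a, b)
    # Per-dimension distance, with a shift when the minimum value is negative
    d = np.where(lo >= 0,
                 1.0 - (1.0 + lo) / (1.0 + hi),
                 1.0 - (1.0 + lo + np.abs(lo)) / (1.0 + hi + np.abs(lo)))
    return d.sum()

def hknn_predict(X_train, y_train, query, k=5):
    dists = np.array([hassanat_distance(x, query) for x in X_train])
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]
```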

Generalised mean distance KNN (GMD-KNN)

The generalised mean distance KNN, or GMD-KNN, is a variant built on local mean vector creation and repeated generalised mean distance calculations17. The algorithm first stores a sorted list of the k nearest neighbours for each class. Each list is converted into local mean vectors, from which multiple iterative mean distance calculations produce a final distance value between each class and the testing query. The class with the smallest distance is deemed the prediction for the testing query.
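A heavily simplified sketch of this idea, using cumulative local mean vectors per class and a power mean of the resulting distances; it illustrates the principle rather than reproducing the nested multi-stage formulation of ref. 17, and the power-mean parameter p is an assumption of this sketch:

```python
import numpy as np

def gmd_knn_predict(X_train, y_train, query, k=5, p=-1.0):
    class_scores = {}
    for cls in np.unique(y_train):
        Xc = X_train[y_train == cls]
        order = np.argsort(np.linalg.norm(Xc - query, axis=1))[:k]
        neighbours = Xc[order]
        # Cumulative local mean vectors of the 1..k nearest same-class points
        counts = np.arange(1, len(neighbours) + 1)[:, None]
        local_means = np.cumsum(neighbours, axis=0) / counts
        dists = np.linalg.norm(local_means - query, axis=1) + 1e-12
        # Generalised (power) mean of the distances; p = -1 gives the harmonic mean
        class_scores[cls] = np.mean(dists ** p) ** (1.0 / p)
    # The class with the smallest combined distance wins
    return min(class_scores, key=class_scores.get)
```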

Mutual KNN (M-KNN)

The mutual KNN algorithm focuses on the principle of mutual neighbours14. The algorithm first transforms the training dataset by removing points that share no mutual k nearest neighbours with other points, creating a truncated training dataset with less noise and fewer anomalies. For a testing data point, the algorithm finds its k nearest neighbours in the training dataset and also the k nearest neighbours of those neighbours. This allows it to determine the mutual nearest neighbours, which are assessed as candidates for classification, and the testing data point is classified using the majority voting rule.
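A rough sketch of the pruning-plus-voting idea, in which training points lacking any mutual k-nearest-neighbour relation are dropped; the exact mutuality and voting rules of ref. 14 may differ from this illustration:

```python
import numpy as np
from collections import Counter

def knn_indices(X, i, k):
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                                  # exclude the point itself
    return set(np.argsort(d)[:k])

def mknn_prune(X_train, y_train, k=5):
    neigh = [knn_indices(X_train, i, k) for i in range(len(X_train))]
    # Keep a point only if at least one of its neighbours lists it back (mutual relation)
    keep = [i for i in range(len(X_train))
            if any(i in neigh[j] for j in neigh[i])]
    return X_train[keep], y_train[keep]

def mknn_predict(X_pruned, y_pruned, query, k=5):
    nearest = np.argsort(np.linalg.norm(X_pruned - query, axis=1))[:k]
    return Counter(y_pruned[nearest]).most_common(1)[0][0]
```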

Ensemble approach KNN (EA-KNN)

The KNN variant EA-KNN is based on an ensemble approach to remove the problem of having a fixed “k” parameter for classification. This algorithm works by using a Kmax value of \(\sqrt n\), with n being the size of the training dataset, to find the k-nearest neighbours of a testing query11. It then sorts the list of nearest neighbours according to the distance and performs weight summation operations on it. The weight summation operations are performed by iteratively adding an inverse logarithm function for “k” values starting from 1 to Kmax in increments of 2. The class with the largest weight summation is then deemed to be the predicted classification for the testing query.
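A hedged sketch of the ensemble-over-k idea described above, accumulating inverse-logarithm rank weights for k = 1, 3, 5, ..., Kmax with Kmax = √n; the exact weight function in ref. 11 may differ, so this only illustrates the overall structure:

```python
import numpy as np

def eaknn_predict(X_train, y_train, query):
    n = len(X_train)
    k_max = max(1, int(np.sqrt(n)))
    order = np.argsort(np.linalg.norm(X_train - query, axis=1))
    scores = {cls: 0.0 for cls in np.unique(y_train)}
    for k in range(1, k_max + 1, 2):               # k = 1, 3, 5, ..., Kmax
        for rank, idx in enumerate(order[:k], start=1):
            # Each neighbour contributes an inverse-logarithm weight by rank
            scores[y_train[idx]] += 1.0 / np.log2(rank + 1)
    # The class with the largest accumulated weight is predicted
    return max(scores, key=scores.get)
```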

Methods

Research datasets

This research is based primarily on the medical domain, with additional datasets from other domains chosen to reduce bias. Table 1 presents the datasets used in this study and their respective attributes in terms of the number of features, data size and domain. They were taken from Kaggle19, the UCI Machine Learning Repository20 and OpenML21. The datasets have different characteristics in terms of features, attributes and sizes, and most belong to the medical domain owing to the relevance of disease risk prediction.

Table 1 A brief list of eight disease datasets considered in this study.

Performance comparison measures

Confusion matrix

The performance measures used to analyse the results are well established and derive from the confusion matrix28. Figure 2 presents the matrix visually. The matrix summarises the classification results through four primary attributes. If the predicted class is 1 and the true value is 1, the result is a true positive (TP); the same principle applied to the value 0 gives a true negative (TN). When the prediction is 1 and the true value is 0, the result is a false positive (FP), with the inverse called a false negative (FN).

Figure 2. Confusion matrix.

In this research, the confusion matrix is used to create three performance measures: accuracy, precision and recall.

The accuracy measure is calculated by dividing the number of true predictions by the total number of predictions.

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP, TN, FP and FN are the true positive, true negative, false positive and false negative cases of the result data, respectively.

The precision measure is calculated by dividing the number of true positives by the total number of positive predictions (true and false positives).

$$Precision = \frac{TP}{TP + FP}$$

where TP and FP are the true positive and false positive cases of the result data, respectively.

The recall measure is calculated similarly, by dividing the number of true positives by the sum of true positives and false negatives.

$$Recall = \frac{TP}{TP + FN}$$

where TP and FN are the true positive and false negative cases of the result data, respectively.
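These three quantities can be computed directly from the confusion-matrix counts, as the minimal sketch below shows (the counts are illustrative values, not results from this study):

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# Illustrative counts only
tp, tn, fp, fn = 40, 45, 5, 10
print(accuracy(tp, tn, fp, fn), precision(tp, fp), recall(tp, fn))
# -> 0.85 0.888... 0.8
```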

These three performance measures will be used to assess the classification results of the variants implemented in this paper. The complete set of these three performance measures will be used to create a new measure that will be discussed in the next section.

Relative performance index (RPI)

The relative performance index is a composite measure that takes the results of any other measure (e.g., accuracy, precision and recall) and produces a single value for the final assessment. The measure proposed here is inspired by the RPI measure proposed by Nagle29, who introduced a calculation that removes bias by considering the range of results produced by a particular field and counting the number of times the results of that field exceeded those of other fields. The RPI for a field is calculated from these counts and the number of fields.

The following equation describes the new performance measure of this study:

$$Relative\;Performance\;Index\;(RPI) = \sum_{i=1}^{d} \frac{a_i - a_i^*}{d}$$

where \(a_{i}^{*}\) is the minimum accuracy/precision/recall value among all variants for dataset \(i\), \(a_{i}\) is the accuracy/precision/recall value for the variant under consideration for dataset \(i\), and \(d\) is the number of datasets considered in this study.

A higher RPI value indicates prediction superiority considering the underlying performance measures (e.g., accuracy and precision) and vice versa.
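A small sketch of this computation, assuming a score matrix whose rows are variants and whose columns are datasets (the layout and values are illustrative only, not the study's results):

```python
import numpy as np

def rpi(scores):
    # scores: (n_variants, n_datasets) matrix of accuracy/precision/recall values
    per_dataset_min = scores.min(axis=0)            # a_i* for each dataset i
    return (scores - per_dataset_min).mean(axis=1)  # one RPI value per variant

scores = np.array([[0.82, 0.78, 0.90],
                   [0.80, 0.81, 0.88],
                   [0.75, 0.76, 0.85]])
print(rpi(scores))  # variant 1 -> mean of (0.07, 0.02, 0.05) = 0.0467
```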

Results

The variants that require a k parameter were tested using k values of 1, 3, 5, 7 and 9. For this results section, the results of the k value with the highest performance for each variant were selected. The accuracy, precision and recall of the different KNN variants on the research datasets are presented in Tables 2, 3 and 4, respectively. Table 5 shows the number of times each variant achieved the highest value for a measure, and Fig. 3 presents the average RPI scores for each measure.

Table 2 Accuracy (%) comparison among KNN variants.
Table 3 Precision (%) comparison among different KNN variants.
Table 4 Recall (%) comparison among different KNN variants.
Table 5 Comparison of KNN variants showing the number of times they presented the highest measurement values.
Figure 3. Average relative performance index (RPI) scores for the three performance measures.

For the accuracy measure, according to Table 2, Hassanat KNN showed the highest average accuracy (83.62%), followed by the ensemble approach KNN (82.34%). However, according to Table 5, the ensemble approach KNN achieved the highest accuracy on the most datasets (three times), followed by the locally adaptive KNN (two times). In Fig. 3, Hassanat KNN had the highest RPI accuracy score, followed by the ensemble approach KNN, with the K-means clustering-based KNN showing the lowest RPI accuracy score.

Concerning the precision measure, according to Table 3, the ensemble approach KNN showed the highest average precision (82.88%), followed by the Hassanat KNN (81.10%). According to Table 5, the ensemble approach KNN achieved the highest precision in seven out of eight datasets, followed by the Hassanat, weight adjusted and fuzzy KNNs (four times each). In Fig. 3, the ensemble approach KNN also obtained the highest RPI score for the precision measure, followed by the Hassanat KNN, and the K-means clustering-based KNN variant again obtained the lowest RPI for this measure.

For the recall measure, as presented in Table 4, the generalised mean distance KNN showed the highest average recall (76.84%), followed by the locally adaptive KNN (76.27%). According to Table 5, the generalised mean distance KNN achieved the highest recall on the most datasets (three out of eight), followed by the locally adaptive and k-means clustering-based KNNs (two times each). As illustrated in Fig. 3, the generalised mean distance variant obtained the highest RPI score for recall, followed by the locally adaptive variant. Finally, as with the other RPI scores, the K-means clustering-based KNN obtained the lowest score once more.

Table 6 summarises, for each of the three measures, the variant with the highest average value, the highest count of best results and the highest RPI score. This table provides essential insight into which KNN variant to consider for disease prediction. For example, if the target performance measures of a research design are average accuracy and RPI (accuracy), then the Hassanat KNN is the most suitable choice. Similarly, the ensemble approach KNN should be considered if the target performance measure is precision.

Table 6 Summary of the different characterisations of measures of all KNN variants this study considered in terms of revealing the highest values.

Although different variants perform differently across all datasets for the three performance measures considered in this study (from Tables 2, 3, 4), we did not find any statistically significant difference in their group-wise values when we applied the one-way ANOVA. Table 7 presents the results from the one-way ANOVA test. None of the significance values is ≤ 0.05.

Table 7 Results from the one-way ANOVA test for checking the significance of the difference of three performance measures across the ten KNN variants considered in this study.

Finally, the advantages and limitations of each variant are distinguished and explained in Table 8.

Table 8 Comparison of KNN variants through advantages and limitations.

Discussion

The experiments and the performance measures provide a comprehensive picture of the variants. The tables in the results section show that most of the variants outperformed the classic KNN in terms of accuracy, precision and recall. The variants that require a k parameter were evaluated using k values of 1, 3, 5, 7 and 9, and for each variant the best-performing k value was selected for the comparison. Although \(k = \surd n\), with n being the size of the dataset, is a common rule of thumb30, applying the same range of k values to every variant keeps the comparison unbiased. The aim of this study is to compare the variants under the same conditions rather than to tune each one for its absolute optimum, so using equal parameter settings (i.e., selecting the best-performing k from the same candidate range for each variant) justifies the juxtaposition of the performance measures.

The K-means clustering-based KNN variant performs the worst in all respects, even worse than the classic KNN algorithm. This may be because the algorithm is built around creating cluster centroids and using 1NN for the final classification. The research datasets are noisy and contain outliers, so the low measures can be attributed to inaccurate creation of the clusters and their corresponding centroids. Additionally, 1NN takes only the single nearest neighbour for classification, which is not enough to produce high accuracy. The algorithm could be improved by using a larger k value to increase the number of nearest neighbours in the final majority voting rule; this would reduce the bias of relying on one nearest neighbour and enhance the results in all three measures.

When choosing a variant solely for accuracy, the Hassanat KNN variant is the most suitable option, as its improved distance metric handles data of different scales and noise levels very well. This ability was evident in this study, since the datasets it was tested on had features of many different types and scales. Although this variant leads in accuracy, it lags in precision and recall. The change of distance metric is still not an optimal replacement for weight attribution, and the variant depends on the k parameter. The variant could arguably be improved further by adding calculations to remove more noise from the training dataset or by adding a step for better weight attribution during the majority voting process.

For improved precision, the ensemble approach KNN variant is more fitting, as it is built on the principle of class weights and uses multiple iterations over different odd k values. These iterations remove the limitation of searching for, or needing, an optimal k parameter, and the variant's design makes its classifications highly precise. Its accuracy is also not far behind that of the other variants, as seen in the result tables. The ensemble approach KNN could be improved further if an optimal range of k values were used instead of fixed values, which would increase its accuracy and precision results further.

Moreover, for greater recall, the generalised mean distance KNN variant is the most appropriate selection. With its local mean vectors and generalised mean distance calculations, this variant generalises the need for weight attribution, and these characteristics give it the highest recall on average across the datasets. However, it does not score as highly on the other measures, partly because of its dependence on the k parameter. Not being able to find the optimal k parameter is a considerable limitation that this variant does not address; determining the optimal k for each dataset could plausibly raise its other measures and improve its overall performance.

This study, as expected, did not find any statistically significant difference in the performance measures across the variants (Table 7). All variants were designed to improve on the baseline KNN, and an improvement of even 1% in any performance measure is considered a meaningful achievement for classification. However, such a slight difference is unlikely to be statistically significant given that each performance measure ranges from 0 to 100%.

Conclusion

Overall, the Hassanat, ensemble approach and generalised mean distance KNNs can be selected as the most suitable variants for disease prediction according to their high accuracy, precision and recall, respectively. These variants address different limitations of the classic KNN and outperformed the rest in overall performance. Among the top three, the ensemble approach KNN obtained the highest average performance measurement values: it achieved the highest precision and also performed well in accuracy and recall. The ensemble approach KNN is therefore the prime variant to choose, and its design for tackling multiple limitations makes it handle medical datasets most effectively for disease risk prediction. More generally, most of the variants demonstrate how effectively they can be used in the medical domain by obtaining high performance measures across a wide set of research datasets. The medical field involves data of different scales and ranges, and these variants demonstrated their capability to overcome the general constraints and classify datasets in this domain. The variants are also adaptable, capable of further enhancement, and can be revamped to address more general limitations.

Throughout the analyses conducted in this study, the potential of one of the simplest machine learning algorithms is clearly visible. The KNN algorithm has many limitations, but the research has shown that it also has one of the most adaptable designs. The variants considered in this study presented results demonstrating how their algorithmic modifications can help by searching for optimal k values, adding better weight attribution, calculating local mean vectors of neighbours, truncating datasets to remove noise, taking mutual neighbours into account, and more. The study has shown how these variants can diminish the limitations and be used for various real-world classification purposes, especially disease prediction. Disease risk prediction is a rising challenge, close to a grand challenge, and the classification accuracies of these variants demonstrate their potential to be improved further.

In terms of future work, the variants highlighted above can be studied further, modified individually or merged. The variants can be integrated with one another; for example, the generalised mean distance KNN could inherit the distance metric of the Hassanat KNN variant, allowing the combined variant to address more limitations than either was originally designed for. Such merged KNN variants can be studied for disease risk classification. Further variants beyond those considered in this study can also be examined in the context of disease risk prediction, with a greater number of medical datasets. The design of KNN is versatile and open to change, and although medical domain-specific datasets are expanding, modifications of KNN have the capability to handle a wide variety of dataset characteristics. Last but not least, another possible future direction is to compare the best performing KNN variants with other related algorithms available in the literature for disease prediction, such as the reptile search algorithm31 and the Aquila optimiser32.