Introduction

With the continuous advancement of computational efficiency, artificial intelligence (AI) systems and their applications across a wide range of domains have gained significant acceptance and importance in our everyday lives1,2,3,4. These sophisticated AI-based systems are frequently employed in sensitive environments, contributing to substantial and life-changing decisions. Hence, ensuring that these systems do not show preferential or prejudicial behaviour towards certain groups or populations is crucial; otherwise, they will be vulnerable to making biased or unfair decisions. Researchers are becoming increasingly aware of the bias inherent in such AI-based systems and the resulting unfair decisions in their real-world commercial applications in various contexts, including hiring5,6 and pretrial detention and release decisions7. Therefore, knowing whether a machine learning (ML) algorithm could generate unfair or biased results before using those results for decision-making is critical. This study aims to develop an approach for evaluating the fairness of a deployed ML algorithm on a given dataset. Although AI encompasses a spectrum of technologies, from rule-based systems to ML algorithms, our focus in this article narrows to ML, a subset of AI in which algorithms improve their performance through exposure to data.

Bias in AI-based systems can arise from various sources and manifest in different forms, each affecting ML fairness. Measurement or reporting bias, for example, may occur when systems such as facial recognition technologies are trained on non-representative datasets, leading to higher misidentification rates for underrepresented groups. Representation bias involves data that do not reflect all demographics, such as gender bias in job recommendation systems influenced by historical hiring data8. Sampling bias, such as training creditworthiness predictors solely on urban populations, leads to inaccurate assessments of rural individuals8. Aggregation bias might obscure specific needs within groups, as seen when medical data aggregated across ages fail to address elderly-specific health issues8. Linking bias introduces errors by incorporating irrelevant data, such as social media activity in credit scoring, while omitted variable bias involves missing crucial variables, such as informal education paths in job screening processes, leading to unfair outcomes8.

Recent efforts to address biases in ML have produced many methodologies and frameworks for enhancing reproducibility and fairness in research outcomes. Notable among these are initiatives such as the reproducibility challenge hosted by Princeton University9 and noteworthy contributions by other researchers [e.g.,10,11]. Further enriching this landscape, foundational reviews12,13 delve into the challenges and solutions surrounding fairness in ML. Zhang and Sun14 explore innovations in unsupervised learning methods for time series clustering, while advancements in data processing within health informatics and communication networks are detailed by Ahmed et al.15 and Lakhan et al.16, showcasing how federated learning strategies are applied in complex data environments. These studies collectively provide a comprehensive overview of the current challenges and technological developments that influence the field.

Measurement bias originates in how users choose, employ and measure particular features17. An example of this bias has been observed in a software tool used in United States courts to predict the chance of reoffending. This tool, named Correctional Offender Management Profiling for Alternative Sanctions, considered ‘prior arrests’ and ‘family or friend arrests’ as proxy variables to quantify future ‘riskiness’ or ‘crime’. These can be viewed as mismeasured proxies because police surveil minority communities more often, which inflates their arrest rates. Given how these communities are assessed and controlled, one should not conclude that people from minority groups are more dangerous simply because they have higher recorded arrest rates17. Representation bias arises if we select a non-representative sample from a population during the data collection phase. Such samples might generate high accuracy on the training data but reveal poor performance when adopted in real-world applications18. Sampling bias is similar to aggregation bias and arises from non-random sampling of subgroups within the population. Aggregation bias can be seen in clinical aid tools where a false conclusion may be drawn about an individual based on observations of the entire population. For example, the HbA1c level is widely used to monitor and diagnose diabetes19; however, its value differs significantly, and in complex ways, across ethnic groups and genders. Therefore, a model that does not consider these two factors will not be a good fit for all ethnic and gender groups in the population. Linking bias arises when a model uses network attributes about individuals for prediction, because a network attribute sometimes does not truly represent the activities and involvement of the underlying node or individual within the network. Wilson et al.20 showed that users exhibit significantly different interactions, in terms of method of interaction and time, compared with their social link patterns. Such a bias rooted in a network can result from many factors, such as network sampling21,22, which can notably change the underlying network measures. Omitted variable bias occurs when the model does not consider a variable essential for the prediction23,24,25. A specific instance is a model developed to predict sales volume for a suburban restaurant with relatively high accuracy that unexpectedly shows poor predictive performance even though the values of the model attributes remain almost unchanged. Further investigation reveals that a new, competitively priced restaurant in the same area is the main reason for the degraded performance, a feature the original model did not consider.

An ML algorithm can exhibit different fairness levels when applied to different datasets26. Likewise, a single dataset could reveal disparate fairness levels across different ML algorithms. Even for the same dataset, one of its protected attributes (e.g., gender) might significantly lack fairness while another (e.g., ethnicity) produces a fair outcome. Researchers have followed different approaches to identifying fairness. Zhang et al.27 proposed an explorative approach that can discover potential biases, provide possible underlying reasons for their presence and mitigate the most important one. D’Amour et al.28 used simulation to explore the long-term behaviour of deployed ML-based decision systems. Researchers have also suggested descriptive and prescriptive approaches, such as in29,30,31,32, for addressing the ML fairness issue. Nonetheless, there is currently no proposed method capable of statistically establishing the existence of unfairness in supervised ML algorithms. Unlike existing approaches, which primarily explore potential biases or simulate long-term behaviours without establishing statistical proof of fairness or unfairness, our methodology integrates robust statistical testing within a cross-validation framework. Such a methodological approach allows for detecting and quantifying the degree of fairness in a statistically significant manner, considering various protected attributes such as gender and ethnicity. This capability to provide concrete statistical evidence of fairness sets our method apart, underscoring its novelty in a landscape where descriptive and prescriptive approaches have been prevalent but insufficient for statistically validating the fairness of supervised ML algorithms.

Given the shortcomings in current methodologies for evaluating fairness in AI systems, particularly the lack of statistical proof of fairness, we developed the following research objectives:

  • Can we develop a robust statistical testing methodology integrated within a cross-validation framework to detect, quantify, and address biases in ML algorithms?

  • Can this methodology be empirically validated across various datasets to ensure it effectively tests and demonstrates fairness in ML algorithms, considering different protected attributes such as gender and ethnicity?

  • How does the proposed methodology compare to existing approaches primarily focusing on identifying potential biases or simulating long-term behaviours without providing concrete statistical evidence of fairness?

To address these objectives, this research introduces a simple yet original approach that detects the existence of unfairness in a statistically significant manner by integrating robust statistical testing within a cross-validation setup.

Definition of fair machine learning

Fairness in ML, rooted in the philosophical and psychological discussions of equity and justice, lacks a universal definition within its domain. ML fairness is concerned with ensuring equitable treatment across all individuals, particularly in decision-making contexts that affect groups based on legally protected characteristics, such as gender, ethnicity, and socioeconomic status. To clarify, we differentiate three primary types of fairness: individual fairness, which ensures similar treatment for similar individuals; group fairness, which aims for proportional outcomes across different demographic groups to prevent systemic discrimination; and subgroup fairness (or intersectional fairness), which extends protections to intersecting group identities (e.g., Black women or disabled veterans), ensuring that combined characteristics do not lead to compounded biases8. This categorisation helps articulate the specific applications and implications of fairness, which is crucial for implementing sensitive and just ML practices in diverse real-world applications8. All fairness definitions in ML rely on simple or compound metrics associated with the confusion matrix, as illustrated in Fig. 1. The fairness definitions commonly employed in the algorithmic perspective within the ML context are outlined below:

Figure 1. Illustration of (a) the confusion matrix and (b) the associated measures used in this study.

Definition 1 (Equalised Odds)

For a given dataset, the deployment of a supervised ML algorithm will be fair if, for the protected and unprotected groups (e.g., male and female), it shows equal values for true positive rate (TPR) and false positive rate (FPR)33. TPR is the proportion of actual positive instances correctly identified by a classification model out of the total number of actual positive instances. FPR is the proportion of negative cases incorrectly identified as positive out of the total negative instances.
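
In confusion-matrix terms (Fig. 1), these two rates, and the Equalised Odds condition, can be written as follows, with subscripts \(p\) and \(u\) denoting the protected and unprotected groups:

\[
\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN},
\]
\[
\text{Equalised Odds:} \quad \mathrm{TPR}_{p} = \mathrm{TPR}_{u} \;\text{ and }\; \mathrm{FPR}_{p} = \mathrm{FPR}_{u}.
\]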

Definition 2 (Equal Opportunity)

For a given dataset, deploying a supervised ML algorithm will be fair if, for the protected and unprotected groups, it shows equal values for TPR33.

Definition 3 (Treatment Equality)

For a given dataset, deploying a supervised ML algorithm will be fair if, for the protected and unprotected groups, it shows equal values for the ratio between false negatives and false positives34.

Definition 4 (Comprehensive)

Of the two metrics associated with the first fairness definition (TPR and FPR), only the first (TPR) is used by the second definition. The third definition likewise relies on a single metric: the ratio between false negatives and false positives. This study therefore considered a fourth definition that aggregates all conditions from these three definitions. According to this definition, which offers a comprehensive way to define fairness, employing a supervised ML algorithm on a given dataset will generate a fair outcome if, for the protected and unprotected groups, it shows equal values for (a) TPR, (b) FPR and (c) the ratio between false negatives and false positives.
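
Using the same notation as above, the comprehensive definition requires all three equalities to hold simultaneously:

\[
\mathrm{TPR}_{p} = \mathrm{TPR}_{u}, \qquad \mathrm{FPR}_{p} = \mathrm{FPR}_{u}, \qquad \frac{FN_{p}}{FP_{p}} = \frac{FN_{u}}{FP_{u}}.
\]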

Proposed fairness evaluation approach

The protected and unprotected groups could show a very large or very small difference in the three metrics (TPR, FPR and the ratio between false negatives and false positives) used to define fairness in ML. However, it is impossible to establish a statistically significant difference using only a single instance of each value. We need multiple instances of these values to explore whether there is a statistically significant difference between protected and unprotected groups for any of these three metrics.

Our proposed approach is, therefore, designed to generate multiple instances of three key metrics, enhancing the robustness and reliability of our fairness assessments. To achieve this, we utilise an increased k value in the k-fold cross-validation process during the training phase when implementing a supervised ML algorithm on a specific dataset. In the Discussion section, we outline the criteria for selecting the optimal k value for k-fold cross-validation in our proposed methodology. By setting a higher k value, we ensure that each of the three metrics is calculated multiple times during the validation stage, providing a comprehensive view of the model performance across different subsets of the data. The k-fold cross-validation is a well-established technique to evaluate the performance of a predictive model35. During each iteration of the validation process, (k-1) of these folds are used to train the model, while the remaining fold is used as a validation set to test the model performance. This cycle is repeated k times, with each fold serving as the validation set once, ensuring that all data points are used for training and validation (Fig. 2). Ultimately, the strength of k-fold cross-validation, coupled with our approach to selecting k, positions our methodology as a robust tool for developing and validating ML models, particularly in applications where fairness and unbiased performance are paramount.

Figure 2. Use of k-fold cross-validation to create multiple instances of the desired metrics (TPR, true positive rate; FPR, false positive rate; FP, false positive; FN, false negative). FP and FN are used to calculate their ratio values (Definition 3, Treatment Equality). For demonstration, we consider a k value of 10 here; it can take other values based on the underlying dataset.
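
This fold-level metric collection can be illustrated with a minimal sketch; the synthetic data, the choice of a support vector classifier and the value \(k=10\) are assumptions for demonstration only, not the released code.

```python
# Minimal sketch of fold-level metric collection: each validation fold contributes
# one value of TPR, FPR and the FN/FP ratio. Synthetic data and SVC are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=12, random_state=0)

k = 10
tpr_vals, fpr_vals, ratio_vals = [], [], []
for tr, va in StratifiedKFold(n_splits=k, shuffle=True, random_state=0).split(X, y):
    y_pred = SVC().fit(X[tr], y[tr]).predict(X[va])
    tn, fp, fn, tp = confusion_matrix(y[va], y_pred).ravel()
    tpr_vals.append(tp / (tp + fn))                # true positive rate for this fold
    fpr_vals.append(fp / (fp + tn))                # false positive rate for this fold
    ratio_vals.append(fn / fp if fp else np.nan)   # FN/FP ratio (Treatment Equality)

print(f"{len(tpr_vals)} fold-level values collected per metric")
```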

Once we have multiple instances of these three metrics, we can employ statistical tests to check whether there is a statistically significant difference between the protected and unprotected groups. If the underlying attribute has only two groups, we can apply the independent-sample t-test; otherwise, one-way analysis of variance (ANOVA) or the Kruskal–Wallis test36. A t-test is a statistical test used to determine whether there is a significant difference between the means of two groups for a given attribute37. One-way ANOVA is an extension of the t-test used to compare the means of three or more groups. The key idea behind one-way ANOVA is to partition the total variability observed in the data into two components: the variability between group means and the variability within each group. If the former is significantly larger, it suggests a significant difference among group means36. The Kruskal–Wallis test is a non-parametric test used to compare the median values of three or more groups; when the assumption of normality or equal variance is not met, it is used as an alternative to ANOVA36. A p-value below a chosen threshold for these tests (t-test, ANOVA or Kruskal–Wallis) indicates a statistically significant difference in treatment or outcomes between groups. However, a p-value above this threshold does not confirm the absence of meaningful differences or imply fairness. Moreover, when 0.05 is taken as the significance level, these tests treat outcomes with p-values of 0.06 and 0.46 identically, a limitation of any threshold-based statistical test. The contextual sensitivity of the underlying data should inform the selection of the p-value threshold (e.g., 0.05, 0.01 or 0.001) for determining statistical significance.
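
The test-selection step can be sketched as follows, assuming the fold-level values of one metric have already been collected per subgroup; the arrays and group names are hypothetical placeholders, and the Shapiro–Wilk normality check is one reasonable way to decide between ANOVA and Kruskal–Wallis rather than a prescribed recipe.

```python
# Sketch of choosing and running the statistical test on fold-level metric values.
# The metric values below are hypothetical placeholders for illustration.
import numpy as np
from scipy import stats

group_metrics = {
    "group_a": np.array([0.82, 0.79, 0.85, 0.80, 0.83]),
    "group_b": np.array([0.66, 0.71, 0.64, 0.69, 0.67]),
}

samples = list(group_metrics.values())
if len(samples) == 2:
    # Two subgroups: independent-sample t-test
    stat, p = stats.ttest_ind(*samples)
elif all(stats.shapiro(s)[1] > 0.05 for s in samples):
    # Three or more subgroups with approximately normal values: one-way ANOVA
    stat, p = stats.f_oneway(*samples)
else:
    # Otherwise, the non-parametric Kruskal-Wallis test
    stat, p = stats.kruskal(*samples)

print(f"p-value = {p:.4f} -> "
      f"{'statistically significant difference' if p < 0.05 else 'no significant difference'}")
```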

To illustrate how the proposed approach operates, consider a dataset containing 12 attributes, one of which is race, with two potential values (white and black). The target attribute is a binary variable indicating the approval or denial of a home loan application. We created an AI system employing the support vector machine (SVM) algorithm that can determine the approval or rejection of a loan application based on the provided values for these 12 attributes. Suppose we investigate whether the SVM-based AI system produces a fair outcome with respect to the categorical race attribute. We first need to split the data into two subsets: one for white people and the other for black people. Then, for each subgroup, we train the system through k-fold cross-validation with \(k=20\). This training approach eventually creates 20 values for each of the three metrics (TPR, FPR and the ratio between false negatives and false positives) for each group. Since we have only two subgroups (white and black), we apply the independent-sample t-test to investigate any statistically significant difference between the two groups for any of these three metrics. A statistically significant result indicates that the developed SVM-based AI system produces unfair outcomes for the race attribute of the given dataset. Figure 3 outlines the steps for this example.

Figure 3. Illustration of the processes by which the proposed approach can determine the fairness of a deployed ML algorithm when applied to a given dataset with ‘gender’ as the protected attribute (SVM, support vector machine; TPR, true positive rate; FPR, false positive rate; FP, false positive; FN, false negative). For ease of demonstration, we consider a k value of 20 here.
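
An end-to-end sketch of this worked example is given below. The synthetic data and the column names (‘race’, ‘approved’) are assumptions for illustration only; this is not the authors’ released implementation.

```python
# End-to-end sketch: split the data by the protected attribute, cross-validate an
# SVM within each subgroup with k = 20, and compare the fold-level metrics with
# independent-sample t-tests. Data and column names are illustrative assumptions.
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame(rng.normal(size=(n, 11)), columns=[f"x{i}" for i in range(11)])
df["race"] = rng.choice(["white", "black"], size=n)                 # protected attribute
df["approved"] = (df["x0"] + rng.normal(size=n) > 0).astype(int)    # synthetic loan outcome

def fold_metrics(subgroup: pd.DataFrame, k: int) -> dict:
    """Cross-validate within one subgroup and return per-fold TPR, FPR and FN/FP."""
    X = subgroup.drop(columns=["race", "approved"]).to_numpy()
    y = subgroup["approved"].to_numpy()
    out = {"tpr": [], "fpr": [], "fn_fp": []}
    for tr, va in StratifiedKFold(n_splits=k, shuffle=True, random_state=0).split(X, y):
        y_pred = SVC().fit(X[tr], y[tr]).predict(X[va])
        tn, fp, fn, tp = confusion_matrix(y[va], y_pred).ravel()
        out["tpr"].append(tp / (tp + fn))
        out["fpr"].append(fp / (fp + tn))
        out["fn_fp"].append(fn / fp if fp else np.nan)
    return out

white = fold_metrics(df[df["race"] == "white"], k=20)
black = fold_metrics(df[df["race"] == "black"], k=20)

for metric in ("tpr", "fpr", "fn_fp"):
    _, p = ttest_ind(white[metric], black[metric], nan_policy="omit")
    verdict = "unfair (significant difference)" if p < 0.05 else "fair"
    print(f"{metric}: p = {p:.3f} -> {verdict}")
```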

Application of the proposed fairness evaluation approach

This study considered five open-access datasets, four from Kaggle and one from the UCI Machine Learning Repository, to demonstrate an application of the proposed fairness evaluation approach. Kaggle is a platform that offers robust tools and resources for the data science and AI community, including over 300,000 open-access datasets38. All four Kaggle datasets come from a disease prediction context and aim to make a binary prediction. Gender is the protected feature for the first three datasets, and race for the fourth. The remaining dataset is from the UCI Machine Learning Repository, which compiles more than 650 open-access datasets, providing the ML community with ample resources for empirical exploration39. Race is the protected feature for this dataset. Table 1 details these five datasets. We share the corresponding code on GitHub (https://github.com/haohuilu/fairml/).

Table 1 Details of the five datasets used for fairness evaluation.

We considered six classical ML algorithms to illustrate the application of the proposed fairness evaluation approach: SVM, logistic regression (LR), decision tree (DT), random forest (RF), k-nearest neighbour (KNN) and artificial neural network (ANN). Our proposed approach can determine whether deploying one or more of these six ML algorithms against any of the five datasets will yield a biased or unfair outcome for the underlying protected attribute (i.e., gender for the first three datasets and race for the last two). We applied the default hyperparameter settings provided by the Scikit-learn library45, which minimises the potential for bias that can be introduced through extensive hyperparameter tuning; the supplementary material includes a comprehensive description of all parameter settings and configurations used for the classifiers. Further, we have compiled a tabular summary of prominent ML algorithms, detailing their respective approaches to quantifying fairness, to enhance the theoretical perspective of our analysis (see Table 2). The table covers various algorithms and the methodologies used to assess fairness across different demographic groups. This study chose a k value of 20 for the first three and fifth datasets; for the fourth, it is ten since one of the target subgroups has a small number of instances. The default parameter settings of Scikit-learn were used when implementing these ML algorithms against the selected datasets46.
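
The six classifiers, instantiated with Scikit-learn defaults, can be expressed as in the short sketch below; mapping ANN to Scikit-learn’s MLPClassifier is our assumption here, and the exact configurations used in the study are those listed in the supplementary material.

```python
# The six classifiers as instantiated with Scikit-learn default hyperparameters.
# Mapping ANN to MLPClassifier is an assumption for this sketch.
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

classifiers = {
    "SVM": SVC(),
    "LR": LogisticRegression(),
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(),
    "KNN": KNeighborsClassifier(),
    "ANN": MLPClassifier(),
}
```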

Table 2 Summary of Machine Learning algorithms and their approaches to quantifying fairness.

Based on the four definitions (as described in Sect. “Definition of fair machine learning”), Table 3 presents the corresponding fairness outcomes for the six ML algorithms against the five datasets. The first definition (Equalised Odds) involves two measures (TPR and FPR) and hence requires two independent-sample t-tests. The second (Equal Opportunity) and third (Treatment Equality) definitions require one t-test each since each considers only one measure for comparison; TPR, the only measure for the second definition, is also present in the first. The last definition (Comprehensive) needs three t-tests, aggregating all measures from the first three definitions. Supplementary Fig. 1 shows the plots for the three metrics (TPR, FPR, and the ratio between false negatives and false positives) used in these four fairness definitions across the five research datasets. DT and RF reveal a fair outcome against all four fairness definitions for datasets D1, D2 and D5, and DT further shows a similar result for dataset D3. ANN shows the same result for datasets D1 and D2; for KNN, it is datasets D1 and D5.
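
The mapping from fairness definitions to the required t-tests can be summarised in the following sketch, where a definition is deemed fair only if every one of its associated tests is non-significant; the p-values shown are hypothetical placeholders, not results from Table 3.

```python
# Sketch of how per-metric p-values map onto the four fairness definitions:
# a definition is judged fair only if every one of its t-tests is non-significant.
ALPHA = 0.05

DEFINITIONS = {
    "Equalised Odds":     ["tpr", "fpr"],           # two t-tests
    "Equal Opportunity":  ["tpr"],                  # one t-test
    "Treatment Equality": ["fn_fp"],                # one t-test
    "Comprehensive":      ["tpr", "fpr", "fn_fp"],  # three t-tests
}

p_values = {"tpr": 0.21, "fpr": 0.03, "fn_fp": 0.47}  # hypothetical t-test results

for name, metrics in DEFINITIONS.items():
    fair = all(p_values[m] > ALPHA for m in metrics)
    print(f"{name}: {'fair (-)' if fair else 'unfair'}")
```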

Table 3 P-values from the independent-sample t-test assessing the fairness of machine learning (ML) algorithms across five datasets. Four fair ML definitions from Sect. “Definition of fair machine learning” have been considered. An ‘–’ entry (i.e., p > 0.05) indicates a fair ML outcome.

Among the selected ML algorithms, SVM is the only one that demonstrates unfair outcomes against one or more fairness definitions for all five datasets, as indicated in the second column of Table 3a–e. Notably, SVM exhibits unfair outcomes concerning all four fairness definitions for datasets D4 and D5, for which race is the protected attribute. After SVM, LR shows unfair results most often: it displays at least one unfair result for datasets D1–D4, with datasets D1 and D2 showing unfair results across all four fairness definitions. Across datasets, D4 revealed unfair outcomes most often (20 out of 24 tests), followed by D3 (seven).

Although our proposed approach is demonstrated primarily for binary classification tasks and single protected variables, it can be applied to more complex scenarios, such as classification tasks with more than two categories and/or multiple protected variables with two or more groups. When dealing with multi-class classification, the only change is in the dimensionality of the confusion matrix, which alters the calculation of metrics such as TPR, FPR, FP, and FN. However, if we have more than one protected variable (say two), each with two groups, we will have four sets of values (2 × 2) for each of the three metrics (TPR, FPR, and the ratio between false negatives and false positives). Similarly, if we have three protected variables with two groups each, there will be six sets of values (3 × 2) for these three metrics. In such cases, ANOVA or a similar statistical test should be used for statistical comparison instead of the t-test to accommodate the increased complexity.
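
A minimal sketch of this multi-group extension is shown below, assuming two protected attributes with two groups each and hypothetical fold-level TPR values; in practice, one of ANOVA or Kruskal–Wallis would be chosen based on the distributional checks discussed earlier.

```python
# Sketch of the multi-group extension: with two protected attributes of two groups
# each, the fold-level values of a metric (here TPR) form four groups compared with
# one-way ANOVA or Kruskal-Wallis rather than a t-test. Values are hypothetical.
import numpy as np
from scipy.stats import f_oneway, kruskal

tpr_by_group = {
    "female, group A": np.array([0.81, 0.84, 0.80, 0.83, 0.82]),
    "female, group B": np.array([0.74, 0.70, 0.73, 0.72, 0.75]),
    "male, group A":   np.array([0.83, 0.85, 0.82, 0.84, 0.81]),
    "male, group B":   np.array([0.69, 0.72, 0.68, 0.71, 0.70]),
}

groups = list(tpr_by_group.values())
f_stat, p_anova = f_oneway(*groups)   # parametric comparison of group means
h_stat, p_kw = kruskal(*groups)       # non-parametric alternative
print(f"ANOVA p = {p_anova:.4f}, Kruskal-Wallis p = {p_kw:.4f}")
```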

Discussion

Based on k-fold cross-validation and the statistical t-test, our study tackles a pertinent research issue within the domain of fair ML by introducing a fairness evaluation methodology grounded in classical statistical testing. We implement the proposed approach on five benchmark datasets, with gender as the protected attribute in three and race as the protected attribute in the remaining two. This study observed variability in fairness outcomes across four different fairness definitions and six ML algorithms for the same dataset.

The selected k value in the k-fold cross-validation within the proposed fairness evaluation approach may vary or be reduced based on factors such as dataset size and subclass statistics. If one of the subgroups based on the underlying protected variable is small (e.g., under 100), a higher k value will leave only a few data instances for the validation phase. For D5, we considered k = 10, while for the other four datasets, it is 20. The ultimate goal of any ML algorithm is to develop a model that will perform well on both the training data and new unseen data. In this regard, selecting an appropriate k value is crucial. A high value (k = n) can lead to a higher bias since the underlying model has been trained on almost the entire dataset in each fold, and only a single instance remains for validation47. In ML, bias is the difference between the average prediction of the model and the correct value it is trying to predict. However, the variance, defined as the change in model performance when using different portions of the training data, is likely to be lower with a high k value because the model is trained on an extensive set of data in each fold47. A lower value (k = 2 or 3) could also lead to a higher bias, as the model may not capture the underlying patterns within the data effectively. The variance then tends to be higher since the model is trained on smaller subsets, making it more sensitive to variations in the training data. Hence, the bias-variance trade-off is significant in real-world application contexts of ML-based systems, and the selected k value of the k-fold cross-validation is a fundamental factor in achieving this trade-off48. Our illustration of the proposed fairness evaluation approach used k values of 10 and 20. This study did not consider the bias-variance trade-off in selecting these values; such consideration could have an impact on the findings reported in Table 3.
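
One way to inspect the effect of candidate k values before running the fairness tests is sketched below; the synthetic data, the decision tree classifier and the candidate k values are assumptions for illustration, with recall used as a stand-in for the fold-level TPR.

```python
# Sketch: inspect how candidate k values affect the spread of a fold-level metric
# before fixing k. Data, classifier and k values are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=12, random_state=0)

for k in (5, 10, 20):
    scores = cross_val_score(
        DecisionTreeClassifier(random_state=0), X, y,
        cv=StratifiedKFold(n_splits=k, shuffle=True, random_state=0),
        scoring="recall",            # recall of the positive class equals TPR
    )
    print(f"k={k:2d}: mean TPR {scores.mean():.3f}, "
          f"std {scores.std():.3f}, {len(scores)} fold-level values")
```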

The results from the statistical test underscore a pivotal debate in the fairness of ML algorithms: different ML models may exhibit varying degrees of fairness when applied to the same dataset. This variation emphasises the intricate relationship between model architecture and fairness outcomes. For instance, in D1, methods such as DT, RF, KNN and ANN demonstrate fairness, as indicated by "–" (meaning p > 0.05) in Table 3a. However, LR exhibits unfairness across all four definitions, whereas SVM only shows unfairness for Definition 3 (Treatment Equality), illustrating the variability in how different models align with or deviate from fairness criteria. This discrepancy can be attributed to how each model processes the underlying data and the sensitivity of each model to the protected attributes. This scenario opens up a complex landscape where the inherent characteristics of a model significantly influence its fairness, suggesting that no one-size-fits-all solution exists for achieving fair ML.

The debate over varying fairness outcomes for different ML algorithms for the same dataset leads to a broader discussion on defining fairness in ML. Fairness is not a monolithic concept but rather multifaceted, with various definitions applicable depending on the context. The variance in fairness outcomes across different ML models for the same dataset accentuates the challenge of adopting a universal fairness definition. For instance, one definition of fairness may prioritise equal outcomes (Definition 1 Equalised Odds) across groups, while another might emphasise equal opportunities (Definition 2–Equal Opportunity) to achieve these outcomes. This complexity is further compounded by the technical characteristics and assumptions embedded within each ML model, which may align or conflict with a specific fairness definition. One of the key strengths of our proposed methodology is its adaptability to various fairness definitions beyond the four we have explored in this study. It is designed to accommodate future fairness metrics that may differ from the three (TPR, FPR, and the ratio between false negatives and false positives) we currently utilise. This flexibility ensures that our approach remains applicable and relevant as new definitions and metrics emerge.

Further complicating the discussion is the consideration of ensemble approaches, which combine multiple ML models. The variance in individual model fairness outcomes raises the question of how ensemble methods, integrating potentially fair and unfair models according to specific definitions, impact overall fairness. Ensemble methods, such as RF, are designed to improve prediction accuracy by aggregating the predictions of several models49. However, the fairness of these aggregated predictions remains an open question, especially if the ensemble includes a mix of models that individually exhibit both fair and unfair outcomes. This highlights a crucial area for further research: understanding and mitigating potential biases introduced through ensemble methods, which might not only inherit but also amplify the biases of their constituent models.

How a machine learning model works can significantly contribute to its fairness. The architectural intricacies of a model, including how it learns from data and makes predictions, can inherently influence its fairness outcomes. More transparent and interpretable models like DT may offer more precise insights into how fairness is achieved or compromised. In contrast, more complex models, such as deep learning models like ANN, may obscure the mechanisms leading to fair or unfair outcomes. This understanding is pivotal in devising strategies to enhance fairness, such as feature selection, modifying the learning algorithm, or applying post-processing fairness corrections.

The discussion on ML fairness is incomplete without acknowledging that the definition of fairness is context-dependent. Different applications and domains may require different fairness considerations, reflecting varied societal norms and ethical considerations. For example, fairness in a healthcare application might focus on equal predictive accuracy across different racial groups. In contrast, fairness might prioritise equal loan approval rates across genders in a financial application. This diversity necessitates a flexible approach to defining and achieving fairness tailored to the specific needs and impacts of the underlying ML applications.

The fairness of ML algorithms is a multifaceted issue influenced by the choice of model, the definition of fairness, and the application context. This research has limitations, including the need for validation against real-world evidence to bolster the robustness of fairness assessments. Additionally, the choice of k in k-fold cross-validation emerges as a critical factor influencing t-test results and fairness interpretations. Despite its power in statistical comparison and its wide usage in various contexts, the t-test may not detect small but potentially meaningful differences, especially where sample sizes are limited or effect sizes are small50. Another possible limitation is that the variability observed in the three metrics could stem from factors other than the algorithmic implementation techniques, including, among others, algorithmic sensitivity to specific subsets of the data, the inherent instability of the algorithm, non-determinism in training due to unseeded pseudo-random number generators, and the effects of parallelisation or GPU usage. Future research should address these implementation issues and develop more flexible fairness definitions that accommodate diverse ML applications. In addition, we aim to incorporate explainable artificial intelligence techniques to enhance the transparency and understandability of ML-based models, aligning with best practices detailed in the literature51. This inclusion will improve model interpretability and ensure our methodologies adhere to the ethical standards and guidelines established in the field of responsible AI. Moreover, exploring methods to enhance the fairness of ensemble approaches and investigating the impact of model architecture on fairness outcomes are imperative. These endeavours will advance fair ML practices and ensure that ML applications enhance equity and justice across all sectors of society.

Conclusion

In conclusion, this research introduces a novel approach to assess fairness in ML algorithms, demonstrating its application across five diverse benchmark datasets. The findings underscore the complexity of achieving fairness, evidenced by the variability in fairness outcomes across different ML models and fairness definitions. Key insights include the lack of a universally fair ML model, the contextual nature of fairness, the challenges posed by ensemble methods, and the impact of model architecture on fairness outcomes. The research highlights the importance of adopting a nuanced perspective on ML fairness that is sensitive to model selection, fairness definitions, and application contexts. Limitations such as the need for further validation with real-world datasets and the influence of parameter selection on fairness assessments suggest areas for future research. These include developing adaptive fairness definitions to suit varied applications, addressing issues related to algorithmic implementation techniques, enhancing fairness in ensemble methods, and exploring the relationship between model architecture and fairness. Addressing these areas will advance the field towards more equitable ML practices, ensuring that AI technologies contribute positively to societal needs across all domains.