Introduction

Which country’s national healthcare system (NHCS) responded better to the outbreak of the COVID-19 pandemic? The novel coronavirus disease 2019 (COVID-19), caused by SARS-CoV-2 and often called the pandemic of the century, has profoundly impacted healthcare systems worldwide since it was first identified in Wuhan, China1,2. Characterized by high transmission and fatality rates, COVID-19 has disrupted the normal functioning of healthcare systems globally3.

At the peak of each wave, public health authorities cancelled all non-urgent healthcare services to free up capacity for the expected surge in COVID-19 patients, aiming to prevent healthcare system collapse, as witnessed in countries like Italy4 and India5. Countries implemented various measures to suppress the spread of COVID-19, maintain healthcare capacity, deliver high-quality care, and reduce secondary mortality rates. However, physical distancing, movement restrictions, and the cancellation of non-urgent services led to other healthcare problems, such as reduced access to cancer treatments, increased anxiety, mental exhaustion, and backlogged waitlists6,7,8.

Significant differences in responses and outcomes across countries with varying death and recovery rates underscore the need for systematic approaches to assess and benchmark NHCS performance. For instance, France and Germany exhibited vastly different death and recovery rates during the pandemic’s first 100 days despite similar geographic and socio-economic conditions9. Similarly, Canada and the USA showed significant NHCS performance and efficiency variations despite substantial resource allocation (see Fig. 1). This has motivated researchers to develop methodological approaches to compare NHCS performance during global pandemics like COVID-19.

Figure 1

Recovery and death rates, and healthcare resources from selected OECD countries: Canada, France, Germany, Italy, New Zealand, South Korea, and the USA: (a) daily cumulative recovery rates, (b) daily cumulative death rates, (c) resources of NHCS.

International comparisons of NHCSs improve health services, enforce accountability, share strategies, and encourage mutual learning10. However, such comparisons face significant challenges, including defining the boundaries of the health system, selecting performance metrics, ensuring comparability across borders, and accounting for dynamic responses over time. The complexity is compounded by the tension between assessing the entire system versus fragmented comparisons of health units (e.g., hospitals). Various international organizations, such as the World Health Organization, the Organization for Economic Co-operation and Development (OECD), and the European Union, have adopted system assessments to influence public policies and accelerate debates about NHCS performance10.

Interest in comparing international health systems is growing as NHCSs face increasing pressure to improve performance10,11. However, defining the unit of analysis is challenging. According to the World Health Organization (WHO)12, a health system encompasses all organizations, agents, and actions involved in healthcare provision, public health promotion, and protection. Defining national health system boundaries is complex due to varying perspectives and levels of analysis13,14. Additionally, the objectives and terminology used to describe healthcare systems are often ambiguous15,16.

Comparing NHCS performance during a global pandemic is a complex and challenging task. The pandemic simultaneously impacts health, social, and economic systems, making it difficult to isolate healthcare system effects. Moreover, there are no standardized criteria or performance metrics for comparing NHCS performance, which adds to the complexity. The dynamic nature of NHCS responses, which evolve as systems learn and adapt, further complicates comparisons. This complexity presents an intriguing challenge for researchers, policymakers, and professionals in health systems research, health policy analysis, and health economics.

Comparing health systems’ performance is a complex task studied extensively in fields such as health systems research, health policy analysis, and health economics13,17,18. Additionally, operational research techniques address issues like health system bottlenecks and low hospital performance19. These techniques have also been pivotal in solving hospital and health system challenges related to COVID-1920,21,22,23,24. Unlike these recent works, our proposed approach first applies fuzzy clustering to group countries into homogeneous and distinct clusters based on the similarity of their population health profiles. It then applies longitudinal Data Envelopment Analysis (DEA)25 to the comparable countries within each cluster. Finally, a qualitative bi-criteria analysis captures the dynamics of each country’s response over time, considering both efficiency and clinical performance.

This study introduces an innovative bi-criteria method to dynamically compare NHCS responses to the COVID-19 outbreak. The proposed method, which combines fuzzy clustering and longitudinal data envelopment analysis (DEA), provides a comprehensive and dynamic assessment of NHCS performance and efficiency during the outbreak phase. Clinical outcomes serve as the first criterion for assessing NHCS performance, while longitudinal DEA evaluates response efficiency using five inputs representing the resources available to the NHCS and two outputs representing the ratios of recovered and surviving cases to the total number of confirmed cases. The algorithm categorizes each NHCS as an efficient-performer, inefficient-performer, efficient-under-performer, or inefficient-under-performer over the pandemic’s first 100 days. The study visually represents performance dynamics over time, highlighting each cluster’s best and worst performers during the early outbreak phase. This algorithm has potential applications in various domains and levels of analysis, particularly in complex performance evaluation and benchmarking. The theoretical and practical implications for performance assessment and benchmarking are discussed, demonstrating the practicality and innovation of this approach for researchers, policymakers, and professionals in health systems research, health policy analysis, and health economics.

This paper is structured as follows: the Related work section reviews the related literature. The Methodology: qualitative bi-criteria approach for the comparison of NHCS section develops the bi-criteria approach for comparing NHCSs. The Results section presents the empirical results for the first 100 days of the pandemic. The Discussion and managerial implications section discusses the theoretical and practical implications of the work. Finally, the Conclusion section summarizes the essential findings and their implications.

Related work

Comparing the performance of health systems is a persistent challenge in several streams of literature13,17,18. With the COVID-19 pandemic, the need for systematic approaches to assess and benchmark national healthcare system (NHCS) performance has become even more pressing10. The global nature of the pandemic provides a unique context for comparing NHCS responses, as all systems faced almost the same disruption simultaneously.

Our study compares the responses of NHCSs from 37 member countries of the Organization for Economic Co-operation and Development (OECD), driven by the availability of reliable data. We acknowledge that data from other countries may be less readily available or trustworthy, and contextual differences can introduce further complications. For example, populations in different countries vary in their vulnerability to COVID-19, with front-line healthcare workers at a higher risk of infection26. Additionally, demographic groups such as women have lower infection and mortality rates than men27. The mean age of death due to COVID-19 is around 82 years28,29,30, and pre-existing comorbidities are associated with higher mortality rates31,32,33,34. Furthermore, populations with higher vulnerability to COVID-19 are reported to require more intensive care29 and experience higher mortality rates28.

Our work builds on the existing literature evaluating NHCSs, which typically considers three main performance criteria: access to healthcare services, quality of care, and efficiency (cost)35,36. However, the definitions of these domains, how they overlap, and how they relate to critical health system objectives remain unclear15,16,37. Additionally, the distribution of performance objectives across different population groups adds complexity to evaluating healthcare performance38.

Two primary methodological approaches have been used to assess NHCS efficiency. The first is ratio analysis, based on individual partial indicators, often expressed as ratios, that each capture a different facet of efficiency39,40. This approach can lead to conflicting results because multiple ratios must be calculated to cover the various performance dimensions41. The second approach uses non-parametric techniques such as DEA, which computes a comprehensive efficiency index42,43,44,45. Our work is grounded in this second stream, as DEA has been widely used to assess the relative performance of healthcare organizations and identify ways to improve their efficiency46. For example, DEA has been applied to estimate the technical efficiency of hospitals in Turkey47, assess healthcare center performance17, evaluate health insurance reform in China48, and measure the efficiency of rural hospitals in Missouri49. DEA has also been used to assess the performance of healthcare systems during the COVID-19 pandemic19,50,51,52,53,54,55.

Comparing NHCS responses to COVID-19 is a multi-criteria problem, requiring consideration of multiple performance dimensions. Multiple criteria decision analysis (MCDA) addresses complex decision-making problems involving conflicting and non-commensurable evaluations56,57,58. Most MCDA methods construct representations in the preference space and use different preference aggregation procedures to synthesize partial preference structures into an aggregated preference structure59,60. However, the comparison of NHCS does not fit within traditional MCDA literature56,57,61 for several reasons: there is no single decision-maker guiding the analysis, the output is not a traditional decision problem, and it is crucial to capture the dynamics of NHCS performance over time.

No existing method combines longitudinal DEA and fuzzy clustering to create a qualitative, bi-criteria decision analysis framework for comparing NHCS responses to the COVID-19 pandemic. Our approach addresses this gap by providing a comprehensive and dynamic NHCS performance and efficiency assessment.

Methodology: qualitative bi-criteria approach for the comparison of NHCS

We present a novel visual and qualitative bi-criteria method for comparing the dynamic responses of national healthcare systems (NHCSs) to the COVID-19 pandemic. To address the challenges inherent in such comparisons, we propose a conceptual model of an NHCS, illustrated in Fig. 2. This model encapsulates the fundamental structural components of the behavioural and ecological model of health services, positioning the NHCS as a core element of the national health system. It comprises four interdependent, integrated, and dynamic components: (1) governance of the national health system, (2) public health, (3) NHCS, and (4) outcome management (accountability).

Figure 2

Conceptual model of an NHCS in response to the COVID-19 pandemic.

The model posits that national governance, encompassing public health strategy, ecosystem, leadership, partnerships, investment, supply chain, and information management, defines public health policies, programs, and actions while managing funding, risk, and accountability62. Public health then formulates, adapts, and enforces policies to contain the virus, improve population immunization, and enhance testing and tracing efforts. Concurrently, national governance directs the NHCS’s preparedness and response to the pandemic.

The model further suggests that the complex interactions between social and environmental factors—such as predispositions, behaviours, and needs—and the virus’s spread influence the severity of cases managed by the NHCS. It builds on the premise that various public health factors moderate the impact of the virus on population health and access to healthcare resources63. In response to the COVID-19 outbreak, we conceptualize an NHCS as a collection of institutions tasked with restoring and preserving health. The primary capabilities of an NHCS in responding to COVID-19 include providing access to acute and long-term care and protecting frontline workers.

Our visual and qualitative bi-criteria method for comparing NHCS responses to the COVID-19 pandemic is specifically designed to address the unique challenges of such comparisons, including the lack of homogeneity among NHCS units, the multidimensional nature of performance metrics, and the need to capture performance dynamics over time. The NHCS is treated as a decision-making unit (DMU) in this analysis. Algorithm 1 integrates longitudinal DEA and fuzzy clustering to develop an objective and interpretable framework.

Algorithm 1

Qualitative bi-criteria approach.

Longitudinal data envelopment analysis

We use DEA to compare the efficiency of DMUs. DEA uses linear programming to construct an efficiency frontier by optimizing the weighted output-to-input ratio of each DMU64.

Definition of inputs: NHCS preparedness and COVID-19 cases

Resources available to the NHCS are the inputs. Resource shortages limit the ability of an NHCS to respond to the pandemic65,66. The global nature of the crisis means that “we are all in it together”, and access to multilateral support and assistance is unlikely. Therefore, we assume that the resource preparedness of an NHCS for the pandemic is a critical determinant of its performance. The recent literature on COVID-19 identifies five resource dimensions (\(X_1\) to \(X_5\)) of NHCS preparedness for the COVID-19 pandemic (see Table 1). The other input considered in this study is the longitudinal data on the cumulative number of confirmed cases per day for each country (\(X^t_6\)). A confirmed COVID-19 case is a person with laboratory confirmation of COVID-19 infection based on a validated and approved test performed in the community, at a hospital, or at a reference laboratory67.

Table 1 Input dimensions for longitudinal data envelopment analysis.
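As a minimal illustration of how these inputs can be assembled, the sketch below stacks the five static preparedness dimensions with the cumulative case count for a given day; the data-frame layout and column names are our assumptions, not the paper's data format.

```python
import pandas as pd

def build_input_matrix(preparedness, confirmed, day):
    """Assemble the DEA input matrix for day t.
    preparedness : DataFrame (countries x 5) with the static dimensions X1..X5
    confirmed    : DataFrame (days x countries) of cumulative confirmed cases
    Returns an array of shape (6 inputs, n countries)."""
    x6 = confirmed.iloc[day].rename("X6_confirmed")   # cumulative cases on day t
    X = pd.concat([preparedness, x6], axis=1)         # align on the country index
    return X.T.to_numpy()
```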

Definition of outputs: NHCS clinical outcomes

We consider two outputs: survival and recovery rates. The first longitudinal output \(y_1(DMU_i,t)\) is the infection recovery rate (IRR) at time t, which indicates what percentage of total diagnosed positive cases are cured of the disease (see Eq. 1).

$$\begin{aligned} y_1(DMU_i,t) = IRR(DMU_i,t) = \dfrac{(\text {Total number of recovered cases})^t_i}{(\text {Total number of confirmed cases})^t_i} \end{aligned}$$
(1)

for a country i at time t. The \(IRR(DMU_i,t)\) is an estimated metric because no country has succeeded in testing the entirety of its population.

We use the case fatality rate (CFR) at time t to estimate the death rate (see Eq. 2).

$$\begin{aligned} CFR(DMU_i, t) = \dfrac{(\text {Total number of deaths})^t_i}{(\text {Total number of confirmed cases})^t_i} \end{aligned}$$
(2)

for a country i at time t. We define the survival rate as the second longitudinal output \(y_2(DMU_i,t)\) (see Eq. 3).

$$\begin{aligned} y_2(DMU_i,t) = 1 - CFR(DMU_i,t) \end{aligned}$$
(3)

It is worth emphasizing that CFR does not involve a lag, as we are interested in the instantaneous response of an NHCS.
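For concreteness, a minimal sketch of how the two outputs follow from daily cumulative series is given below; the function and variable names are illustrative and assume pandas Series indexed by day, not the paper's actual data pipeline.

```python
import pandas as pd

def clinical_outputs(recovered, deaths, confirmed):
    """Compute the two longitudinal outputs of Eqs. (1)-(3) from daily
    cumulative series of recovered cases, deaths, and confirmed cases."""
    irr = recovered / confirmed            # Eq. (1): infection recovery rate y1
    cfr = deaths / confirmed               # Eq. (2): case fatality rate
    return pd.DataFrame({"y1_IRR": irr, "y2_survival": 1.0 - cfr})  # Eq. (3): y2 = 1 - CFR
```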

Longitudinal DEA

We focus on the response of the healthcare system to the outbreak of COVID-19. The inputs are defined by the preparedness dimensions and COVID-19-related demand for healthcare services, while the outputs are specified by the clinical outcomes. Given that the outputs do not necessarily change in a proportional way to the inputs, we assume variable returns to scale (VRS) technology43,72. However, the inputs and outputs vary over time during the pandemic. Therefore, we expect the efficiency of an NHCS to change over time. The longitudinal DEA is, therefore, introduced to capture these changes (as shown in Fig. 3). The efficiency frontier at time \(t=1\) is defined by NHCSs 1 to 6. At time \(t=2\), the efficiency frontier may change as NHCSs improve their institutional or technical performance, which explains the efficiency frontier shift.

Figure 3

Principle of longitudinal DEA.

Let \((X^t,Y^t)\) be two longitudinal matrices representing inputs and outputs, respectively. As the study’s objective revolves around the NHCS’s performance in using its resources, an input-oriented VRS model73 is adopted to measure the efficiency of each \(DMU_i\) for \(i=1,\ldots,37\) at time t. Given that we are dealing with ratio outputs, we adopt the model proposed by Emrouznejad et al.74 to take into account the ratio nature of the two outputs. The model is given in Eq. 4, with \(t=1,\ldots,T\). To preserve the discriminatory power of the DEA75, as explained in the Bi-criteria visual analysis of efficiency and performance of NHCS section, we ensure that the number of DMUs exceeds the number of inputs and outputs.

$$\begin{aligned}&\underset{\theta ^t,\lambda ^t}{\min }\ \theta ^t \\&\text {subject to} \\&\quad \theta ^t x^t_0 - X^t\lambda ^t \ge 0 \\&\quad N^t \lambda ^t - y^t_0\, D^t \lambda ^t \ge 0 \\&\quad e^T\lambda ^t = 1 \\&\quad \lambda ^t \ge 0 \end{aligned}$$
(4)

where \(N^t\) and \(D^t\) are the longitudinal matrices of the numerators and the denominator of the ratio outputs \(Y^t\) at time \(t=1,\ldots,T\), i.e., for each \(DMU_i\), \(i=1,\ldots,37\):

$$\begin{aligned} y_1(DMU_i,t) = \dfrac{(\text {Total number of recovered cases})^t_i}{(\text {Total number of confirmed cases})^t_i} = \dfrac{n_1(DMU_i,t)}{d(DMU_i,t)} \end{aligned}$$
(5)

and

$$\begin{aligned} y_2(DMU_i,t) = \dfrac{(\text {Total number of confirmed cases})^t_i - (\text {Total number of deaths})^t_i}{(\text {Total number of confirmed cases})^t_i} = \dfrac{n_2(DMU_i,t)}{d(DMU_i,t)} \end{aligned}$$
(6)
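A minimal sketch of how the input-oriented VRS model with ratio outputs in Eq. 4 can be solved as a linear program is given below; it uses SciPy's linprog, and the function and variable names are our assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog

def vrs_ratio_dea(X, N, D, dmu0):
    """Efficiency score of DMU `dmu0` under the input-oriented VRS model of Eq. 4.
    X : (m, n) inputs, one column per DMU
    N : (s, n) output numerators (recovered cases, survivors)
    D : (n,)   common output denominator (confirmed cases)"""
    m, n = X.shape
    s = N.shape[0]
    y0 = N[:, dmu0] / D[dmu0]                    # observed ratio outputs of DMU_0

    c = np.r_[1.0, np.zeros(n)]                  # minimise theta; z = [theta, lambda]

    A_in = np.hstack([-X[:, [dmu0]], X])         # X lambda - theta x_0 <= 0
    A_out = np.hstack([np.zeros((s, 1)),
                       y0[:, None] * D[None, :] - N])   # y_0 (D lambda) - N lambda <= 0
    A_ub = np.vstack([A_in, A_out])
    b_ub = np.zeros(m + s)

    A_eq = np.r_[0.0, np.ones(n)][None, :]       # VRS convexity: sum(lambda) = 1
    b_eq = np.array([1.0])

    bounds = [(0, None)] * (n + 1)               # theta and lambda non-negative
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[0]                              # efficiency score theta^t
```

Solving this program for every DMU and every day t (for example, `[vrs_ratio_dea(X_t, N_t, D_t, j) for j in range(n)]` inside a loop over t) would produce longitudinal efficiency paths of the kind summarized later in Fig. 10.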

Various extensions have been proposed to overcome the limitations of the original standard DEA models. For instance, Aleskerov and Petrushchen76 suggest that non-homogeneous DMUs may present challenges when interpreting results. One solution proposed in the literature is to use a combination of clustering and DEA to evaluate the efficiency and productivity of DMUs77. For example, Dai and Kuosmanen78 applied a combination of clustering and DEA to identify DMUs with similar characteristics, resulting in more efficient and meaningful results than traditional methods. Similarly, other studies have combined k-means clustering with DEA to evaluate public hospitals79 and applied a fuzzy clustering cooperative game with DEA to assess hospital efficiency80. Zarrin et al.81 used a combination of a self-organizing map artificial neural network, cluster analysis, and slacks-based measure DEA with a multilayer perceptron neural network to analyze the efficiency of over 1100 hospitals in Germany.

The assessment and comparison of NHCS performance over time is crucial in understanding how different countries are responding to the pandemic outbreak. While various metrics exist to assess the performance of DMUs as documented by82, the Malmquist index (M) offers distinct advantages. Unlike traditional metrics that might only compare a unit’s efficiency to an average performer, the M index leverages the concept of a production frontier to enable a more rigorous benchmark. Additionally, some metrics solely capture efficiency changes, overlooking advancements due to technological progress. The Malmquist index addresses this shortcoming by incorporating both technical change and efficiency change, as highlighted by83. Consequently, the Malmquist index provides a richer and more nuanced picture of performance change over time, making it a valuable tool for DEA researchers to dynamically evaluate DMUs. The conditions of the pandemic, combined with the adaptation to change, may result in performance gains or losses over time, depending on how NHCS adapt to various external and internal factors. To measure the change in productivity of each NHCS over time, we use the Malmquist productivity index M. This method compares the relative performance of an NHCS with respect to reference states from two different time periods. It also allows for decomposition of productivity change into institutional efficiency change and technical change84, making it particularly useful for analyzing healthcare performance over time85,86.

To analyze the change in performance between two periods, we denote by \(\theta ^{p,q}\) the efficiency score of the period-p observation evaluated against the frontier (state of technology) of period q. The Malmquist efficiency change (MTEC), which measures the catch-up effect between two consecutive periods, is given by Eq. (7). The Malmquist technical change (MTC), which measures the frontier-shift effect, or the change in the reference frontier between two consecutive periods, is given by Eq. (8). Using the period 2 observation as the reference benchmark to evaluate the shift in the frontier25, the Malmquist productivity index (M) can be calculated using Eq. (9).

$$\begin{aligned} MTEC = \theta ^{2,2}/\theta ^{1,1} \end{aligned}$$
(7)
$$\begin{aligned} MTC = \Bigg [\bigg (\frac{\theta ^{2,1}}{\theta ^{2,2}}\bigg )\bigg (\frac{\theta ^{1,1}}{\theta ^{1,2}}\bigg )\Bigg ]^{1/2} \end{aligned}$$
(8)
$$\begin{aligned} M = MTEC \times MTC = \bigg (\frac{\theta ^{2,2}}{\theta ^{1,1}}\bigg )\Bigg [\bigg (\frac{\theta ^{2,1}}{\theta ^{2,2}}\bigg )\bigg (\frac{\theta ^{1,1}}{\theta ^{1,2}}\bigg )\Bigg ]^{1/2} \end{aligned}$$
(9)
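A minimal sketch of the decomposition in Eqs. (7)–(9) is shown below, assuming the four cross-period DEA scores have already been computed; the function name and argument order are illustrative.

```python
def malmquist(theta11, theta12, theta21, theta22):
    """Malmquist productivity change between two consecutive periods.
    theta_pq is the DEA score of the period-p observation evaluated against
    the period-q frontier, so four DEA runs feed this decomposition."""
    mtec = theta22 / theta11                                   # Eq. (7): catch-up effect
    mtc = ((theta21 / theta22) * (theta11 / theta12)) ** 0.5   # Eq. (8): frontier shift
    return mtec * mtc, mtec, mtc                               # Eq. (9): M = MTEC * MTC
```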

Finally, we utilize the Malmquist productivity index M to assess and compare the performance of NHCSs over time. This approach captures the dynamics of the response to the pandemic in different countries and allows for an analysis of performance gains or losses over time. Additionally, we decompose the long-term productivity change into consecutive sub-periods using the transitivity property of the Malmquist index. This enables us to study the temporal performance and efficiency evolution of NHCSs in response to the COVID-19 outbreak through a combination of graphical analysis and the Malmquist method.

Fuzzy clustering: controlling for the population vulnerability

The vulnerability of a population to COVID-19 plays a significant role in determining the demand for national healthcare services. Countries with populations that are more vulnerable to the virus, such as those with a larger aging population, may require more resources to provide care and may experience higher mortality rates. To account for this, we use a clustering method to group countries with similar levels of population vulnerability to the pandemic. This allows us to control for the moderating effect of vulnerability on the demand for healthcare resources. The literature identifies 15 dimensions of vulnerability to COVID-19, which are detailed in Table 2.

Table 2 Dimensions of vulnerability of the population to COVID-19.

The Fuzzy J-Means (FJM) approach is a fuzzy clustering method used in this study to group NHCSs into homogeneous clusters based on the vulnerability of their respective populations to COVID-19. FJM is a variant of the popular fuzzy clustering method, Fuzzy C-Means, and addresses the limitations of traditional clustering methods by allowing for complex relationships between entities and reducing the influence of data noise. FJM explores all possible centroid-to-pattern relocations in order to construct move-defined neighborhoods. This allows for a more efficient optimization of the membership degrees and centroids, resulting in better solutions compared to traditional hard clustering methods such as k-means96. The FJM method has been shown to provide better results in multiple machine learning and pattern recognition applications97,98,99. The FJM algorithm used in this study is summarized in Algorithm 296,97, and the Python code is available on GitHub at https://github.com/nbelacel/FuzzyJMeansVNS. The output of the FJM method is a matrix of membership degrees that describes the level of similarity between each NHCS and each cluster centroid, allowing NHCSs to be assigned to multiple clusters simultaneously with different membership degrees.

Algorithm 2

The fuzzy clustering algorithm FJM.
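The FJM implementation itself is available at the GitHub link above. As a minimal, self-contained illustration of the membership-degree idea it builds on, the sketch below implements plain fuzzy c-means (not the FJM relocation heuristic); the names and the standardization of the 15 vulnerability dimensions are our assumptions.

```python
import numpy as np

def fuzzy_cmeans(X, k=5, m=2.0, n_iter=100, seed=0):
    """Plain fuzzy c-means on standardized vulnerability profiles.
    X : (n_countries, n_dims). Returns membership matrix U (rows sum to 1)
    and cluster centroids C."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.dirichlet(np.ones(k), size=n)              # initial membership degrees
    for _ in range(n_iter):
        W = U ** m
        C = (W.T @ X) / W.sum(axis=0)[:, None]         # fuzzy-weighted centroids
        dist = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
        U = 1.0 / dist ** (2.0 / (m - 1.0))            # closer centroid -> higher degree
        U /= U.sum(axis=1, keepdims=True)              # normalize rows to sum to 1
    return U, C
```

Each row of U corresponds to the membership degrees of one country across the k clusters, analogous to those reported in Table S6 of the supplementary document.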

Bi-criteria visual analysis of efficiency and performance of NHCS

The performance of an NHCS in response to the COVID-19 outbreak is conceptualized based on the clinical outcomes: maximizing the COVID-19 recovery rate and minimizing the COVID-19 death rate. The efficiency is measured using the DEA method. Together, these two categories define the qualitative multiple criteria map shown in Fig. 4. Thus, four cases can be considered for \(DMU_i\). First, a \(DMU_i\) is considered an efficient-performer if it outperforms similar countries in terms of clinical outcomes and is efficient according to its DEA score at time t. Second, a \(DMU_i\) is considered an efficient-under-performer if it under-performs similar countries in terms of clinical outcomes but is efficient at time t. Third, a \(DMU_i\) is considered an inefficient-performer if it outperforms similar countries in terms of clinical outcomes but is inefficient at time t. Finally, a \(DMU_i\) is considered an inefficient-under-performer if it underperforms similar countries in terms of clinical outcomes and is inefficient at time t. We use visualization techniques to compare the performance and efficiency of different NHCSs in response to the COVID-19 outbreak. To further clarify, we establish the following qualitative categories of performance (see Figs. 10, 11, 12, 13):

$$\begin{aligned} \text {if } {\left\{ \begin{array}{ll} IRR(DMU_i,t) \ge IRR(V_c,t)\\ \text { and }\\ CFR(DMU_i,t) \le CFR(V_c,t) \end{array}\right. } \Rightarrow \text {Performer } DMU_i \end{aligned}$$
(10)
$$\begin{aligned} \text {if } {\left\{ \begin{array}{ll} IRR(DMU_i,t) < IRR(V_c,t) \\ \text { and }\\ CFR(DMU_i,t) > CFR(V_c, t) \end{array}\right. } \Rightarrow \text {Under-performer } DMU_i \end{aligned}$$
(11)
$$\begin{aligned} \text {if } {\left\{ \begin{array}{ll} IRR(DMU_i,t) < IRR(V_c,t) \\ \text { and }\\ CFR(DMU_i,t) \le CFR(V_c, t) \end{array}\right. } \Rightarrow \text {Quasi-performer } DMU_i \end{aligned}$$
(12)
$$\begin{aligned} \text {if } {\left\{ \begin{array}{ll} IRR(DMU_i,t) \ge IRR(V_c,t)\\ \text { and }\\ CFR(DMU_i,t) > CFR(V_c, t) \end{array}\right. } \Rightarrow \text {Pseudo-performer } DMU_i \end{aligned}$$
(13)

where \(IRR(V_c,t)\) and \(CFR(V_c, t)\) are the average recovery and death rates at time t for the respective cluster \(V_c\) of similar countries. Similarly, we define the efficiency of \(DMU_i\) at time t based on DEA outputs as follows25:

$$\begin{aligned}&DMU_i \text { is DEA efficient at time } t \Leftrightarrow \theta ^t(DMU_i) = 1 \text { and } \lambda ^t(DMU_i) = 0 \text { for all inputs} \\&DMU_i \text { is DEA inefficient at time } t \Leftrightarrow \theta ^t(DMU_i) < 1 \\&DMU_i \text { is DEA weakly efficient at time } t \Leftrightarrow \theta ^t(DMU_i) = 1 \text { and } \lambda ^t(DMU_i) \ne 0 \text { for some inputs}. \end{aligned}$$
(14)

Slacks \(\lambda ^t(DMU_i)\) and efficiency \(\theta ^t(DMU_i)\) are obtained by solving Eq. 4 at each time period t.
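Putting the clinical categories of Eqs. (10)–(13) together with the DEA efficiency test gives the combined labels used in the visual matrix of Fig. 4; a minimal sketch follows, with illustrative names and a numerical tolerance of our own choosing.

```python
def classify_nhcs(irr_i, cfr_i, irr_cluster, cfr_cluster, theta_i, tol=1e-9):
    """Combined bi-criteria label for one DMU at time t.
    irr_cluster, cfr_cluster : average recovery and death rates of the DMU's cluster
    theta_i : DEA efficiency score from Eq. 4."""
    if irr_i >= irr_cluster and cfr_i <= cfr_cluster:
        clinical = "performer"                 # Eq. (10)
    elif irr_i < irr_cluster and cfr_i > cfr_cluster:
        clinical = "under-performer"           # Eq. (11)
    elif irr_i < irr_cluster and cfr_i <= cfr_cluster:
        clinical = "quasi-performer"           # Eq. (12)
    else:
        clinical = "pseudo-performer"          # Eq. (13)
    efficiency = "efficient" if theta_i >= 1.0 - tol else "inefficient"
    return f"{efficiency} {clinical}"
```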

Figure 4

Qualitative bi-criteria visual matrix.

Results

Experimentation platform

The experimentation platform and collaborative environment used in this study were set up on the cloud, with resources provided by WestGrid and Compute Canada under the COVID-19 resource allocation fast track. A total of 80 virtual CPUs, 300 GB of RAM, and 10 TB of persistent storage were allocated to us on the Arbutus cloud. The platform includes an online operational research toolbox that can run various data analytic algorithms, accessible at https://heacovid.uvic.ca/explorer/. Additionally, the platform is designed to be flexible for different levels of analysis (e.g., micro and meso) and can be extended to other jurisdictions.

Data collection and descriptive statistics

We collected, consolidated, and combined data from multiple trusted secondary data sources, including curated online databases. The data are structured according to three categories: preparedness for the pandemic, population vulnerability to COVID-19, and clinical outcomes. We retrieved data related to preparedness from the World Bank database100, data on the number of intensive care unit (ICU) beds from Statista101, and healthcare expenditure per capita from the OECD dataset9. We gathered data on the number of people suffering from asthma and chronic obstructive pulmonary disease from the Lancet Respiratory Medicine journal102, data on diabetes, smoking, and cancer from the World Population Review103, data on cardiac disease, vascular disease, obesity, and hypertension from the European Heart Network website104, and data on chronic kidney disease from105. Urban density and statistics on the population above 65 years of age were collected from the World Bank database106 and the Our World in Data website107, respectively. Additional data on preparedness and population vulnerability were collected from108,109 and110. The COVID-19-related data were collected from Harvard University’s Dataverse111 and the WHO’s “Global research on coronavirus disease” database112. Finally, we used the United Nations Educational, Scientific and Cultural Organization (UNESCO) database to retrieve countries’ demographic and contextual data113.

In this study, we used more than 23,000 data points collected from various trusted data sources. Descriptive statistics for the 5 preparedness and 15 vulnerability dimensions of the 37 OECD countries are presented in Table 3. We collected time series data on confirmed COVID-19 cases, deaths, and recoveries for each OECD country from January 14, 2020 to July 11, 2020. The first few days of each time series showed high variability due to the low number of reported cases. We therefore truncated each country’s time series by removing the initial days with fewer than 100 cumulative confirmed cases, so that day 1 of each series is the first day with at least 100 confirmed cases. Consequently, day 1 of the pandemic is defined consistently for all OECD countries. Additional descriptive statistics are provided in Tables S1–S5 in the supplementary document.
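A minimal sketch of this alignment rule, assuming each country's cumulative confirmed cases are stored as a pandas Series indexed by date, is shown below; the names are illustrative.

```python
import pandas as pd

def align_to_outbreak(cumulative_cases: pd.Series, threshold: int = 100) -> pd.Series:
    """Truncate a country's cumulative confirmed-case series so that day 1 is
    the first day with at least `threshold` cases, making day counts
    comparable across countries."""
    start = cumulative_cases[cumulative_cases >= threshold].index.min()
    aligned = cumulative_cases.loc[start:].reset_index(drop=True)
    aligned.index = aligned.index + 1      # day 1, 2, 3, ... since the 100th case
    return aligned
```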

Table 3 Descriptive statistics for the preparedness and population vulnerability to COVID-19 dimensions in the OECD countries (per 100,000 people).

Clustering countries in homogeneous groups with similar vulnerability to COVID-19

We use a clustering method called Fuzzy J-Means (FJM) to classify national healthcare systems (NHCSs) into homogeneous groups based on their population’s vulnerability to COVID-19 (see Table 2). The FJM method, like other clustering techniques, requires the number of clusters k to be pre-specified (see the Fuzzy clustering: controlling for the population vulnerability section). As such, the performance of the clustering method may be affected by the chosen value of k. In this study, we employed an objective evaluation measure based on the objective function of the fuzzy clustering problem96. After several runs, we found that \(k=5\) was the most relevant number of clusters. The FJM algorithm then produced the final clustering results for the OECD countries, as summarized in Table 4. Detailed computations of the membership degrees of each country to each cluster can be found in Table S6 in the supplementary document.
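For illustration, such an evaluation measure can be computed as the standard fuzzy clustering objective sketched below; this reuses the `fuzzy_cmeans` helper from the Methodology section, and the commented selection loop is an assumption about how candidate values of k could be compared, not the authors' exact procedure.

```python
import numpy as np

def fuzzy_objective(X, U, C, m=2.0):
    """Fuzzy clustering objective J_m = sum_ij u_ij^m * ||x_i - c_j||^2,
    used here to compare clusterings obtained for different values of k."""
    dist2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return float(((U ** m) * dist2).sum())

# Illustrative comparison of candidate cluster counts (elbow-style inspection):
# scores = {k: fuzzy_objective(X, *fuzzy_cmeans(X, k=k)) for k in range(2, 9)}
```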

Table 4 Clustering results of countries based on their population vulnerabilities.

Figure 5 presents a visualization of the population vulnerability profiles of each country in each cluster based on the 15 dimensions. This figure serves as a graphical validation of the results obtained from the fuzzy clustering approach. It is evident that clusters 2 to 5 consist of countries with similar vulnerability profiles, while Chile and Japan (cluster 1) are outliers. As these two countries are not representative of the other clusters, they are not included in the further analysis to ensure the validity of any efficiency and performance analysis.

Figure 5

Mapping of population vulnerability profiles for each cluster.

It is also noteworthy that, besides the FJM algorithm, several other clustering algorithms were tested on our experimentation platform, including k-means and hierarchical clustering. Ultimately, we found that the fuzzy FJM algorithm provided the most accurate results.

Clinical performances of OECD countries’ NHCS in response to COVID-19

We used the clustering results to compare the clinical performance of NHCSs within the same group. Figures 6, 7, 8, and 9 compare countries’ clinical performance within each cluster after 25 and 111 days of the outbreak (i.e., after registering more than 100 cumulative confirmed cases). The pre-clustering figures are provided in Appendix B. The yellow dashed lines represent the average daily death and recovery rates for each cluster. For example, New Zealand is clearly a strong performer among its peers in cluster 2 after 111 days, while Spain is an under-performer. Germany is a leading performer in cluster 3, while the USA is a quasi-performer. Both France and Italy under-perform their peers in cluster 4. Canada is an under-performer among its peers in cluster 5, while South Korea is a clear leader in the same group (Fig. 9).

Figure 6

Clinical performance comparison among cluster 2 countries after 25 and 111 days of the pandemic.

Figure 7

Clinical performance comparison among cluster 3 countries after 25 and 111 days of the pandemic.

Figure 8

Clinical performance comparison among cluster 4 countries after 25 and 111 days of the pandemic.

Figure 9

Clinical performance comparison among cluster 5 countries after 25 and 111 days of the pandemic.

Longitudinal efficiency of NHCS responses to the outbreak of the COVID-19 pandemic

The results of our analysis using the DEA model for each cluster are shown in Fig. 10. This model compares the relative efficiency of countries in using their inputs to achieve outputs (for more details on DEA models, see the Longitudinal data envelopment analysis section). In the first 100 days of the pandemic, the efficiency index for each country ranges from 0.30 to 1 (100%). With some exceptions, most countries generally improved their efficiency, though it remained low overall. Differences in efficiency averages between clusters are illustrated in Fig. 11. Clusters 2 and 5 have similar averages of 0.76 and 0.77, respectively, showing a loss of efficiency in the first 20 days of the pandemic, followed by gradual improvement. Clusters 3 and 4 have lower averages of 0.68 and 0.70, respectively, showing a loss of efficiency for over 40 days before improving.

Figure 10

DEA longitudinal results for clusters 2 to 5.

Among the countries, New Zealand and South Korea stand out as leaders, demonstrating exemplary clinical outcomes and efficiency. Their success serves as a potential model for other countries to learn from. Spain, despite reaching the efficiency frontier early, maintained a stable efficiency slightly above 0.5. In cluster 3, most countries struggled with unstable and low-efficiency levels, ranging between 0.34 and 1. The USA, initially on the efficiency frontier, experienced a drop to 0.63. Germany, despite its clinically solid performance, had an efficiency level of 1 for 14 days, then dropped to only 0.34 and then gradually improved, with an average of 0.66 due to under-utilized resources. France and Italy in cluster 4 had similar behaviours, achieving average efficiencies of 0.55 and 0.53, respectively. South Korea in cluster 5, however, reached the optimal efficiency frontier after 26 days, indicating effective use of NHCS resources. Canada, a consistent performer only at the beginning of the pandemic, achieved efficiency periodically and maintained stable efficiency levels of around 0.79 throughout the pandemic, with an average efficiency of 0.82.

Figure 11

Longitudinal average efficiency index for each cluster.

The Malmquist index captures the changes in efficiency over time, as shown in Fig. 12. A value higher than 1 for the Malmquist index at time t indicates increased productivity. In contrast, a value less than 1 implies a decline in productivity compared to the previous period (\(t-1\)). Spain and Italy exhibit similar patterns, with significant changes in productivity, likely due to being among the first countries to experience a high number of cases and not being fully prepared with the necessary resources to handle the pandemic. The Malmquist index exceeded 14 for Spain and reached 2.75 for Italy during the first fourteen days. This variation is primarily driven by an increase in the Malmquist institutional efficiency change (MTEC) index during this period, which measures the catch-up effect between two consecutive periods, (\(t-1\)) and t. Conversely, the drop in productivity on day seven and the subsequent increase on day eight are attributed to fluctuations in the Malmquist technical change (MTC) index, which measures the frontier-shift effect or the change in the reference frontier between two consecutive periods (\(t-1\)) and t, as shown in Fig. 13.

Figure 12

Malmquist index for each cluster.

Two weeks after the detection of the 100th case, productivity levels stabilized. Notably, France’s productivity variation differs from Italy’s and Spain’s, remaining flat at around 1, indicating no change in the system’s productivity. The productivity levels of New Zealand and South Korea also show noteworthy behaviour. South Korea had the smallest fluctuation in productivity, which may be due to its ability to manage and control the pandemic effectively. The USA’s productivity averaged around 0.93, mainly due to a drop in the MTC (see Fig. 13). Canada’s Malmquist index stabilized at 0.98; however, unlike the USA, whose productivity peaked at 1.05, Canada’s productivity reached as high as 2.24 before quickly dropping.

Figure 13

Malmquist index decomposition for selected countries.

In some cases, productivity variation is due to data reporting inconsistencies rather than objective factors (e.g., most of the productivity gain in Mexico is caused by fluctuations in the number of reported cases). It is important to note that our analysis is based on publicly available data, which may have limitations in terms of accuracy and consistency. Moreover, the Malmquist index is a complex metric that may not capture all aspects of productivity change. These factors should be considered when interpreting our findings.

Bi-criteria comparison of NHCS responses to the outbreak of the COVID-19 pandemic over time

The results of the qualitative multi-criteria analysis for each country are shown in Fig. 14 (see the Bi-criteria visual analysis of efficiency and performance of NHCS section and Algorithm 1). Figure 14 presents the efficiency-performance paths for selected countries from each cluster: New Zealand and Spain from Cluster 2, Germany and the USA from Cluster 3, Italy and France from Cluster 4, and Canada and South Korea from Cluster 5.

Figure 14

Qualitative analysis for selected countries: (a) New Zealand’s NHCS maintained good performance over time while Spain’s was an under-performer in cluster 2, (b) Germany’s NHCS outperforms the USA’s in cluster 3, (c) Italy’s and France’s NHCSs seem to follow similar patterns in cluster 4, and (d) South Korea’s NHCS seems to outperform Canada’s in cluster 5.

Analyzing Canada’s performance, we observe significant fluctuations. On day 1 of the pandemic, Canada effectively led the pack but quickly lost its edge. By day 13, it was considered a pseudo-performer among its peers in cluster 5 while simultaneously being on the efficiency frontier. However, by day 14, Canada had become an inefficient performer. On day 19, it regained its status as an efficient pseudo-performer in cluster 5 but lost this advantage by day 25, becoming an inefficient under-performer. This decline could be attributed to an increase in fatalities in long-term care homes. On the other hand, Germany began as an efficient under-performer and, by day 16 of the pandemic, had moved to the category of inefficient performers within its cluster. New Zealand consistently outperformed its peers, suggesting it should be studied for best practices. Additionally, Figure S1 in the supplementary document illustrates the dynamic nature of NHCS responses to the pandemic over time in each country. For more detailed information, readers are referred to the web platform (https://heacovid.uvic.ca/explorer/) to explore the qualitative multi-criteria map for each country and cluster.

Discussion and managerial implications

In this study, we present a novel approach to comparing national healthcare systems’ (NHCS) responses to the COVID-19 pandemic by clearly defining the boundaries of the unit of analysis, which in our case refers to the entire healthcare system of a country. Our proposed model highlights the essential capabilities for an effective response: providing access to acute and long-term care and protecting front-line workers.

We introduce a bi-criteria algorithm that integrates longitudinal DEA and fuzzy clustering, a method that allows for the classification of data points into clusters based on their similarity. This algorithm evaluates NHCS performance based on two critical criteria: efficiency, which measures the system’s ability to deliver healthcare services with minimal resource wastage, and clinical outcomes, which assess the system’s effectiveness in improving patient health. The study focuses on OECD countries due to the availability of reliable data, but the proposed methodology can be extended to other regions as data quality improves.

Our validation using OECD data revealed significant variability in NHCS responses, with several lessons emerging for improving future pandemic responses. The analysis confirms that OECD countries exhibited varied performance levels during the COVID-19 outbreak. Our ‘mixed methodology’, which combines quantitative and qualitative approaches, enables meaningful comparisons among healthcare systems serving similar populations by controlling for population vulnerability. Exemplary performers included New Zealand, Germany, Austria, and South Korea.

The study’s results indicate that not all countries reached the relative efficiency frontier within the first 100 days of the pandemic, and countries experienced different performance dynamics. The combined efficiency and productivity variation analyses suggest that many healthcare systems did not utilize their resources optimally. Productivity gains were primarily driven by increased cases rather than efficient resource use, pointing to initial under-utilization. The premature closure of non-urgent healthcare services exacerbated inefficiencies and created backlogs, highlighting the need for more adaptive strategies.

Our findings also underscore disparities in protecting frontline workers. Effective measures in countries like South Korea, Japan, and Taiwan contrast sharply with struggles in the United States and the United Kingdom. This protection disparity underscores the critical importance of safeguarding healthcare workers during crises.

Regarding acute and long-term care, Germany, Austria, and New Zealand demonstrated effective responses, expanding intensive care units, increasing hospital bed capacity, and ensuring adequate staffing. These measures facilitated better patient influx management and underscored the importance of scalable healthcare infrastructure.

The study’s findings emphasize the necessity of rethinking pandemic response strategies and redesigning healthcare service delivery in alignment with public health policies. They also point to a clear opportunity for improvement: by implementing the recommendations outlined in this study, healthcare systems can enhance their preparedness and response to future pandemics. For instance, adaptive strategies could include establishing flexible healthcare facilities that can be quickly converted into isolation wards during a pandemic, or developing dynamic staffing plans that can be adjusted based on the severity of the outbreak.

We propose the following critical managerial implications:

  • Healthcare systems should adopt dynamic and collaborative forecasting and planning approaches with rolling planning horizons to anticipate and respond to changing demands effectively.

  • Efficient resource utilization is crucial. Systems should avoid abrupt service cancellations and adopt ’phased approaches’ to manage capacity without compromising other healthcare services. For example, hospitals could prioritize urgent surgeries and procedures during the pandemic’s peak while gradually resuming non-urgent services as the situation improves.

  • Protecting front-line workers should be a priority, with adequate personal protective equipment (PPE), mental health support, and clear protocols to ensure their safety and well-being. Several countries registered high numbers of fatalities in their frontline work force.

  • Investment in scalable healthcare infrastructure, which refers to the ability of a healthcare system to rapidly expand its capacity in response to increased demand, including ICU capacity, hospital beds, and staffing, is essential to effectively managing patient surges during pandemics.

  • Aligning healthcare delivery with public health policies ensures cohesive and coordinated responses. This includes integrating public health strategies with healthcare operations to manage outbreaks comprehensively.

Conclusion

Our study introduces a novel approach, a bi-criteria framework, to understand the performance dynamics of national healthcare systems (NHCSs) during the COVID-19 pandemic. This framework, which integrates DEA and fuzzy clustering, provides a comprehensive and dynamic assessment of NHCS performance. By applying this methodology, we uncover significant variability in NHCS responses among OECD countries and identify key factors influencing efficiency and performance.

Our findings carry significant practical implications, revealing that not all countries reached the relative efficiency frontier within the first 100 days of the pandemic. Many experienced fluctuating performance dynamics. Exemplary performers such as New Zealand, Germany, Austria, and South Korea demonstrated effective resource utilization and strong clinical outcomes. Conversely, countries like Spain and Italy exhibited significant initial productivity gains followed by declines, underscoring the challenges of sustaining effective responses. The disparities in protecting frontline workers and managing acute and long-term care highlight critical areas for improvement.

Implementing adaptive strategies, efficient resource utilization, data-driven decision-making, international collaboration, targeted support for vulnerable populations, and continuous training will enhance the resilience and effectiveness of healthcare systems worldwide. These recommendations can guide future research and policy decisions.

While our bi-criteria framework effectively addresses the research questions and controls for moderating factors such as population vulnerabilities, there are inherent limitations. The robustness of the results heavily depends on the quality and consistency of data from various sources, which can vary significantly. Additionally, the reliance on ratios rather than absolute values in our DEA model was driven by data availability constraints. Although ratios facilitate comparisons across different health systems, they can obscure scale and absolute performance differences, potentially affecting the precision of efficiency assessments.

Future research should prioritize obtaining comprehensive absolute value data to enhance the robustness and accuracy of DEA models, enabling more direct and meaningful comparisons. Refining analytical methods, incorporating a more comprehensive range of criteria, and exploring advanced machine-learning techniques will improve predictive accuracy and robustness. Expanding the scope to include non-OECD countries will enable broader applicability and more comprehensive comparisons.

Despite these limitations, our research confirms that some NHCSs outperform others, highlighting the need for continuous evaluation and improvement of pandemic response strategies. This process demands ongoing commitment and collaboration among policymakers, healthcare managers, and researchers. By addressing these limitations and building on the findings of this study, future research can contribute significantly to the development of more resilient and effective healthcare systems.