Demand for pulse oximetry, a method of determining the oxygen saturation (SpO2) of the blood, has been unprecedented during the COVID-19 pandemic, where it aids medical decision-making. Pulse oximetry data are also widely used for medical research and algorithm development. Measuring SpO2 involves shining light onto tissue at two separate wavelengths and deriving the oxygen saturation from the relative changes in light absorption with each heartbeat. This pulsatile component is independent of skin pigmentation; however, other factors, such as the specific properties of the light source and the algorithms used by the product manufacturer, can produce variations that depend on skin pigmentation1.
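The two-wavelength principle described above is often summarized as a "ratio of ratios". The sketch below is illustrative only: the function names are hypothetical, and the linear calibration SpO2 ≈ 110 − 25R is a commonly quoted textbook approximation, not a formula taken from this article or from any manufacturer's device. Commercial oximeters rely on empirically derived calibration curves.

```python
def ratio_of_ratios(ac_red, dc_red, ac_ir, dc_ir):
    """Normalize the pulsatile (AC) part of each wavelength by its
    steady (DC) part, then take the red-to-infrared ratio."""
    return (ac_red / dc_red) / (ac_ir / dc_ir)


def spo2_linear(r, a=110.0, b=25.0):
    """ASSUMPTION: textbook linear approximation SpO2 = a - b * R.
    Real devices use proprietary, empirically fitted calibration curves."""
    return max(0.0, min(100.0, a - b * r))


# Hypothetical signal amplitudes, chosen only to exercise the functions.
r = ratio_of_ratios(ac_red=0.01, dc_red=1.0, ac_ir=0.02, dc_ir=1.0)
estimate = spo2_linear(r)
print(f"R = {r:.2f}, estimated SpO2 = {estimate:.1f}%")
```

Because the calibration constants are fitted to a reference population, the make-up of that population is one place where bias can enter such an estimate.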

A recent study revealed the clinical importance of racial disparities in pulse oximetry readings2. Specifically, when compared to measurements of arterial oxygen saturation, the pulse oximetry algorithms in these devices were found to produce systematically higher saturation values in Black patients than in white patients. Such systematic racial biases could adversely affect clinical decision-making, such as triage for supplemental oxygen, because the pulse oximetry readings of Black patients appear artificially high. Such inaccuracies may disproportionately increase the risk of unrecognized low oxygen saturation levels in Black patients under certain circumstances, for example in people with COVID-193.

Recent advancements in artificial intelligence (AI) have relied on using public databases to undertake feature extraction with pulse oximetry signals to assess hypertension4, estimate lung function5, and validate algorithms developed for monitoring patients with COVID-196. To prevent potential disparities in the calibration and accuracy of pulse oximetry devices and their algorithms, the pulse oximeter signals within such public databases need to be representative of the diverse populations on which these devices are used.

Since inaccuracies in pulse oximetry readings have been attributed to differences in skin pigmentation and skin pigmentation varies with race and ethnicity, it is essential to clarify these terms as used in this article. We have predominantly chosen to use the term ethnicity because that is the specific term used in the public datasets and is also the term that is identified by the patients themselves. For the purpose of discussing health disparities, while both race and ethnicity are social constructs, ethnicity has emerged as the preferred term since it encompasses cultural aspects of social identity7 that extend beyond the more simplistic view of race that is primarily based on shared skin pigmentation or physical characteristics8. While members of a given ethnicity can express a range of skin pigmentation, it is generally agreed that patients who self-identify as Black tend to have darker skin pigmentation than other ethnic groups.

To investigate the proportion of individual ethnicities represented in publicly available pulse oximetry databases, we conducted a comprehensive assessment of accessible databases published between 1 January 2012 and 1 January 2022. We searched PubMed using a query consisting of Medical Subject Headings (MeSH) terms and Title/Abstract keywords. Applying the inclusion and exclusion criteria defined in Fig. 1 resulted in 12 research articles describing 12 publicly available datasets used to assess different medical conditions with pulse oximeter data.

Fig. 1: Flow chart of study identification, inclusion and exclusion criteria.

Presentation of the literature search and selection procedure together with the numbers of records at each stage sourced from PubMed from 1 January 2012 to 1 January 2022 using the following keywords: ((“critical care”[MeSH Terms] OR (“critical”[All Fields] AND “care”[All Fields]) OR “critical care”[All Fields] OR (“oximetry”[MeSH Terms] OR “oximetry”[All Fields] OR (“oxygen”[All Fields] AND “saturation”[All Fields]) OR “oxygen saturation”[All Fields]) OR (“oximetry”[MeSH Terms] OR “oximetry”[All Fields] OR (“o2”[All Fields] AND “saturation”[All Fields]) OR “o2 saturation”[All Fields]) OR “PPG”[All Fields] OR (“photoplethysmogram”[All Fields] OR “photoplethysmograms”[All Fields]) OR (“photoplethysmography”[MeSH Terms] OR “photoplethysmography”[All Fields])) AND ((“publicly”[All Fields] AND (“availabilities”[All Fields] OR “availability”[All Fields] OR “available”[All Fields])) OR (“freely”[All Fields] AND (“access”[All Fields] OR “accessed”[All Fields] OR “accesses”[All Fields] OR “accessibilities”[All Fields] OR “accessibility”[All Fields] OR “accessible”[All Fields] OR “accessing”[All Fields])))) AND ((humans[Filter]) AND (english[Filter])). Papers that met the inclusion criteria, discussing the development and publication of original pulse oximetry datasets, were selected for analysis. This resulted in 12 research papers representing 12 publicly available datasets. Here n refers to the number of studies, where m refers to the number of publicly available databases.
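As a rough illustration of how a date-restricted PubMed search like the one above could be reproduced programmatically, the sketch below builds a request URL for the NCBI E-utilities ESearch endpoint. It makes no network call. The helper name `build_esearch_url` is hypothetical, the query is an abbreviated stand-in for the full search string, and it is an assumption that the web-interface filter syntax (for example `humans[Filter]`) behaves identically through E-utilities.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, mindate="2012/01/01", maxdate="2022/01/01", retmax=200):
    """Build (but do not send) an ESearch request restricted by publication date."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",   # filter on publication date
        "mindate": mindate,
        "maxdate": maxdate,
        "retmode": "json",
        "retmax": retmax,
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# ABBREVIATED stand-in for the full query quoted in the caption above.
QUERY = (
    '("oximetry"[MeSH Terms] OR "photoplethysmography"[MeSH Terms]) '
    'AND ("publicly"[All Fields] AND "available"[All Fields]) '
    "AND humans[Filter] AND english[Filter]"
)

url = build_esearch_url(QUERY)
print(url)
```

The resulting URL can be fetched with any HTTP client. Record counts obtained this way may differ from those in Fig. 1, since PubMed indexing changes over time.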

In total, as of 28 January 2022, these databases have accumulated 6214 citations according to Google Scholar, including 3544 citations for Medical Information Mart for Intensive Care (MIMIC III)9; 1049 citations for MIMIC II10; 531 citations for IEEEPPG Dataset11; 243 citations for Multiparameter Intelligent Monitoring in Intensive Care I (MIMIC I)12; 239 citations for WESAD13; 215 citations for Vortal Dataset14; 102 citations for the CapnoBase Dataset15; 87 citations for the University of Queensland Vital Signs Dataset16; 86 citations for PPG-DaLiA17; 63 citations for PPG-BP Dataset18; 50 citations for Wrist PPG Signals Recorded during Exercise19; and 5 citations for Medical Information Mart for Intensive Care IV (MIMIC-IV)20. We evaluated the existence of potential disparities in ethnicity based on the existing patient records as reported in the publicly available databases. In the absence of such information, the numbers of subjects in each category were inferred and quantified based on the locations of the authors’ research institutions or where the data were collected, as shown in Table 1.

Table 1 Summary of all the 12 publicly available datasets.

To avoid any uncertainty in the results of ethnic disparity analysis for a given population, databases with inferred ethnicity information were excluded from the statistical analysis. Four databases for which data for ethnicity was clearly stated, MIMIC, MIMIC-II, MIMIC-III and MIMIC-IV, were included in the statistical analysis. The distribution of ethnic groups in the four databases is shown in Fig. 2.

Fig. 2: Box plots of the ethnic makeup by proportion in all databases.

This figure combines the data from the four publicly available pulse oximeter databases that clearly reported the distribution of ethnic groups. The data support the hypothesis that disparities exist in these databases. Significant differences are evident between white and Black (p < 0.0001), white and Asian (p < 0.0001), and Black and Asian populations (p = 0.021). All pairs of groups were tested using a simultaneous pairwise Tukey test. The bottom and the top of the box are the 25th and 75th percentiles, and the line inside the box is the 50th percentile (median). Whiskers from minimum to maximum are determined with a 95% confidence interval.

We tested for statistical significance among all the subjects in the four databases, considering a p-value <0.05 as statistically significant. We analyzed the variance using a one-way ANOVA followed by a post hoc test for simultaneous pairwise comparisons using Tukey’s honest significant difference criterion. The results indicated a significant difference between the mean distributions of all racial groups: Asian and Black (p = 0.021), Asian and white (p = 4.10 × 10^−14), and Black and white (p = 5.01 × 10^−13). The same trend was observed between Other and Asian (p = 9.43 × 10^−5), Other and Black (p = 0.026), and Other and white (p = 4.82 × 10^−12). The results also suggested a higher proportion of white subjects compared to Asian, Black and other populations. These results demonstrate the existence of clear disparities in these key databases. Detailed results on the statistical separability tests for all pairs of demographic groups are provided in Table 2.
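The analysis pipeline described above (one-way ANOVA followed by Tukey's honest significant difference test) can be sketched with SciPy as follows. This is a minimal illustration, not the authors' code: the proportions are hypothetical placeholders with one value per database, not the values from Table 1, and `scipy.stats.tukey_hsd` requires a reasonably recent SciPy release.

```python
from scipy.stats import f_oneway, tukey_hsd

# HYPOTHETICAL placeholder proportions, one value per database.
# These are NOT the figures reported in Table 1.
proportions = {
    "white": [0.70, 0.72, 0.74, 0.76],
    "Black": [0.08, 0.09, 0.10, 0.11],
    "Asian": [0.02, 0.02, 0.03, 0.03],
    "Other": [0.04, 0.05, 0.05, 0.06],
}
groups = list(proportions)
samples = [proportions[g] for g in groups]

ALPHA = 0.05

# Step 1: one-way ANOVA tests whether any group mean differs.
anova = f_oneway(*samples)

# Step 2: Tukey's HSD gives simultaneous pairwise comparisons.
tukey = tukey_hsd(*samples)

significant_pairs = []
for i in range(len(groups)):
    for j in range(i + 1, len(groups)):
        p = tukey.pvalue[i, j]
        print(f"{groups[i]} vs {groups[j]}: p = {p:.3g}")
        if p < ALPHA:
            significant_pairs.append((groups[i], groups[j]))

print(f"ANOVA p = {anova.pvalue:.3g}")
```

With only four observations per group, such a test has limited power, so the pairwise p-values should be read alongside the raw proportions.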

Table 2 Tukey simultaneous tests for differences of means.

In the remaining databases in which ethnicity was not explicitly stated, the ethnic disparity is not known. However, if we examine the demographic statistics of each dataset, based on location, we see that significant potential for disparity exists. For example, the Vortal dataset was collected in the UK in 2016, and the authors did not provide the race of each participant. We can therefore infer the ethnic distribution from government records of UK ethnicity statistics: 7.5% Asian, 3.4% Black, 0.1% Other, and 80.0% white. The same method of inferring ethnicity was used for the remaining databases, as shown in Table 1. Furthermore, the fact that the racial groups were not clearly defined suggests a lax approach to constructing reference databases, particularly for vascular optical measurement technology that can be influenced by skin color characteristics. White subjects appeared in all four MIMIC databases where the ethnicity was clearly stated, constituting an average of 73.19% of the total population. However, Black subjects accounted for an average of only 9.29% of the sample population. In addition, Asian subjects comprised an average of 2.67% of the total population investigated. Such distributions highlight the potential for racial and ethnic biases in algorithms and devices, leading to possible challenges in their wider application in medicine.
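The location-based inference described above amounts to multiplying a dataset's size by national population shares. The sketch below uses the UK shares quoted in the text; the function name and the cohort size of 1000 are hypothetical and chosen only for illustration, since the actual dataset sizes are given in Table 1.

```python
# UK population shares as quoted in the text above.
UK_SHARES = {"Asian": 0.075, "Black": 0.034, "Other": 0.001, "white": 0.800}


def infer_expected_counts(n_subjects, shares):
    """Expected number of subjects per group if the cohort mirrored
    the national shares. This is an inference, not recorded data."""
    return {group: n_subjects * share for group, share in shares.items()}


# HYPOTHETICAL cohort size, for illustration only.
counts = infer_expected_counts(1000, UK_SHARES)

# The quoted shares do not sum to 100%, so part of the cohort is unassigned.
unassigned_share = 1.0 - sum(UK_SHARES.values())

for group, expected in counts.items():
    print(f"{group}: about {expected:.0f} subjects")
print(f"Unassigned share: {unassigned_share:.1%}")
```

Because a study cohort rarely mirrors national demographics exactly, counts inferred this way are best treated as rough estimates, which is consistent with excluding these databases from the statistical analysis.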

Our findings highlight clear disparities in pulse oximetry databases. As these biased databases would be used during the premarket phase to adjust pulse oximeter accuracy and to develop algorithms for oxygen saturation determination, they place subjects with darker skin pigmentation at increased risk of unrecognized health conditions3. Such health inequalities necessitate the development of new pulse oximeter databases with more racially balanced populations. Our recommendation does not deny the value of exploiting existing biased databases; rather, it attempts to benefit from using these publicly available databases when testing developed algorithms, as well as aiming for more balanced populations in future databases. Asian and Black populations have low representation in existing databases and it would also be beneficial to create an increased number of databases from different geographical regions.

Since last year, the US Food and Drug Administration has started to issue new guidelines to evaluate pre- and post-market pulse oximeters3, and to increase awareness of racial and ethnic disparities that can affect the accuracy of pulse oximetry algorithms. As publicly-accessible databases are commonly used for the development of many biomedical algorithms and devices, our findings highlight the need to improve device algorithms and expand these databases to better represent a diversity of skin pigmentations regardless of the racial or ethnic group. Improving diversity in public databases would help improve the general accuracy of AI algorithms, especially for measurements that involve frequently life-threatening conditions such as COVID-19.