Introduction

Continuous pulse oximetry has been a valuable tool for neonatologists since the early 1980s [1, 2]. Titration of supplemental oxygen to maintain a narrow window of oxygen saturation is essential to reduce the risk of retinopathy of prematurity [3, 4], bronchopulmonary dysplasia [5, 6], and death [7, 8]. Continuous pulse oximetry (SpO2) is superior to clinical observation alone; without it, desaturation can only be detected once arterial saturation (SaO2) has dropped below 80% and cyanosis develops [9, 10]. Pulse oximetry also avoids frequent phlebotomy for blood gas analysis, which is painful and causes iatrogenic anemia.

Oxyhemoglobin (HbO2) and deoxyhemoglobin (Hb) absorb red (660 nm) and infrared (940 nm) light differently—saturated blood permits increased transmission of red light but decreased transmission of infrared light. Accordingly, a pulse oximeter consists of red and infrared light emitters and a photoreceiver positioned on opposite sides of an arterial bed and measures the quantity of light transmitted through tissue. As oxygen saturation is the ratio of oxyhemoglobin to total hemoglobin, arterial saturation (SaO2) can be derived from variation in light absorption [11,12,13].

Melanin is a secondary absorber of near-infrared light and may impact pulse oximeter accuracy. A recent comparison of oxygen saturation determined by pulse oximeter and arterial blood gas (ABG) in adults demonstrated a notable overestimation of oxygen saturation and an increased incidence of occult hypoxemia in patients self-identified as Black [14]. Another recent study of adults admitted to the ICU with COVID-19 [15] identified suboptimal pulse oximeter accuracy. As this study used a cohort with nearly 70% of individuals identified as Black, Asian or Minority Ethnic, the authors speculated that greater melanin concentration may have contributed to increased inaccuracy [16]. Given the link between hypoxemia and adverse outcomes in preterm infants, overestimation of oxygenation may be problematic.

For this study, we identified a cohort of preterm infants born before 32 weeks gestation who had simultaneous collection of timed ABG samples and pulse oximetry. We hypothesized that differences in secondary light absorption between Black and White infants would lead to systematic error in pulse oximeter-based determination of arterial oxygen saturation.

Methods

Cohort development

All preterm infants born between 2012 and 2019 with a gestational age of less than 32 weeks, birth weight less than 1500 g, and admitted to the St. Louis Children’s Hospital Neonatal Intensive Care Unit (SLCH NICU) were eligible for inclusion.

Infants admitted to the SLCH NICU undergo continuous vital sign monitoring including pulse oximetry. Patient monitors were either Philips IntelliVue MP70 or MX800 (Philips Medical, Andover, MA), but both use a common pulse oximeter, the Nellcor SpO2 Module (Medtronic, Minneapolis, MN) with the Neonatal-Adult MAX-N adhesive SpO2 sensor (Covidien, Mansfield, MA). Vital sign data are automatically captured in an electronic database (BedMasterEX, Excel Medical, Jupiter, FL) sampled once per second (1 Hz). During the study period, all infants had standardized oxygen saturation targets to maintain SpO2 between 90 and 95% with alarm limits set between 88 and 96%. After infants reached 35 weeks post-menstrual age (PMA), the target range was changed to 90–100% with the desaturation alarm set to 88%.

Infants were included in the study if they met gestational age and birth weight criteria, had valid vital sign data, and had at least one ABG performed during hospitalization. Standard clinical variables were collected including gestational age, birth weight, sex, antenatal steroid exposure, method of delivery, and Apgar scores. Infants were classified as Black or White based on parental identification on birth certificates. Infants of Hispanic, Asian, or unspecified descent make up a small proportion of admissions to the SLCH NICU and were excluded as there would be an insufficient number for a representative sample. The study was reviewed and approved by the IRB at Washington University under waiver of consent.

ABG analysis

Invasive arterial lines are placed for frequent blood sampling and/or arterial blood pressure monitoring. ABG analysis was performed in a consistent manner across all patients—the line was accessed and cleared and a minimum sample volume of 0.5 mL was obtained and immediately brought to the clinical laboratory where it was run on a gas analyzer (ABL800 Flex, Radiometer America, Brea, CA) yielding a measurement of arterial oxygen saturation (SaO2). For each patient in the cohort, the measured SaO2 and the date/time of sample acquisition were recorded.

SpO2 processing and bias calculation

Raw recording files were converted to MATLAB format (The MathWorks, Natick, MA) using conversion software (University of Virginia, Charlottesville, VA). A processing script identified the location corresponding to the date/time of the ABG and extracted a time-matched 60-second window of SpO2 data centered on the ABG timepoint (30 s before and after), which was then averaged. Averaging over 60 s was employed to reduce the impact of transient fluctuations in the SpO2 and to mimic the typical length of time taken to draw an arterial sample for the SaO2.
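The window-averaging step can be illustrated with a minimal Python sketch. This is not the processing script used in the study (which ran in MATLAB); the function name `window_mean_spo2`, the representation of timestamps as seconds, and the handling of missing readings as `None` are assumptions made for illustration only.

```python
from statistics import mean


def window_mean_spo2(timestamps, spo2, abg_time, half_width_s=30):
    """Average the SpO2 samples within +/- half_width_s seconds of abg_time.

    timestamps: sample times in seconds (1 Hz in the study), same length as spo2.
    spo2: SpO2 readings in percent; missing readings are None (an assumption).
    Returns None when no valid sample falls inside the window.
    """
    values = [
        value
        for t, value in zip(timestamps, spo2)
        if value is not None and abs(t - abg_time) <= half_width_s
    ]
    return mean(values) if values else None
```

With inclusive bounds and 1 Hz sampling, the window holds up to 61 samples; whether the original script treated the endpoints as inclusive is not stated.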

Accuracy measurements

The accuracy of pulse oximetry to estimate arterial oxygen saturation can be evaluated in several different ways. The primary methods used in FDA clearance are mean bias and accuracy root mean squared; however, we performed an expanded analysis of pulse oximeter accuracy across six different metrics including:

  1. Mean bias (B) – Mean bias is the average difference between SaO2 and SpO2 and is calculated using the formula \(B = \frac{\sum_{i=1}^{n} (SpO_2 - SaO_2)}{n}\). Positive values resulting from this calculation indicate an overestimate of SaO2 by the pulse oximeter while negative values indicate an underestimate.

  2. Accuracy root mean squared (Arms) – Arms is a related measure of the average difference between SaO2 and SpO2 and is calculated using the formula \(A_{rms} = \sqrt{\frac{\sum_{i=1}^{n} (SpO_2 - SaO_2)^2}{n}}\) [17]. Given the quadratic nature of this calculation, Arms is never negative: it equals 0 when there is no error at all and increases as the magnitude of the errors increases.

  3. Proportional bias – Also called Bland–Altman analysis, this method was developed to quantify differences between measurement methods. As with B and Arms, the mean difference between the two methods is first calculated. From these differences, the 95% limits of agreement can be computed as the average difference ± 1.96 standard deviations. Bland–Altman analysis can be used to identify the relationship of discrepancies between two measurement methods, also called proportional bias. The presence of proportional bias indicates that the degree of disagreement varies over the range of measurements.

  4. Prevalence of occult hypoxemia – The occult hypoxemia definition of Sjoding et al. was adapted to the preterm population. In this case, occult hypoxemia was defined as a true SaO2 < 85% when SpO2 reads a value in the normal range (SpO2 ≥ 90%).

  5. Sensitivity/specificity for detection of occult hypoxemia – The sensitivity of the pulse oximeter to detect true hypoxemia (SpO2 < 90% when SaO2 < 85%) was calculated as \(\frac{True\; positive}{True\; positive + False\; negative}\). The specificity of the pulse oximeter to detect true hypoxemia was calculated as \(\frac{True\; negative}{True\; negative + False\; positive}\). Sensitivity and specificity calculations were made for the overall cohort and within racial groups.
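The occult hypoxemia, sensitivity, and specificity definitions above can be expressed as a small confusion-matrix sketch in Python. The function names and the `(spo2, sao2)` tuple layout are illustrative assumptions, not the study's code; the thresholds follow the definitions given in the text (SaO2 < 85% as the condition, SpO2 < 90% as the positive test).

```python
def classify_pairs(pairs, sao2_cut=85.0, spo2_cut=90.0):
    """Count confusion-matrix cells for (spo2, sao2) pairs, both in percent.

    Condition present: SaO2 < sao2_cut (true hypoxemia).
    Positive test:     SpO2 < spo2_cut (oximeter flags hypoxemia).
    A false negative is an occult hypoxemia event (SaO2 < 85%, SpO2 >= 90%).
    """
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for spo2, sao2 in pairs:
        hypoxemic = sao2 < sao2_cut
        flagged = spo2 < spo2_cut
        if hypoxemic:
            counts["tp" if flagged else "fn"] += 1
        else:
            counts["fp" if flagged else "tn"] += 1
    return counts


def sensitivity(counts):
    """TP / (TP + FN); None when no hypoxemic samples are present."""
    total = counts["tp"] + counts["fn"]
    return counts["tp"] / total if total else None


def specificity(counts):
    """TN / (TN + FP); None when no non-hypoxemic samples are present."""
    total = counts["tn"] + counts["fp"]
    return counts["tn"] / total if total else None
```

In this layout the prevalence of occult hypoxemia is `counts["fn"]` divided by the total number of pairs.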

While B and Arms were calculated across the entire range of SpO2 values, an evaluation of local bias was also conducted on the “clinically relevant” SpO2 range of 85–100%. An additional limited investigation was performed to study the impact of post-menstrual age (PMA) at the time of sampling on measurement error.
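For readers who wish to reproduce the bias metrics, a minimal Python sketch follows. It assumes paired readings stored as `(spo2, sao2)` tuples and uses the sample standard deviation for the limits of agreement; the function names and these choices are illustrative assumptions rather than a description of the study's actual scripts.

```python
from math import sqrt
from statistics import mean, stdev


def differences(pairs):
    """Per-sample bias, SpO2 minus SaO2 (positive = oximeter overestimates)."""
    return [spo2 - sao2 for spo2, sao2 in pairs]


def mean_bias(pairs):
    """Mean bias B: the average of the per-sample differences."""
    return mean(differences(pairs))


def a_rms(pairs):
    """Accuracy root mean squared: square root of the mean squared difference."""
    d = differences(pairs)
    return sqrt(sum(x * x for x in d) / len(d))


def limits_of_agreement(pairs):
    """Bland-Altman 95% limits: mean difference +/- 1.96 sample SDs."""
    d = differences(pairs)
    centre, spread = mean(d), stdev(d)
    return centre - 1.96 * spread, centre + 1.96 * spread


def local_pairs(pairs, low=85.0, high=100.0):
    """Keep only pairs whose SpO2 lies in the clinically relevant range."""
    return [(spo2, sao2) for spo2, sao2 in pairs if low <= spo2 <= high]
```

Local bias is then `mean_bias(local_pairs(pairs))`, that is, the same calculation restricted to SpO2 values of 85–100%.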

Statistical approach

Infant characteristics underwent univariate comparison using non-parametric methods including Fisher’s Exact Test for categorical variables and Mann–Whitney U test for continuous variables. The proportion of samples where occult hypoxemia occurred was compared between Black and White infants using Fisher’s Exact Test.
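As an illustration of the categorical comparison, the sketch below computes a two-sided Fisher's exact p-value for a 2×2 table in plain Python. It is not the software used in the study; the function names are hypothetical, and the two-sided rule (summing all tables no more probable than the observed one) is one common convention that may differ slightly from other implementations.

```python
from math import exp, lgamma


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    log_denominator = _log_comb(row1 + row2, col1)

    def log_prob(x):
        # Hypergeometric log-probability of x events in row 1, margins fixed.
        return _log_comb(row1, x) + _log_comb(row2, col1 - x) - log_denominator

    x_min, x_max = max(0, col1 - row2), min(row1, col1)
    log_observed = log_prob(a)
    # Sum every table that is no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    total = sum(
        exp(lp)
        for lp in (log_prob(x) for x in range(x_min, x_max + 1))
        if lp <= log_observed + 1e-7
    )
    return min(1.0, total)
```

For the occult hypoxemia comparison, the table would be built from the reported counts, for example `fisher_exact_two_sided(188, 2044 - 188, 181, 2343 - 181)`; this sketch is not guaranteed to reproduce the published p-value exactly.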

The relationships among SpO2, SaO2, bias, and PMA were modeled using the Pearson correlation coefficient and conventional linear regression. Non-linear regression was performed using a conditional mean function, where the outcome variable (e.g., SaO2, bias score) was modeled as a function of SpO2. In this approach, a larger dataset is broken into smaller local subsets, to which low-order polynomials are fit using a least-squares approach, weighting points closer to the center of each subset more than those distant. Smoothed conditional means were calculated using the ggplot2 package for the R statistical package (R version 4.0.3, R Foundation for Statistical Computing, Vienna, Austria) utilizing the locally estimated scatterplot smoothing (LOESS) method.
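The idea behind locally weighted smoothing can be sketched in a few lines of Python. This is a simplified illustration, not the ggplot2/R implementation used in the study: it fits a local straight line (degree 1) with tricube weights and omits robustness iterations, and the function names and the default `span` of 0.75 are assumptions chosen for the example.

```python
def tricube(u):
    """Tricube kernel: weight falls smoothly from 1 at u = 0 to 0 at |u| >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def loess_point(xs, ys, x0, span=0.75):
    """Locally weighted linear fit of ys on xs, evaluated at x0."""
    n = len(xs)
    k = min(n, max(2, round(span * n)))
    # The distance to the k-th nearest neighbour sets the local bandwidth.
    h = sorted(abs(x - x0) for x in xs)[k - 1]
    weights = [tricube((x - x0) / h) if h > 0 else 0.0 for x in xs]
    sw = sum(weights)
    if sw == 0.0:
        # Degenerate neighbourhood (ties): fall back to an unweighted local mean.
        near = [y for x, y in zip(xs, ys) if abs(x - x0) <= h]
        return sum(near) / len(near)
    mx = sum(w * x for w, x in zip(weights, xs)) / sw
    my = sum(w * y for w, y in zip(weights, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)


def loess_curve(xs, ys, grid, span=0.75):
    """Evaluate the smoother at each value in grid."""
    return [loess_point(xs, ys, g, span) for g in grid]
```

Here `xs` would hold SpO2 values and `ys` the outcome (SaO2 or bias); evaluating `loess_curve` separately for each group gives one smoothed conditional-mean curve per group.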

Results

Study cohort description

A total of 639 infants met gestational age and birth weight criteria; 186 infants were excluded as they did not have an ABG performed during their NICU hospitalization, 125 infants were excluded for lack of SpO2 data at the time of ABG, and 34 infants were excluded for Asian, Hispanic, or not listed background, yielding 294 infants. For the remaining 294 infants, a total of 4,387 SaO2–SpO2 pairs were available for further analysis. The median number of samples per infant was 11 (IQR 4–23) and the median postnatal age at sampling was four days (IQR 2-7), reflecting our typical clinical practice of obtaining an arterial blood gas every 8 h from an arterial catheter in place for an average of four days following birth for VLBW infants. Most of the samples were obtained within the first week of life (75% within seven days, Supplemental Fig. 1).

Of the 294 included infants, 42% were Black and 58% were White, consistent with the general demographic profile of infants admitted to the SLCH NICU (40% Black, 50% White, 10% Asian, Hispanic, or not listed). The two groups of infants were similar except for slightly lower birth weight (805 g vs. 875 g, p = 0.02) and median one-minute Apgar score (2 vs. 3, p < 0.01) in the Black infants. All other characteristics were not statistically different. A complete descriptive summary of the cohort can be found in Table 1.

Table 1 Cohort descriptive statistics.

Occult hypoxemia

The number of data samples was balanced between the two groups, with 2044 samples for Black infants and 2343 samples for White infants. True hypoxemia (defined as SaO2 < 85%) was noted slightly more often in Black infants, being identified in 312/2044 (15.2%) of samples as compared to 293/2343 (12.5%) of samples for White infants. Occult hypoxemia (defined as SaO2 < 85% when SpO2 ≥ 90%) was more common in Black infants, occurring in 188/2044 (9.2%) of samples compared to 181/2343 (7.7%) of samples for White infants, although this difference did not meet statistical significance (p = 0.08).

Sensitivity and specificity for detection of hypoxemia

Of the 4387 SaO2–SpO2 pairs collected in this study, 605/4387 (13.7%) were noted to have true hypoxemia. Overall, the sensitivity of the pulse oximeter for detecting true hypoxemia (defined as SpO2 < 90% when SaO2 < 85%) was 38% while the specificity was 89%. In subgroup analysis by race, sensitivity and specificity were similar for Black infants (39% sensitive, 81% specific) and White infants (38% sensitive, 78% specific).

Mean bias and Arms

The bias of each individual measurement was calculated as previously noted (SpO2 saturation–ABG saturation) where a positive value indicates an overestimation of the true arterial saturation, and a negative value indicates an underestimation. The mean bias of the overall sample was 1.19%, indicating an overall overestimation of true oxygen saturation by a little more than 1% for all infants. Local bias was calculated for the “clinically relevant” range of saturations between 85 and 100% and was noted to be somewhat higher at 1.79%.

When comparing mean bias between Black and White infants, overestimation by the pulse oximeter was 2.4-fold greater for Black infants than for White infants (mean bias of 1.73% vs. 0.72%, p < 0.01). A similar discrepancy was noted between Black and White infants when looking at local bias for saturations between 85 and 100%, with 1.5-fold greater overestimation for Black infants compared to White infants (2.22% vs. 1.41%, p < 0.01).

Arms over the entire range of saturations was notable at 9.2%, exceeding the desired specification of 2–3% [17, 18]. As would be expected, Arms worsened as saturations decreased with Arms = 7.9% for SpO2 90–100%, 10.8% for SpO2 80–89%, 16.9% for SpO2 70–79%, and 27.9% for SpO2 <70%. Consistent with other measures of pulse oximeter accuracy, there was a racial discrepancy in Arms values between Black and White infants (9.5% vs. 8.9%).

Bland–Altman analysis

In Bland–Altman analysis, an overall bias of 1.19% was confirmed, with a 95% confidence interval of 0.92–1.46% over the entire range of oxygen saturation. As shown in Fig. 1, the magnitude of the difference quickly falls outside the limits of agreement (±1.96 SD) once the oxygen saturation drops below 90%, with an increasingly large overestimation as saturation decreases.

Fig. 1: Bland–Altman plot demonstrating agreement of ABG and SpO2.

Note the positive bias overall (indicating overestimation) which worsens as SpO2 decreases.

Linear and non-linear correlation

The overall linear correlation between SaO2 and SpO2 was moderate, albeit statistically significant (R2 = 0.244, p < 0.01). When examined separately, the correlation is slightly stronger for Black infants than White infants (R2 = 0.254 vs. 0.236). In Fig. 2, a scatterplot of all SaO2–SpO2 values is shown, categorized by race. The slope and intercept of the best fit line demonstrate that not only do Black infants have higher recorded SpO2 measurements for a given SaO2, but this difference also increases as SpO2 drops.

Fig. 2: Scatterplot of oxygen saturation measured by pulse oximeter (x-axis) and ABG (y-axis).

Black infants are shown as gray dots, White infants are shown as white dots. Regression lines are shown for each race group. The line of unity is shown as a diagonal gray line.

Non-linear analysis (Fig. 3) demonstrates consistency in bias between Black and White infants only when SpO2 is between 96 and 100%. When SpO2 is ≤95%, there is a persistent bias gap between Black and White infants with a higher SpO2 value for a given SaO2 in Black infants, although the exact degree varies. While all pulse oximeters follow the same general pattern of overestimation at lower saturations and underestimation at higher saturations, the exact location of this crossover point is different for Black infants, which occurs at an approximate SpO2 of 90% compared to 92% for White infants. When SpO2 is less than 96%, there is a widening gap in the degree of error between Black and White infants, with greater underestimation of SpO2 in the White infants and overestimation in Black infants (Fig. 4).

Fig. 3: Measurement bias (SpO2 saturation–ABG saturation) is shown, clustered by SpO2 saturation and race.

Black infants are shown as gray dots, White infants are shown as white dots. Non-linear regression lines are shown for each group (solid for Black infants, dashed for White infants). Shaded areas represent 95% CI.

Fig. 4: Measurement bias is shown, clustered by SpO2 saturation and race over the clinically relevant SpO2 range of 85-100%.

Black infants are shown as gray dots, White infants are shown as white dots. Non-linear regression lines are shown for each group (solid for Black infants, dashed for White infants). Shaded areas represent 95% CI.

The mean bias by race was evaluated over the range of corrected PMA at the time of measurement. As shown in Supplemental Fig. 2, there is no notable variance by corrected PMA, although there is a significant amount of uncertainty after 35 weeks PMA due to sparse samples.

Discussion

These data reveal a complex relationship between pulse oximetry-based estimation of arterial oxygen saturation and race. Over this collection of SaO2–SpO2 pairs, Black infants were noted to have true hypoxemia more often than White infants. Pulse oximetry overestimated the true arterial oxygen saturation of Black VLBW infants by 1% more on average compared to White infants. The direction of error also varies by SpO2, with overestimation of the true saturation when SpO2 is greater than 90% and underestimation at saturations of 90% or below. The greater degree of imprecision for Black infants is well-summarized by the difference in Arms values (9.5% vs. 8.9%). The aggregate impact of each of these small differences was an increased incidence of occult hypoxemia in Black infants, occurring in 9.2% of samples compared to 7.7% of samples for White infants.

Although the difference in measurement bias between Black and White infants is small, these one-minute samples represent only a small fraction of each infant’s NICU hospitalization. There are many examples in the literature of the association between hypoxia and severe outcomes including an increased incidence of intraventricular hemorrhage (IVH) [19, 20] and death [7, 8]. In a previous publication, we identified that the difference in time spent with severe hypoxia between infants with severe IVH and those without severe IVH was less than 3% of the total recording time [21]. Thus, even relatively small increases in hypoxia burden driven by overestimation of true arterial oxygen saturation may have profound consequences. Although the degree of over- or underestimation varies by SpO2 value, infants spend much of their hospitalization with saturations greater than 85%; thus, SpO2 performance in this range is the primary driving factor and yields a net average overestimation for Black infants.

In the BOOST-II and SUPPORT trials, pulse oximeters with a masking algorithm were used to force separation in the distribution of oxygen saturations while still within what was intended to be the normal range of SpO2 for preterm infants. The degree of separation generated by the algorithm was small, on the order of 2.5–3% [22], but the difference in the outcome, namely mortality, was much greater (23.1 vs. 15.9%) [8] in the lower saturation group. The race-based systematic error identified in our study (1%) makes up a significant proportion of that separation. Indeed, when the distribution of SpO2 values for infants in this study are compared, Black infants have a greater distribution of higher SpO2 values while also a greater number of lower SaO2 values (Supplemental Fig. 3). In a recent manuscript [23] we described a significant gap in estimated mortality risk between Black and White infants, with greater predicted mortality for Black infants when all other factors were held constant. It is possible that a disparity in extremes of oxygenation is a contributing factor.

Factors altering the absorption of red and near-infrared light should be considered. Even at low gestational ages, infants have mixed hemoglobin subtypes. While the predominant subtype is fetal hemoglobin (HbF), adult hemoglobin (HbA) is also present with estimates ranging between 10 and 20% [24, 25]. In some investigations, the absorption spectra of fetal and adult hemoglobin were found to be identical [26, 27]; however, there is some evidence that an admixture of HbF and HbA results in impaired pulse oximetry accuracy (underestimation) by 3–4% [28, 29], although this effect occurs primarily at lower oxygen saturation [30]. Although HbF was not measured, the Black infants in this cohort, who had a slightly lower gestational age, may have had a higher proportion of HbF, contributing to greater SpO2 error.

Skin pigmentation may also play a role in pulse oximeter accuracy. Melanin is a pigment produced by melanocytes located in the basal epidermis after exposure to ultraviolet radiation (UV, 280–315 nm) [31]. Although melanogenesis accelerates during periods of increased UV-B exposure (suntan), different amounts of basal melanin production result in varying skin tones [32]. Although the peak absorption frequency of melanin is between 400 and 600 nm (ultraviolet band), it also absorbs light across visible and infrared frequencies and is a particularly strong absorber in the near-infrared range [33]. As light absorption in the near-infrared spectrum is used by the pulse oximeter to identify the quantity of oxy- and deoxyhemoglobin, additional absorption by a secondary chromophore (melanin) disrupts the expected relationship. This potential problem has been borne out in experimental data demonstrating that increasing amounts of melanin lead to increasing error in spectroscopic measurements of hemoglobin species [33].

Melanocytes can be identified in the embryonic epidermis in the first trimester [34]. There has been limited examination of the developmental trajectory of melanin by melanocytes at different gestational ages, although a small study suggests that significant differences in skin reflectance (indicative of sufficient melanin to alter light absorption) between White and Black patients do not occur until 32 weeks corrected gestational age [35].

Complicating this analysis is the frequency with which these infants receive phototherapy for jaundice. Phototherapy lamps are engineered to generate light in the same frequency range that unconjugated bilirubin maximally absorbs light (340–540 nm) [34]. This spectral range overlaps with the band which activates melanocytes including UV-A and B (290-400 nm) [36]. There are a number of reports of an increase in melanin production following typical clinical phototherapy treatment in Black and Asian infants [37]. Increased melanin production in response to phototherapy may accelerate the change in skin reflectance beyond what might be expected for gestational age, introducing the possibility of pulse oximetry error at younger than expected gestational ages.

The results of this study raise important questions about the utility and reliability of pulse oximetry in critical care. The modest correlation between SaO2 and SpO2 has been noted in other studies of infants and children [38,39,40], especially at lower saturation, and is likely the result of using healthy volunteers for calibration which may not be representative of the sick infant [11]. The impact of skin pigmentation has been investigated in several previous studies of adults and older children, although not in preterm infants. Studies of adults have reported mixed findings; some have identified overestimation of oxygen saturation by 2–10% in participants with greater skin pigmentation [41, 42]. Other studies [43, 44] have failed to replicate this difference, although both studies noted lower SpO2 signal quality in adults with greater skin pigmentation. There has been only a single investigation of racial differences in pulse oximeter performance for infants [45] and it did not demonstrate a difference between infants with light and dark pigment. However, this study had a small sample size and strict inclusion criteria (term infants, no anemia or hypotension, stable SpO2 for 2 minutes prior to sample) which are not typical for premature infants.

Although the validation data for the pulse oximeter used in this study is not publicly available, there are several other published neonatal validation studies using Nellcor-based pulse oximeters. Unfortunately, the race of the study population is either not provided in these reports [46,47,48,49,50,51,52] or Black infants are underrepresented in the study cohort [29]. Without intentional oversampling of Black infants or stratified analysis of infants by race, disparities in device performance are not apparent.

Infants in this study received a race classification based on self-report in birth certificate documentation. This binary approach does not capture the continuum of skin pigmentation, which likely has a variable impact on pulse oximeter accuracy. Several different systems for quantifying skin color have been suggested including the Fitzpatrick Skin Phototype [53], which uses a combination of visual appearance and response to ultraviolet light (burning vs. tanning). An alternative approach is the individual typology angle (ITA) which utilizes digital photography of a skin sample under highly controlled lighting conditions to quantify the amount of red-green, yellow-blue, and lightness-darkness (L*a*b* color) using software [54]. Although considerably more complex, melanin levels can also be quantified spectroscopically. Indeed, there are research near-infrared spectroscopy devices, tissue oxygen monitors which operate on similar physics principles to pulse oximetry, which utilize algorithms to quantify [55] and remove the influence of melanin in measurements [56, 57]. Notably, none of these approaches have been evaluated in the neonate.

There are several limitations of this study. First, blood gas samples can be contaminated by the accidental introduction of air into the syringe. This is a known problem in all blood gas analyses and samples are carefully examined (and potentially discarded) before the assay is run, minimizing this risk [58].

Second, the SpO2 sensor placement is rotated every twelve hours to prevent skin injury. In positions other than the right upper extremity, the probe is in a post-ductal position. For preterm infants with a patent ductus arteriosus (PDA), there is the possibility of a mismatch between pre- and post-ductal measurements. Sensor placement is not routinely charted in the medical record and PDA screening is not universal; thus, it is impossible to reconstruct when or how often this occurred.

Finally, infants in this study were assigned a binary race classification based on birth certificate data. The retrospective nature of this study prevents qualitative or quantitative assessment of skin tone for further interrogation. Future prospective studies should quantify melanin content, ideally spectroscopically, for a more granular understanding of the relationship between skin tone and device performance disparity. It is essential that preterm neonates are not excluded from investigations of correction algorithms.

In conclusion, we find a small but consistent racial disparity in oxygen saturation measurement by pulse oximetry and an increased incidence of occult hypoxemia in Black preterm infants. There is increasing awareness of racial disparities in outcomes of preterm infants, particularly the risk of mortality [59,60,61]. Knowledge of the potential for occult hypoxemia may lead to changes in oxygen saturation targeting, with more attention paid to the avoidance of low-normal saturations to reduce the risk of adverse outcomes.