Introduction

The ratio of the partial pressure of arterial oxygen (PaO2) to the fraction of inspired oxygen (FiO2), or PaO2/FiO2, is the reference standard for assessing low blood oxygen levels, or hypoxemia, in mechanically ventilated patients with respiratory failure. The PaO2/FiO2 ratio (PF ratio) has predictive value for mortality in patients with acute respiratory distress syndrome (ARDS)1 and is part of the Sequential Organ Failure Assessment (SOFA) score, a scoring system used to predict severity of illness in critically ill patients2,3,4. The PF ratio also informs clinical decision-making, including the decision to initiate prone positioning5 in ARDS patients with PF ratios ≤ 150. Currently, measurement of the PF ratio requires invasive arterial blood gas (ABG) sampling and does not provide a continuous measure of the patient’s oxygenation. Increasingly, non-invasive monitoring with pulse oximetry is used instead of ABGs6,7, particularly in low-resource settings where ABG monitoring may not be readily available. In contrast to invasive blood gas sampling, the ratio of peripheral oxygen saturation (SpO2) to FiO2, or SF ratio, can be calculated without blood collection, arterial puncture, or blood gas analyzers and may serve as a surrogate for the PF ratio. Notably, several studies have evaluated the SF ratio in children, where non-invasive measurements are increasingly favored8,9,10.

A few studies have examined non-linear imputation of PaO2/FiO2 from SpO2/FiO2 measurements recorded at the same time11,12. These studies11,13 reported that non-linear imputation is more accurate than log-linear or linear imputation, especially for moderate to severe hypoxemic respiratory failure with ARDS, where the PF ratio is < 200. However, in patients with respiratory failure requiring mechanical ventilation, the optimal equation for imputing PaO2/FiO2 from SpO2/FiO2 remains unclear. An algorithm that accurately imputes the PaO2 from the SpO2 in mechanically ventilated patients would benefit predictive modeling and clinical research, for example by facilitating recruitment of patients for clinical trials when an ABG is not available. Ideally, this approach would include only variables that contribute to the relationship between SpO2 and PaO2 and would not require the invasive ABG measurement needed for the PaO2 itself. From the clinical perspective, the SF ratio can be used as a less invasive surrogate for the PF ratio to diagnose ARDS or acute lung injury (ALI) with comparable reliability14.

The objective of this study is to develop a calculator that uses machine learning algorithms to impute PaO2 from non-invasive SpO2 measurements in mechanically ventilated patients in the Medical Information Mart for Intensive Care (MIMIC) III database15 and to compare the accuracy of the machine learning models to the previously published non-linear and log-linear equations11,13. Three common machine learning approaches (neural network16, regression17, and kernel-based methods18,19) were tested on regression and classification tasks using data available in MIMIC III20, first with 7 clinical features and subsequently with a 3-feature model. The regression task imputes PaO2 from SpO2 values, and the classification task identifies patients with moderate to severe hypoxemic respiratory failure based on a previously used cut-off11 of a predicted PF ratio ≤ 150. Our overall hypothesis is that a machine learning algorithm would predict the PaO2 from SpO2 better than the previously published equations across the entire span of SpO2 values.

Methods

The MIMIC-III database v1.4 (https://mimic.physionet.org) is an openly available dataset developed by the Massachusetts Institute of Technology Lab for Computational Physiology15. It contains de-identified health data associated with approximately 40,000 intensive care unit admissions for patients admitted to critical care units in the Beth Israel Deaconess Medical Center between 2001 and 2012. MIMIC-III is a relational database that contains information on demographics, vital signs, mechanical ventilation status, laboratory tests, medications, and mortality. We also used a validation cohort obtained from an existing database of de-identified clinical information from intensive care unit patients with Pseudomonas aeruginosa respiratory isolates from 2 hospitals within the University of Pittsburgh Medical Center (UPMC). This dataset similarly contains information on demographics, mechanical ventilation status, ventilator parameters, and laboratory tests. Our study using the MIMIC-III database was determined to be exempt by the University of Pittsburgh Institutional Review Board (STUDY19100068). The University of Pittsburgh Institutional Review Board approved the Pseudomonas aeruginosa ICU respiratory isolates database with a waiver of informed consent (STUDY21030010) and also approved the use of this database as an independent validation cohort (STUDY21090073). All methods were carried out in accordance with relevant guidelines and regulations.

Data processing

For the MIMIC-III database, we identified unique ICU encounters (icustay_id) with mechanical ventilation status. We next identified the lab event PaO2 and the chart event SpO2 recorded during mechanical ventilation. To minimize error between matched PaO2 and SpO2, we constrained the time gap between the lab event PaO2 and the chart event SpO2 to no more than 30 min. To minimize repeated sampling from the same subjects, we restricted the search of PaO2 measurements to the first 24 h of mechanical ventilation and used the first PaO2 recorded within this time frame. For chart events including tidal volume (TV), positive end-expiratory pressure (PEEP), FiO2, temperature, and mean arterial pressure (MAP), we constrained the time gap to within 2 h of the selected SpO2 measurement. Treatment with vasoactive infusions was recorded as a categorical variable. Data extraction and processing methods are available at https://github.com/renshuangxia/PaO2PredictorDjango (ref. 21). The online calculator is available at https://dikb.org/pa02-predictor.
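The 30 min matching constraint described above can be sketched as follows. This is a minimal illustration, not the authors' extraction code: the function name `match_spo2`, the tuple layout, and the timestamps are hypothetical, and choosing the closest SpO2 reading inside the window is an assumption, since the text states only the maximum allowed gap.

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(minutes=30)  # maximum PaO2-to-SpO2 gap stated in the text


def match_spo2(pao2_time, spo2_events, max_gap=MAX_GAP):
    """Return the (time, value) SpO2 event closest to pao2_time, or None.

    spo2_events is a list of (datetime, value) tuples. This layout is
    hypothetical and does not mirror the MIMIC-III schema.
    """
    in_window = [
        event for event in spo2_events
        if abs(event[0] - pao2_time) <= max_gap
    ]
    if not in_window:
        return None
    # Assumption: when several readings qualify, keep the nearest in time.
    return min(in_window, key=lambda event: abs(event[0] - pao2_time))


# Made-up timestamps for illustration only.
pao2_time = datetime(2020, 1, 1, 8, 0)
spo2_events = [
    (datetime(2020, 1, 1, 7, 10), 95.0),  # 50 min before: outside the window
    (datetime(2020, 1, 1, 7, 45), 96.0),  # 15 min before: inside the window
    (datetime(2020, 1, 1, 8, 20), 97.0),  # 20 min after: inside the window
]
matched = match_spo2(pao2_time, spo2_events)
```

The same pattern with a 2 h window would cover the other chart events (TV, PEEP, FiO2, temperature, MAP) relative to the selected SpO2 measurement.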

For the 3-feature model in the UPMC validation cohort, the database was queried for unique ICU patients requiring mechanical ventilation. The validation set includes 133 discrete individuals with ABGs obtained within 30 min of an SpO2 reading, similar to the constraints defined in the MIMIC-III derivation cohort.

Machine learning methods for regression task

For the regression task we implemented 3 different models—a neural network model, a linear regression model, and support vector regression (SVR), a type of kernel-based modeling. For each model, we applied a tenfold cross-validation22.
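As a rough sketch of what tenfold cross-validation involves, the helper below splits sample indices into ten folds and yields each fold once as the held-out test set. The function name `kfold_indices`, the shuffling, and the seed are assumptions for illustration; the text does not say how the folds were drawn.

```python
import random


def kfold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_folds-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# Hypothetical sample count chosen so the folds are not all the same size.
splits = list(kfold_indices(n_samples=103))
```

Each model would be fitted on `train_idx` and scored on `test_idx`, with the reported metric averaged over the ten folds.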

For the neural network model, we tested different network structures and various numbers of features to arrive at two models used for comparison with the linear and support vector regression models. One model used seven input features and three hidden layers (16, 8, and 5 neurons for layers 1–3). The other model used only three input features and two hidden layers (6 and 3 neurons for layers 1 and 2). Both final models used a hyperbolic tangent (tanh) activation function for all layers except the output layer, which used a linear function. Both models were trained for 200 epochs with the Adam optimizer, a learning rate of 0.001, and a batch size of 50.
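The two regression network shapes described above can be made concrete with a small forward-pass sketch. It assumes the hidden activation is tanh and uses random, untrained weights purely to show the layer structure; the helper names (`init_layers`, `forward`, `n_parameters`) are hypothetical and this is not the training code used in the study.

```python
import math
import random


def init_layers(sizes, seed=0):
    """Create random (weights, biases) per layer; placeholders for trained values."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, features):
    """Apply tanh on hidden layers and the identity (linear) on the output layer."""
    activations = features
    last = len(layers) - 1
    for i, (weights, biases) in enumerate(layers):
        pre = [sum(w * a for w, a in zip(row, activations)) + b
               for row, b in zip(weights, biases)]
        activations = pre if i == last else [math.tanh(z) for z in pre]
    return activations[0]


def n_parameters(layers):
    """Count trainable weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


seven_feature_net = init_layers([7, 16, 8, 5, 1])  # 7 inputs, hidden 16-8-5, 1 output
three_feature_net = init_layers([3, 6, 3, 1])      # 3 inputs, hidden 6-3, 1 output
```

Under these layer sizes the seven-feature network has 315 trainable parameters and the three-feature network has 49, which gives a sense of how small the parsimonious model is.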

For the linear regression model, the output variable can be computed by a linear combination of the input variables. We trained the linear regression equation by the Ordinary Least Squares approach. We used the linear_model.LinearRegression method from scikit-learn 0.22 (https://scikit-learn.org/stable/) with default hyperparameters for predicting PaO2 values.

For the SVR model, we tested multiple kernels, including the linear kernel, polynomial kernel, and radial basis function (RBF) kernel. Based on the performance in the training data, the RBF kernel was selected.

Machine learning methods for classification task

We used PaO2/FiO2 ≤ 150, an accepted threshold previously used to capture patients with moderate to severe disease meeting the criteria for ARDS11,13, and tested how well each imputation technique predicts this diagnostic threshold. We implemented three classification models: a neural network, logistic regression, and a kernel-based model, the support vector machine (SVM).

For each machine learning model, we applied a tenfold cross-validation and calculated the sensitivity, specificity, likelihood ratios, diagnostic Odds Ratio (OR), Area Under Receiver Operating Characteristic curve (AUROC), F1 score and Bayesian Information Criterion (BIC) to compare across models. The two neural network models for classification were similar to the neural networks used in regression, except that the output layer used the sigmoid function. As with the regression models, various topologies were tested to arrive at the final two multi-layer perceptron (MLP) classifiers, one with seven input features and the other with three input features. The seven-feature model used five hidden layers of sizes (12, 8, 6, 4, 4). The three-feature model used two hidden layers of sizes 6 and 3. All hidden layers used the hyperbolic tangent (tanh) activation function. We trained both models for 200 iterations with the Adam optimizer, setting the momentum value to 0.8 for the seven-feature classifier and 0.6 for the three-feature classifier. The learning rate was 0.001 and the batch size was 200 for both models.
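Several of the metrics listed above follow directly from a 2×2 confusion matrix. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not taken from the study, and whichever class is treated as positive determines how the cells are filled.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Summary statistics from a 2x2 confusion matrix (all cells must be > 0)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    lr_pos = sensitivity / (1 - specificity)   # positive likelihood ratio
    lr_neg = (1 - sensitivity) / specificity   # negative likelihood ratio
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
        "diagnostic_or": lr_pos / lr_neg,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical counts for illustration only.
metrics = diagnostic_metrics(tp=80, fp=10, fn=20, tn=90)
```

The diagnostic odds ratio computed this way equals (TP × TN) / (FP × FN), and the F1 score equals 2TP / (2TP + FP + FN), which offers a quick consistency check on any reported table.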

In addition, we implemented a basic logistic regression model for classification, as well as the SVM model, which classifies examples with an optimal hyperplane. Logistic regression uses the logistic function to model a binary dependent variable. We used the linear_model.LogisticRegression method provided in the scikit-learn library without regularization, with all other arguments set to their defaults. For the SVM model, we compared the results of different kernels, and the RBF kernel outperformed the others. Methods were otherwise similar to those used in the regression task.

Comparison of machine-learning based algorithm to published non-linear and log-linear equations

We compared the performance of our machine learning algorithms to the previously published equations. For the non-linear equation from Brown et al.11, the PaO2 was imputed from the SpO2 as shown in Eq. (1), where PO2 = PaO2, S = SpO2 and F = FiO2. When the recorded SpO2 was 100% (that is, 1.0), the SpO2 was replaced with 0.996, because the equation is undefined at S = 1.0.

Non-linear equation to impute PaO2 from the SpO2 (Reprinted with permission - see Acknowledgment section).

$$\begin{aligned} PO_{2} & = \left\{ \frac{11{,}700}{1/S - 1} + \left[ 50^{3} + \left( \frac{11{,}700}{1/S - 1} \right)^{2} \right]^{1/2} \right\}^{1/3} \\ & \quad + \left\{ \frac{11{,}700}{1/S - 1} - \left[ 50^{3} + \left( \frac{11{,}700}{1/S - 1} \right)^{2} \right]^{1/2} \right\}^{1/3} . \end{aligned}$$
(1)
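Eq. (1) can be sketched in a few lines of code, including the substitution of 0.996 for a recorded SpO2 of 1.0. The function names are hypothetical. One implementation detail is worth noting: the second brace in Eq. (1) is negative, so a real cube root is needed rather than a fractional power. The helper `saturation_from_pao2` is included only as a check; it encodes the cubic relation P³ + 150P = 23,400 / (1/S − 1), of which Eq. (1) is the closed-form real root (the Severinghaus form of the oxyhemoglobin dissociation curve).

```python
import math

SPO2_CAP = 0.996  # substitute used in the text when SpO2 is recorded as 1.0


def cbrt(x):
    """Real cube root that also handles negative inputs."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


def impute_pao2_nonlinear(spo2):
    """Eq. (1): impute PaO2 (mmHg) from SpO2 given as a fraction in (0, 1]."""
    s = min(spo2, SPO2_CAP)
    a = 11700.0 / (1.0 / s - 1.0)
    root = math.sqrt(50.0 ** 3 + a ** 2)
    return cbrt(a + root) + cbrt(a - root)


def saturation_from_pao2(pao2):
    """Saturation curve that Eq. (1) inverts; used here only as a round-trip check."""
    return 1.0 / (23400.0 / (pao2 ** 3 + 150.0 * pao2) + 1.0)
```

Dividing the returned PaO2 by F (FiO2 as a fraction) gives the imputed PF ratio.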

For the log-linear equation from Pandharipande et al.11,13, the PaO2/FiO2 was imputed from SpO2/FiO2 using Eq. (2):

Log-linear equation to impute PaO2 from the SpO2 (Reprinted with permission - see Acknowledgment section).

$$PO_{2} = F \cdot 10^{\left( 0.48 + 0.78 \cdot \log_{10} \left( \frac{S}{F} \right) \right)} .$$
(2)
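Eq. (2) translates directly into code. The sketch below assumes S is entered in percent (for example, 95) and F as a fraction (for example, 0.5); the text does not state the units explicitly, so this convention should be checked against the original publication before use. The function name is hypothetical.

```python
import math


def impute_pao2_loglinear(spo2, fio2):
    """Eq. (2): PaO2 = F * 10 ** (0.48 + 0.78 * log10(S / F)).

    Assumes spo2 is in percent and fio2 is a fraction (unit convention assumed).
    """
    log_pf = 0.48 + 0.78 * math.log10(spo2 / fio2)
    return fio2 * 10.0 ** log_pf


# Example inputs chosen for illustration only.
pao2 = impute_pao2_loglinear(spo2=95.0, fio2=0.5)
pf_ratio = pao2 / 0.5
```

Because the equation is linear on the log scale, log10 of the imputed PF ratio is exactly 0.48 + 0.78 × log10(SF).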

Sensitivity analysis

To compare the performance of our machine learning algorithms with the previously published equations across racial groups, we performed a sensitivity analysis restricted to patients of self-reported White or Black race. For each machine learning model, we implemented a tenfold cross-validation and calculated the sensitivity, specificity, likelihood ratios, diagnostic OR, AUROC, F1 score, root-mean-square error (RMSE), and BIC to compare across models.
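For readers less familiar with the two regression metrics, a minimal sketch follows. RMSE uses its standard definition. The BIC shown is the common Gaussian-error form, n·ln(RSS/n) + k·ln(n), which is an assumption: the text does not state which BIC formula was used. The function names and the example values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def bic_gaussian(observed, predicted, n_params):
    """BIC under a Gaussian error model: n*ln(RSS/n) + k*ln(n) (assumed form)."""
    n = len(observed)
    rss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return n * math.log(rss / n) + n_params * math.log(n)


# Hypothetical PaO2 values in mmHg, for illustration only.
observed = [80.0, 100.0, 120.0, 150.0]
predicted = [90.0, 95.0, 130.0, 140.0]
error = rmse(observed, predicted)
```

For a fixed fit, the BIC grows with the number of parameters, so a smaller model is preferred unless the larger one reduces the residual error enough to offset the penalty.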

Results

A parsimonious three-feature model is sufficient to impute the PaO2/FiO2 ratio using a large dataset

An overview of the machine learning tasks is outlined in Fig. 1. We initially chose seven relevant features from the chart events (SpO2, FiO2, TV, MAP, temperature, PEEP and vasopressor administration) representing recorded bedside measurements that were independent from an invasive arterial blood gas measurement. When applying the seven features to impute the PaO2, the final data set contained 9900 unique ICU encounters from 9302 mechanically ventilated patients (Supplementary Table e1). The relationship between SpO2/FiO2 (S/F) and the PaO2/FiO2 (P/F) was examined in dataset 1 containing 9900 unique ICU events from the MIMIC-III database and was best described by a log-linear relationship between the transformed logarithmic value of the SF and PF ratios as previously described by Pandharipande et al.13 (Supplementary Fig. e1). The relationship between S/F and P/F ratios showed high variance across the distribution of mechanically ventilated subjects (R2 = 0.21).

Figure 1

Overview of the experimental study design.

For the regression task, we derived the RMSE and BIC for each of the different seven feature machine learning models (neural network, linear regression, support vector regression) to assess the performance of the imputation techniques. The RMSE and BIC of the three machine learning methods are shown in Supplementary Table e2. All the machine learning models outperformed the previously published non-linear and log-linear equations as shown by lower RMSE score; the same was observed for subset 1 (SpO2 < 97%). For the classification task, the three machine learning methods achieved similar classification performance according to F1 scores, as shown in Supplementary Table e3; the same pattern was observed for subset 1 (SpO2 < 97%).

To improve practicality of the method at the bedside, we attempted to use the smallest number of features possible to predict the PaO2 or PaO2/FiO2 ratio from the regression and classification tasks, respectively. Compared to the other measured variables, PEEP had the strongest correlation with PaO2/FiO2 (r = − 0.31) outside of the SF ratio (SpO2/FiO2) (Table 1). Using this information, we created a 3-feature model using SpO2, FiO2 and PEEP. As compared to seven features, three features were sufficient to impute PaO2/FiO2 ratio with a similar degree of accuracy. The 3-feature model was therefore utilized in the remainder of the analysis for the machine learning algorithms. The final 3-feature data set (dataset 2) contained 20,198 ICU encounters from 17,818 unique patients (Table 2). Forty percent of subjects were of female sex and the mean age was 64 years. The degree of hypoxemic respiratory failure, as measured by the PaO2/FiO2 ratio1, showed a distribution in which 26% had mild respiratory failure (PaO2/FiO2 = 201–300), 22% had moderate respiratory failure (PaO2/FiO2 = 101–200), and 8% had severe respiratory failure (PaO2/FiO2 ≤ 100).
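The severity bands and the classification cut-off used in this study can be expressed as two small helpers. The function names are hypothetical, and treating the bands as contiguous (so that a non-integer ratio such as 200.5 falls in the mild band) is an assumption, since the text lists integer ranges.

```python
def pf_category(pf_ratio):
    """Map a PaO2/FiO2 ratio to the severity bands listed in the text."""
    if pf_ratio <= 100:
        return "severe"
    if pf_ratio <= 200:
        return "moderate"
    if pf_ratio <= 300:
        return "mild"
    return "none"


def meets_classification_threshold(pao2, fio2, threshold=150.0):
    """True when PaO2/FiO2 is at or below the classification cut-off of 150."""
    return pao2 / fio2 <= threshold
```

For example, an imputed PaO2 of 75 mmHg at FiO2 = 0.5 gives a PF ratio of 150, which falls in the moderate band and meets the ≤ 150 threshold.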

Table 1 Correlation coefficients between PF ratios and variables.
Table 2 Subject characteristics based on three features.

Machine learning models show improved performance when compared to the prior published equations for regression

We quantitatively derived the RMSE for all of the machine learning and previously published models and the BIC for each of the three machine learning models to assess the performance of the different imputation techniques (Table 3). The RMSE of the neural network, linear regression and SVR machine learning models were 84.7, 88.8 and 85.9, respectively, compared to 117.7 and 91.8 for the log-linear and non-linear equations. The lower RMSE values indicate that the three machine learning models outperformed the previously published equations. Of the machine learning models, the neural network method showed the lowest RMSE as well as the lowest BIC in both the whole dataset (dataset 2) and for SpO2 < 97% (subset 2). A Bland–Altman Plot suggests that the neural network model is comparable to the published equations (Supplementary Fig. e2). There was decreasing accuracy at higher PaO2/FiO2 ratios for all the methods examined.

Table 3 RMSE and BIC of the 3-feature machine learning models regression tasks compared to published methods.

Machine learning models show improved performance for the classification task

We compared the performance of the machine learning models with the log-linear and non-linear equations using F1 scores. Similar to the findings for the regression task, all three machine learning models performed better in the whole dataset than the log-linear and non-linear equations (Table 4). When the dataset was limited to SpO2 < 97% (subset 2), the machine learning methods performed slightly better than the log-linear equation and better than the non-linear equation (Table 4). The F1 scores for all three machine learning methods were similar when using the whole dataset (dataset 2) and for subset 2 where SpO2 < 97%. As shown in Fig. 2, when comparing the 3 machine learning models to one another, the neural network performed slightly better in the whole dataset (area under the precision-recall curve = 0.94 for the neural network compared to 0.93 and 0.91 for the logistic regression and support vector machine models, respectively). The three models had similar performance in subset 2.

Table 4 Prediction performance of machine learning classification models based on three features.
Figure 2

Precision-recall curves of machine learning models in Dataset 2 and Subset 2 using 3 features. The precision-recall curves, where improved performance is demonstrated if the curve is closer to the upper right-hand corner or has the highest area under the curve (AUC), are shown for the 3 machine learning models for (A) the entire Dataset 2 (N = 20,198 ICU events) and (B) Subset 2 where SpO2 < 97% (N = 3280 ICU events). Data were obtained from the MIMIC-III database v1.4 (https://mimic.physionet.org).

Sensitivity analysis

Hidden hypoxemia, or the discrepancy between peripheral oxygen saturation (SpO2) measurements and the arterial oxygen saturation (SaO2) measured by ABG, was recently identified to occur in 5.3–5.5% of patients in the ICU setting23,24. Hidden hypoxemia, defined as SpO2 ≥ 88% despite an SaO2 ≤ 88%, was observed in all racial and ethnic groups but occurs with higher prevalence in Black patients23,24. We conducted a sensitivity analysis to compare the performance of the machine learning models between patients of self-reported Black and White race in dataset 2. For the regression task, the machine learning algorithms outperformed both the non-linear and log-linear equations among Black patients (RMSE: 88.7, 91.1, 90.1, 117.4, and 95.8 for neural network, linear regression, SVR, log-linear, and non-linear models, respectively). Among the machine learning algorithms, the neural network showed the best performance in Black patients (Supplementary Table e4). In Black patients with SpO2 < 97% (subset 2), the machine learning models likewise outperformed the previously published equations (RMSE: 72.1, 74.4, 71.5, 85.0, and 95.6 for neural network, linear regression, SVR, log-linear, and non-linear models, respectively). The same pattern was observed for White patients in both the whole population and patients with SpO2 < 97% (subset 2) (RMSE in White patients: 84.6, 88.3, 85.9, 117.7, and 91.8; RMSE in White patients with SpO2 < 97%: 67.8, 68.3, 70.5, 72.2, and 81.2 for neural network, linear regression, SVR, log-linear, and non-linear models, respectively).

For the classification task, all machine learning algorithms performed better than or comparably to the previously published equations in Black patients (F1: 0.93, 0.92, 0.93, 0.89, and 0.92 for neural network, logistic regression, SVM, log-linear, and non-linear models, respectively). Of note, the neural network model performed slightly better than the other two machine learning algorithms in Black patients (AUC: 0.78, 0.77, and 0.68 for the neural network, logistic regression, and SVM models, respectively). In Black patients with SpO2 < 97% (subset 2), the machine learning models outperformed the conventional equations (F1: 0.82, 0.82, 0.84, 0.81, and 0.73 for neural network, logistic regression, SVM, log-linear, and non-linear models, respectively). Among White patients, the machine learning models outperformed the conventional equations in both the whole population and patients with SpO2 < 97% (subset 2) (F1 in White patients: 0.92, 0.92, 0.92, 0.87, and 0.91; F1 in White patients with SpO2 < 97%: 0.81, 0.80, 0.81, 0.80, and 0.70 for neural network, logistic regression, SVM, log-linear, and non-linear models, respectively), and the neural network was the preferable model. These findings are summarized in Supplementary Table e5.

Machine learning algorithms show a better accuracy in the validation cohort

We developed an online calculator using the three machine learning algorithms requiring three inputs (SpO2, FiO2, and PEEP): https://dikb.org/pa02-predictor. The calculator was then applied to an independent validation cohort of 133 mechanically ventilated ICU patients to impute the PaO2 in a regression task. The imputed PaO2 was compared to the actual PaO2 obtained by ABG. The accuracy of the machine learning algorithms was compared to the non-linear equation and was reported as the RMSE and adjusted R-squared (Table 5). The neural network and SVM models had lower RMSE than the previously published non-linear equation, demonstrating improved performance in the imputation of PaO2. Adjusted R-squared was also higher in the neural network and SVM models. As an illustration, for a case with SpO2 = 100%, FiO2 = 0.6, and PEEP = 5 cmH2O (observed PaO2/FiO2 = 190), the predicted PaO2 is 203.0, 186.2, and 188.4 using the neural network, SVM, and regression models, respectively, while the conventional non-linear equation gives an estimate of 167 (Table 6).

Table 5 RMSE of the 3-feature machine learning models regression task compared to the published non-linear equation.
Table 6 Examples of comparing four models applied to four cases from different categories of PaO2 (< 150, 150–200, 200–300, > 300).

Discussion

We used the publicly available MIMIC-III database as a derivation cohort to develop and evaluate machine-learning algorithms to impute PaO2 utilizing non-invasive SpO2 in patients who are mechanically ventilated. We tested three machine learning models (neural network, linear regression and SVR) first using seven available clinical variables SpO2, FiO2, PEEP, TV, MAP, temperature, and vasopressor administration to impute the PaO2. We subsequently used a parsimonious model with three clinical variables (SpO2, FiO2 and PEEP) to non-invasively impute PaO2 in both a derivation and validation cohort. The imputation of PaO2 from the regression tasks enabled us to derive the PaO2/FiO2, a clinically meaningful ratio with predictive value1,25. Additionally, we performed a classification task to predict PaO2/FiO2 ≤ 150, a cut off that has been used to capture those patients with moderate to severe respiratory failure in ARDS cohorts11,13 and to guide patient management5. To increase the clinical applicability of our work, we also developed an open-access online calculator to impute the PaO2 using the 3-feature model requiring only non-invasive bedside parameters in mechanically ventilated patients. Our calculator showed improved accuracy in the imputation of the PaO2 when compared to the previously published non-linear equation in both our initial cohort and the validation cohort.

To develop the machine learning algorithms, we initially evaluated clinical variables such as PEEP, TV, MAP, temperature, and vasopressor administration that are easily obtained at the bedside. TV, MAP, temperature and vasopressor use demonstrated a stochastic distribution and did not significantly alter the accuracy of the machine learning algorithms; they were therefore removed to create the 3-feature model (SpO2, FiO2, PEEP). This 3-feature model provides a framework for generalizability using large datasets of mechanically ventilated patients.

We considered other clinical variables such as skin pigmentation, pulse oximeter location, oximeter manufacturer, and vasopressor infusion, as well as laboratory variables such as serum bicarbonate, serum chloride, serum creatinine, and serum sodium. A prior prospective study11 showed that these variables provided negligible improvement in the accuracy of imputation, and they were therefore not included. However, recent studies have shown a discrepancy between peripheral oxygen saturation (SpO2) measurements and the arterial oxygen saturation (SaO2) measured by ABG. This discrepancy, defined as SpO2 ≥ 88% despite an SaO2 ≤ 88% and referred to as hidden hypoxemia, was present in all racial and ethnic groups but showed higher prevalence in Black patients23,24. Because this discrepancy occurs more frequently in Black patients24, we performed a sensitivity analysis, which showed that our machine learning algorithms outperform the previously published equations in both Black and White patients.

Our study shows that a machine learning based method for both the regression and classification task, when applied to the MIMIC-III critical care database, improved the accuracy compared to the previously published non-linear and log-linear imputation methods. As is evidenced by comparing the F1 and discrimination measures in Table 4, the performance improvement was more modest for the classification task in subset 2 where SpO2 < 97%. A possible explanation is that there were fewer ICU events (smaller N) per group in the subset.

Prior studies have examined the relationship between SF and PF ratios for patients with ARDS to determine whether the non-invasive SF ratio can be substituted for the invasively obtained PF ratio11,13,26. Pandharipande et al. studied matched measurements of SpO2 and PaO2 in a heterogeneous population (i.e., patients undergoing general anesthesia and patients with ARDS) to determine the association between SF and PF ratios in order to calculate the respiratory parameter of the SOFA score13. In their study, matched SpO2 and PaO2 values were obtained from two groups of patients: Group 1 comprised the derivation set and was obtained from patients undergoing general anesthesia at a single center, and Group 2 comprised a validation set using data from patients enrolled in a multi-center randomized clinical trial examining low versus high tidal volume for acute respiratory management of ARDS (ARMA)27. All SpO2 values > 97% were excluded from analysis in order to restrict the matched data to values likely to be within the linear range of the oxyhemoglobin dissociation curve. Data from 4728 matched SpO2 and PaO2 measurements showed that the relationship was best described by a log-linear equation with slight variation based upon the level of PEEP. In the setting of a more heterogeneous population, a poorer correlation was noted between SF and PF ratios. Their regression equation13, Log(PF) = 0.48 + 0.78 × Log(SF), yielded an R-square of 0.31.

Additionally, a retrospective analysis of arterial blood gas measurements from three ARDS Network studies compared the performance of non-linear, log-linear and linear imputation methods to derive PaO2 from the SpO212. In all patients (N = 1184), the nonlinear imputation was equivalent to log-linear imputation. However, in those patients with SpO2 < 97% (N = 707), the nonlinear imputation showed lower error than either linear or log-linear equations. A prospective study was subsequently conducted in patients enrolled in the Prevention and Early Treatment of Acute Lung Injury network11 to assess the performance of the non-linear equation to impute PaO2 from the SpO2 and compare it to the prior log-linear and linear equations11,13,26. This study included 1034 arterial blood gases from 703 patients, of which 650 arterial blood gases had matched SpO2 < 97%. The non-linear equation showed lower error and better identified moderate to severe ARDS patients (defined in the study as PaO2/FiO2 ≤ 150) when compared to log-linear or linear imputation methods.

In our study, we similarly found a high degree of variance between SpO2 values and the corresponding measured PaO2 values, which was evident when we formally examined the relationship between SF and PF. This may be attributed to the retrospective nature of the data collection and to the numerous variables8,10,12 that may limit how reliably a non-invasively recorded SpO2 reflects the arterial SaO2. Despite this limitation, the machine learning algorithms performed better on both regression and classification tasks when compared to the log-linear and non-linear published equations.

We used a validation cohort to show improved accuracy for the neural network and kernel-based machine learning algorithms when compared to the previously published non-linear equation. Another strength of our study is the development of an online calculator that can be used to impute the PaO2 from three noninvasive parameters (SpO2, FiO2 and PEEP) and may serve as a tool for future studies in large electronic health record datasets. Additionally, our machine learning models allow for the evaluation of all mechanically ventilated patients with available data rather than narrowing the analysis to a specific population such as those with ARDS. Given the inclusion of all mechanically ventilated patients, a significant number of SpO2 values were > 97% (N = 8510 for seven features and N = 16,918 for three features). While this reduced the accuracy of the imputed PF ratio, particularly above a certain threshold, the machine learning models were applied to the data without a pre-defined restriction placed upon the range of SpO2 values and showed better performance than both the log-linear and non-linear equations on both the regression and classification tasks.

Imputation of PaO2 from SpO2 has been increasingly implemented in clinical and research settings using previously published equations for subjects that do not have invasive ABG measurements readily available. This underscores the need to improve upon existing published equations and the clinical importance of machine learning models proposed. Machine learning models are currently being used to answer numerous clinical questions; these models have substantially impacted different scopes of medicine from early-warning systems for sepsis to imaging diagnostics24. Herein, we proposed three machine learning algorithms which can provide a framework for future investigations. The online calculator, on the other hand, can provide feasible prediction of PF ratio from SF ratio at the bedside for clinicians working in the critical care settings.

We showed that machine learning models outperformed previously published equations in imputing PaO2 from SpO2 in the mechanically ventilated adult population. Consistent with our findings, Sauthier et al. used neural network models to validate a continuous and non-invasive method of hypoxemia estimation in a pediatric population28. They used a convolutional neural network (CNN), a long short-term memory network (LSTM), and a multilayer perceptron (MLP) to impute PaO2, and concluded that bias was lower with neural network models than with mathematical equations.

In summary, any of the tested machine learning models applied to MIMIC-III dataset enabled imputation of PaO2 from the SpO2 with lower error and provided greater accuracy in predicting PaO2/FiO2 ≤ 150 across the entire range of SpO2 examined when compared to that of published equations in two independent cohorts. All machine learning models proposed in this paper outperformed log-linear and non-linear equations. Future work will be required to prospectively test ML algorithms for use in clinical practice. Additionally, our study provides a clinically relevant online calculator for the imputation of the PaO2 from the 3-feature machine learning models. The calculator requires the input of SpO2, FiO2, and PEEP all of which are non-invasive and readily available at the bedside of mechanically ventilated patients.