Abstract
Choosing optimal outcome measures maximizes statistical power, accelerates discovery and improves reliability in early-phase trials. We devised and evaluated a modification to a pragmatic measure of oxygenation function, the \(S/F\) ratio. Because of the ceiling effect in oxyhaemoglobin saturation, \(S/F\) ratio ceases to reflect pulmonary oxygenation function at high \({S}_{p}{O}_{2}\) values. We found that the correlation of \(S/F\) with the reference standard (\({P}_{a}{O}_{2}\)/\({F}_{I}{O}_{2}\) ratio) improves substantially when excluding \({S}_{p}{O}_{2} > 0.94\) and refer to this measure as \(S/{F}_{94}\). Using observational data from 39,765 hospitalised COVID-19 patients, we demonstrate that \(S/{F}_{94}\) is predictive of mortality, and compare the sample sizes required for trials using four different outcome measures. We show that a significant difference in outcome could be detected with the smallest sample size using \(S/{F}_{94}\). We demonstrate that \(S/{F}_{94}\) is an effective intermediate outcome measure in COVID-19. It is a non-invasive measurement, representative of disease severity and provides greater statistical power.
Introduction
Therapeutic research in COVID-19 depends on efficient, accurate assessment of therapeutic candidates in early-stage clinical studies. Efficacy measures should be “clinically meaningful”1 endpoints, such as the WHO ordinal scale2. Intermediate endpoints for early phase trials, or severity measures for observational studies, must be modifiable by therapy and ideally should have a continuous numerical distribution to improve statistical power3. The endpoint should accurately predict the definitive outcome of interest and ideally should also be closely related to the causal pathway to this outcome.
In COVID-19, efficacy measures such as the WHO ordinal scale, duration of hospitalisation, and viral load have been used widely4,5. Both the WHO ordinal scale and various alternative ordinal scales6,7 rely on a complex clinical measure - the level of respiratory support received by a patient - as an indicator of illness severity. Viral load is a valid outcome for antiviral therapy, but it has not been shown to correlate with mortality benefit, and is not directly relevant to the effect of anti-inflammatory treatments8,9,10. In the RECOVERY trial, we identified a need for more powerful intermediate endpoints for early-phase clinical trials.
Impairment of the pulmonary oxygenation function indicates disease progression in COVID-1911, and is strongly predictive of mortality12. Importantly, in COVID-19, failure of pulmonary oxygenation is likely to be mechanistically linked to death: patients at extreme risk of mortality12 have high survival rates if oxygenation is provided by extracorporeal membrane oxygenation (ECMO)13. Pulmonary oxygenation function, together with clinical decision-making and resource availability, determines movement between most of the stages of the WHO Ordinal Scale (WHO scale points 4-9)2. Oxygenation function is a key determinant of efficacy for immunosuppression with corticosteroids in COVID-199. It is likely that pulmonary oxygenation function lies on the causal pathway between the SARS-CoV-2 infection and death for many hospitalised patients.
Peripheral oxygen saturation can be measured easily and non-invasively using a pulse oximeter (formally, arterial oxygen saturation measured by pulse oximetry, rather than direct measurement in blood, is \({S}_{p}{O}_{2}\)). The ratio of \({S}_{a}{O}_{2}\) or \({S}_{p}{O}_{2}\) to inspired fraction of oxygen (\({F}_{I}{O}_{2}\)), known as the \(S/F\) ratio, provides a continuous index of pulmonary oxygenation function which can be calculated without an arterial blood sample. \(S/F\) correlates well with the most widely-used arterial blood-derived measure of oxygenation - \(P/F\) ratio (\({P}_{a}{O}_{2}\)/\({F}_{I}{O}_{2}\))14. \(S/F\) under steady state conditions in humans can range from around \(0.5\) (severe oxygenation defect) to \(4.8\) (perfect oxygenation function). A major limitation of \(S/F\) is the ceiling effect: at high \({S}_{a}{O}_{2}\) values, \({S}_{a}{O}_{2}\) ceases to be dependent on pulmonary oxygenation function, because the blood is close to maximally oxygenated and the relationship between the P/F ratio and the S/F ratio is non-linear15,16. For example, a healthy patient with perfect lungs breathing \(21\%\) oxygen with \({S}_{a}{O}_{2}=0.99\) would have \(S/F=4.7\), but the same patient breathing \(100\%\) oxygen would have \(S/F=0.99\).
In order to improve the accuracy of measurement of lung oxygenation, we propose limiting the ceiling effect in prospective data by protocolising measurement of \({S}_{p}{O}_{2}\) to control high values or in retrospective (opportunistic) analyses by excluding values recorded with \({S}_{p}{O}_{2}\) above a given threshold value. We first evaluated an optimal threshold using both synthetic and real data from arterial blood gas (ABG) samples, predicting that \({S}_{p}{O}_{2}\le 0.94\) provides optimal predictive validity, at a level of induced hypoxia that is broadly acceptable to clinicians.
We defined the \(S/{F}_{94}\) measurement as \(S/F\) measured when \({S}_{a}{O}_{2}\le 0.94\) or \({F}_{I}{O}_{2}=0.21\). In opportunistic data, \(S/{F}_{94}\) can be estimated by excluding \({S}_{p}{O}_{2}\) values above 0.94 unless \({F}_{I}{O}_{2}=0.21\). In prospective, protocolised measurements, \({S}_{a}{O}_{2}\le 0.94\) can be achieved by reducing \({F}_{I}{O}_{2}\) to a minimum of \(0.21\) (the fraction of oxygen in ambient air). Since many patients receive oxygen through devices for which \({F}_{I}{O}_{2}\) is not accurately quantified (e.g. Hudson mask, nasal cannulae), prospective studies measuring \(S/{F}_{94}\) will require a modification of oxygen delivery devices which, in itself, is expected to improve the accuracy of measurement (Appendix: Protocol).
In order to assess \(S/{F}_{94}\) as an outcome measure, we first used a physiological model to evaluate the relationship with a reference standard, the \(P/F\) ratio. Second, we compared the predictive validity of \(S/{F}_{94}\) with several other measures of pulmonary oxygenation function, including the \(S/F\) ratio and the alveolar-arterial difference (A-a). We then used the ISARIC4C dataset to train models for a range of intermediate outcomes, including the WHO ordinal scale and \(S/{F}_{94}\), as predictors of 28-day mortality. We used these models to estimate sample sizes that would be required to see a given treatment effect. Finally, using data from the RECOVERY trial we estimated the expected improvement in the required sample size when using a protocolised, rather than opportunistic, \(S/{F}_{94}\) measurement.
Results
Relationship with the reference standard oxygenation measure (\(P/F\))
There is a consistent pattern in both synthetic (Fig. 1) and real (Supplementary Fig. 1) data: if no maximum cut-off value for \({S}_{a}{O}_{2}\) is used, spuriously low \(S/F\) values are seen in patients with good lung function, reflected in high \(P/F\) values (Fig. 1a, Supplementary Fig. 1a). This is due to the ceiling effect - \({S}_{a}{O}_{2}\) cannot rise above 100%. These misleading values are removed by excluding values with \({S}_{a}{O}_{2}\) above \(94\%\) (Fig. 1b, Supplementary Fig. 1b), which improves the correlation with the reference standard for both synthetic (Spearman \(S/F\): 0.40; \(S/{F}_{94}\): 0.85; Fig. 1d) and real data (Spearman \(r\) \(S/F\): 0.82; \(S/{F}_{94}\): 0.97, Supplementary Fig. 1c).
Predictive validity
In parallel, we assessed the predictive validity of \(S/F\) and \(S/{F}_{94}\). As in our previous work17, we assert that if \(S/{F}_{94}\) is measuring true oxygenation function well, then it should be able to more accurately predict a future event: the \({P}_{a}{O}_{2}\) value in a future arterial blood gas measurement taken from the same patient. We used a pre-existing dataset of unselected ABG result pairs from hospitalised patients, described in detail previously17. We used the median absolute error (MAE) above baseline in \({P}_{a}{O}_{2}\) to quantify predictive validity, with lower error values indicating better performance (Fig. 1c, Supplementary Fig. 2). Across a range of maximum cut-off values for \({S}_{a}{O}_{2}\), the lowest MAE value was obtained at 94% (Fig. 1c; \(S/F\) MAE \(=4.41\) kPa (IQR: \(2.74\)-\(6.63\) kPa); \(S/{F}_{94}\) MAE = \(3.32\) kPa (IQR: \(1.87\)-\(5.26\) kPa), p(MWU) = \(3.7\times {10}^{-18}\)).
Evaluation in ISARIC4C data
A total of 39,765 cases in the ISARIC4C study had \({S}_{p}{O}_{2}\), \({F}_{I}{O}_{2}\) and clinical data available for analysis and met the inclusion criteria (see Methods). Mortality in this population was \(20.8\%\) (Table 1). Since measurement of \(S/{F}_{94}\) was not protocolised in ISARIC4C, measurements were obtained for patients for whom \({S}_{p}{O}_{2}\) happened to be \(\le 0.94\) or who were breathing room air (\({F}_{I}{O}_{2}\) = 0.21), therefore meeting the \(S/{F}_{94}\) definition. The conceptual advantage of \(S/{F}_{94}\) over \(S/F\) is that it offers a closer relationship to the pathophysiological process of interest. This is not expected to be apparent in the distribution of values observed, but rather in the sensitive detection of a real therapeutic effect. For this reason, and because of the risk of selection bias (see Methods), we did not undertake a direct comparison of patients meeting the criteria for \(S/{F}_{94}\) measurement, against patients who do not. Instead, we evaluated \(S/{F}_{94}\) against other commonly used outcome measures.
In order to select the timepoint of \(S/{F}_{94}\), several aspects were taken into account. First, we looked at data availability. Within the ISARIC4C dataset, \(S/F\) values were available for the largest numbers of patients on days \(0\), \(2\), \(5\) and \(8\) from study enrolment. Second, among patients who remained in hospital, the distribution of \(S/{F}_{94}\) values moves over the first few days from study enrolment towards a bimodal pattern with high values in survivors, and low values in non-survivors (Fig. 2a). Finally, in order to make a meaningful comparison with the \(S/{F}_{94}\) at the day of enrolment, we preferred timepoints that were at least a few days after enrolment. We therefore chose day 5 as the primary timepoint for comparison. The distribution of measured \(S/{F}_{94}\) values and assigned maximum/minimum values for those who were discharged/died can be seen in Fig. 2b. On day 5, 1,077 out of 7,312 (14.7%) known \(S/{F}_{94}\) values were an assigned maximum/minimum value due to death/discharge. On day 8, 1,948 out of 6,079 (32.0%) known \(S/{F}_{94}\) values were an assigned maximum/minimum value. A sensitivity analysis excluding these assigned values is in the supplementary material.
An intermediate clinical outcome should have a strong association with a definitive outcome. Using \(28\)-day mortality as the definitive outcome, and including \(S/{F}_{94}\) values on both day \(0\) and day \(5\) as covariates in a logistic regression model, we found a strong inverse association between \(S/{F}_{94}\) on day \(5\) and mortality: an increased risk of mortality at day \(28\) is associated with a lower value of \(S/{F}_{94}\) on day \(5\) (Fig. 2d). The odds ratio (OR) for \(28\)-day mortality is \(0.25\) (\(95\%\) confidence interval \(0.23\)-\(0.28\)), meaning that for a 1 unit increase in \(S/{F}_{94}\) on day 5, the odds of \(28\)-day mortality decrease by \(75\%\).
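To make the odds-ratio interpretation concrete, the following minimal sketch converts a baseline risk to odds, applies a per-unit odds ratio, and converts back to a probability. The function name `apply_odds_ratio` is hypothetical, and using the crude population mortality of 20.8% as the baseline is an illustrative assumption only: the reported OR is conditional on day 0 \(S/{F}_{94}\), so the printed risk is not a result from the study.

```python
def apply_odds_ratio(baseline_risk: float, odds_ratio: float, delta: float = 1.0) -> float:
    """Risk after shifting a predictor by `delta` units, given a per-unit odds ratio."""
    odds = baseline_risk / (1.0 - baseline_risk)  # probability -> odds
    new_odds = odds * odds_ratio ** delta         # odds ratios act multiplicatively
    return new_odds / (1.0 + new_odds)            # odds -> probability


# Illustrative baseline (assumption): crude mortality in the analysed population.
baseline = 0.208
risk_after = apply_odds_ratio(baseline, odds_ratio=0.25, delta=1.0)
print(f"risk {baseline:.3f} -> {risk_after:.3f} for a 1 unit higher day 5 S/F94")
```

Note that a 75% reduction in odds is not the same as a 75% reduction in risk; the two only approximate each other when the baseline risk is small.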
We also compared \(S/{F}_{94}\) with a widely used intermediate outcome, the WHO scale. Since this scale records clinical decisions about therapy that are, in part, determined by the severity of hypoxic lung disease, a close relationship was expected with \(S/{F}_{94}\) (Fig. 2c). The distributions were consistent between patients meeting the inclusion criteria (Fig. 2c) and unselected patients (Supplementary Fig. 5a). The distribution of \(S/{F}_{94}\) values between outcomes at day 28 for patients meeting the inclusion criteria is similar on day 0 and day 5 (Supplementary Fig. 5b and Supplementary Fig. 5c). As expected, when there are no criteria for supplemental oxygen in the first 3 days since admission (unselected patients, Supplementary Fig. 5d and Supplementary Fig. 5e), there is a relative increase of patients with high \(S/{F}_{94}\) values on day 0.
Sample size estimation
Using the observed relationships in ISARIC4C data for eligible patients (see Methods), we quantified effect sizes associated with a \(15\%\) relative risk reduction in mortality for each of the following measures: \(S/{F}_{94}\) at \(5\) and \(8\) days after study enrolment, the WHO ordinal scale at \(5\) and \(8\) days after study enrolment, the proportion of patients who reached a sustained \(1\) or \(2\)-level improvement on the WHO ordinal scale, and a definitive outcome, \(28\)-day mortality. We chose a \(15\%\) relative risk reduction in mortality based on previous power calculations for the RECOVERY trial. We then estimated the sample sizes required to detect these effects with \(80\%\) power at \(2p=0.05\) (2p indicates a two-tailed test).
Some examples of sample size estimations using different inclusion criteria can be found in the supplementary material (Supplementary Table 2 and Supplementary Table 3). We created an online tool, using synthetic data with similar characteristics to the ISARIC4C data (see Methods), to enable users to test any combination of inclusion criteria (age, frailty score and type of respiratory support) and outcome assessment timepoint: https://isaric4c.net/endpoints.
For a \(15\%\) relative reduction in mortality, the required sample size was smallest for \(S/{F}_{94}\) on day \(5\), needing \(722\) patients in each arm (1,444 in total, Table 1). The number of subjects required for \(S/{F}_{94}\) on day \(8\) was higher, with 1,342 subjects in each arm (Supplementary Table 4). For the WHO ordinal scale, 1,666 participants would be required in each arm on day \(5\), or 1,168 on day \(8\) to detect this mortality reduction. Required sample size was larger when 1-level sustained improvement was used as the outcome variable, with 3,378 patients in each arm, and 1,904 subjects in each arm when using 2-level sustained improvement (Table 1). Errors around the point estimates shown in Table 1 are shown in Fig. 3 for a range of effect sizes.
Estimated improvement with protocolised measurement of \(S/{F}_{94}\)
We have developed a protocol for measurement of \(S/{F}_{94}\) (Appendix: Protocol). Protocolising measurements is likely to substantially improve the accuracy of measurements of oxygenation function, firstly by ensuring that an oxygen delivery mode is used for which \({F}_{I}{O}_{2}\) can be accurately quantified (e.g. Venturi systems), and secondly by ensuring that measurements are taken at steady state. Protocolised measurement also permits inclusion of all patients, since \({F}_{I}{O}_{2}\) is decreased until \({S}_{p}{O}_{2}\le 0.94\), to a minimum of \({F}_{I}{O}_{2}=0.21\). We sought to estimate the magnitude of this improvement. We did this by fitting a measurement error model relating opportunistic and protocolised \(S/{F}_{94}\) measurements. A description of the estimation of effect size for the protocolised \(S/{F}_{94}\) measurement can be found in the supplementary methods. Based on this effect size estimate, the required sample size for a protocolised measurement of day 5 \(S/{F}_{94}\) would be around \(988\) subjects in total (Fig. 3).
Discussion
In synthetic (Fig. 1) and real (Supplementary Fig. 1) physiological data, we found that \({S}_{a}{O}_{2}\le 0.94\) is a pragmatic cut-off threshold, lying within a safe range, excluding the majority of obviously misleading values caused by the ceiling effect, and optimising predictive validity. Using observational data from the ISARIC4C study, we demonstrate that \(S/{F}_{94}\) fulfills our initial requirements for an intermediate outcome: a continuous outcome measure that is closely related to mortality and can be modified by therapy3. Testing predicted statistical power for a range of effect sizes in observational data, we found that \(S/{F}_{94}\) is more sensitive than other widely-used outcomes. Comparing both the WHO ordinal scale and \(S/{F}_{94}\) to the definitive outcome of mortality at day 28, we found that the same predicted treatment effect can be detected with fewer patients using \(S/{F}_{94}\), even when measurements are not protocolised.
In a clinical trial setting, where both \({S}_{p}{O}_{2}\) and \({F}_{I}{O}_{2}\) measurement can be protocolised, sensitivity is predicted to improve because protocolised measurements are less noisy and are therefore expected to have a stronger relationship with mortality. Using the SD for protocolised \(S/{F}_{94}\) during the RECOVERY trial, together with the assumed measurement error model relating protocolised and opportunistic \(S/{F}_{94}\) measurements, we predict a substantial additional improvement in statistical power using a protocolised measurement.
Our analyses may underestimate the statistical power of mortality, since time-to-event analyses would be used in most circumstances to maximise statistical power. Due to the large proportion of missing data after day 10, it was not possible to carry out survival modelling in our data. Ideally, we would have performed a mediation analysis with treatment effect, to determine the extent to which the treatment effect on mortality is explained by the intermediate endpoint \(S/{F}_{94}\). However, since there is no \(S/{F}_{94}\) data available from clinical studies showing significant treatment effect, it is not possible to perform this analysis.
Some important sources of error exist in the outcome measures we considered. Firstly, \({S}_{p}{O}_{2}\) and \({F}_{I}{O}_{2}\) are both subject to measurement error, particularly in opportunistic data. For example, estimating \({F}_{I}{O}_{2}\) for patients receiving supplemental oxygen via nasal cannula or simple (Hudson) masks is inaccurate, because the \({F}_{I}{O}_{2}\) is profoundly affected by inspiratory flow rate, which varies between patients. This error would be eliminated by protocolised measurement, which mandates the use of devices delivering a fixed \({F}_{I}{O}_{2}\). Secondly, the position of a patient on the ordinal WHO scale is influenced by both availability of resources and the decision by the patient and the clinician whether to escalate the level of care or provide organ support. This may explain the wide range of \(S/{F}_{94}\) values for patients at the same position on the WHO scale.
There are multiple advantages of using \(S/{F}_{94}\) as an intermediate outcome measure in a phase II clinical trial in hospitalised patients. It is an easy, non-invasive measurement, using near-ubiquitous monitoring equipment. In contrast, daily \({P}_{a}{O}_{2}\) measurements (from an arterial blood sample) are time-consuming, require highly skilled staff, and are burdensome for patients unless an indwelling arterial catheter is present (unusual outside of critical care areas). It is likely that the results of recent and ongoing clinical trials suggesting harm from hyperoxia will, in future, make high \({S}_{a}{O}_{2}\) values a less common finding, particularly in the intensive care unit.
In order to determine the utility of a surrogate outcome in clinical trials, a distinction can be made between “individual level surrogacy” and “trial-level surrogacy”18. If there is an association between the surrogate and the outcome of interest in individual patients, the surrogate works on an individual level. If the effect that a treatment has on the surrogate can be used to predict the causal effect treatment has on the outcome, there is also trial level surrogacy. There are some scenarios, as explained by Buyse and colleagues18, in which there is individual-level surrogacy but no trial level surrogacy, for example due to (known and unknown) confounders, or treatment being dependent on the surrogate (e.g. low \(S/{F}_{94}\) values could lead to additional interventions that influence the outcome, confounding the influence of treatment on outcome). Trial-level surrogacy can be demonstrated with data from (multiple) randomised controlled trials. With the data we have available, we can thus only show individual-level surrogacy and not trial-level surrogacy. Determining whether \(S/{F}_{94}\) is also a trial-level surrogate would be a desirable objective for further studies.
Of the pragmatic endpoints available from routinely collected data, the WHO ordinal scale is the best-performing endpoint. In studies where clinical observations can be obtained, \(S/{F}_{94}\) is a robust measure of pulmonary oxygenation function, and is the best measure to optimise statistical power for comparisons. \(S/{F}_{94}\) is comparable to the \(P/F\) ratio as a measure of pulmonary oxygenation, and superior to the \({S}_{p}{O}_{2}\)/\({F}_{I}{O}_{2}\) ratio. Where protocolised measurements can be obtained, further improvements in statistical power are expected. \(S/{F}_{94}\) may have utility in clinical studies of other disease processes where pulmonary oxygenation failure contributes to mortality, such as influenza and ARDS19.
In conclusion, \(S/{F}_{94}\) is a powerful and robust intermediate endpoint for clinical studies of COVID-19 and may have broad utility in other forms of acute lung injury.
Methods
Ethical approval
All research described in this study complies with all relevant ethical regulations. Ethical approval was given by the South Central-Oxford C Research Ethics Committee in England (13/SC/0149), the Scotland A Research Ethics Committee (20/SS/0028), and the WHO Ethics Review Committee (RPC571 and RPC572, April 2013). In England and Wales, consent was not required for the collection of depersonalised routine healthcare research data. In Scotland, a waiver for consent was given by the Public Benefit and Privacy Panel.
Relationship to the reference standard (\(P/F\) ratio)
The \(P/F\) ratio is the oxygenation measure used in diagnostic criteria for acute respiratory failure, and is used in our analysis as the reference standard20. We evaluated the relationship between \(S/F\) and \(P/F\) in two datasets: a synthetic dataset of 1,529,176 predictions covering a wide range of possible physiological variation, generated by a mathematical model of oxygen delivery written in Python (available at https://github.com/baillielab/oxygen_delivery) and reported previously17, and 72,457 unselected arterial blood gas results from a critically ill population17. Taking \(P/F\) to be our reference standard, we evaluated \(S/F\) at different thresholds in both synthetic and real data.
Predictive validity
We considered the predictive validity of \(S/F\) and \(S/{F}_{94}\) compared to \(P/F\) and two other measures of oxygenation function: the A-a, and effective shunt fraction (ES)17.
Predictive validity quantifies the extent to which a clinical measurement predicts an unseen event. The aim is not to optimise prediction, but to test the extent to which a measurement is describing a real feature of the patient’s illness21. In this case, we contend that a measure that accurately describes pulmonary oxygenation function will accurately predict \({P}_{a}{O}_{2}\) after a change is made to \({F}_{I}{O}_{2}\). Using the same pre-existing dataset of ABG results from critically ill patients as in our previous study17, we used this approach to assess the validity of \(S/F\) and \(S/{F}_{94}\).
Briefly, in pairs of arterial blood gas results taken from the same patient <3 h apart, in which \({F}_{I}{O}_{2}\) was decreased in the later sample, we used various measures of oxygenation (A-a, \(P/F\), ES, \(S/F\)) in the first ABG to predict the \({P}_{a}{O}_{2}\) in the second sample and compared these predicted values with the \({P}_{a}{O}_{2}\) that was measured in the second sample. Predictive validity was quantified by the median absolute error (MAE). A baseline value, showing the difference between ABG results for matched pairs in which \({F}_{I}{O}_{2}\) did not change, is provided to contextualise the MAE results as a reasonable minimum error value. Results are presented as difference in MAE from this baseline. The Mann-Whitney U-test (MWU) was used for the comparison of MAE difference from baseline.
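The error metric described above can be sketched as follows. This is a minimal illustration, not the analysis code: the function names are hypothetical, all \({P}_{a}{O}_{2}\) values (in kPa) are invented, and the baseline is assumed to be the median absolute difference between paired samples in which \({F}_{I}{O}_{2}\) did not change.

```python
from statistics import median


def median_absolute_error(predicted, observed):
    """Median of |predicted - observed| over paired values."""
    return median(abs(p - o) for p, o in zip(predicted, observed))


def mae_above_baseline(predicted, observed, unchanged_first, unchanged_second):
    """MAE of the predictions minus the baseline error.

    Baseline (assumption): median absolute difference in PaO2 between paired
    samples with no FiO2 change, taken as a reasonable minimum error.
    """
    baseline = median_absolute_error(unchanged_first, unchanged_second)
    return median_absolute_error(predicted, observed) - baseline


# Hypothetical PaO2 values in kPa, for illustration only.
predicted_pao2 = [10.0, 12.5, 9.0, 14.0]   # predicted from the first ABG
measured_pao2 = [11.0, 10.0, 9.5, 18.0]    # measured in the second ABG
unchanged_first = [10.0, 12.0, 9.0]        # pairs with unchanged FiO2
unchanged_second = [10.5, 11.0, 9.2]

excess_error = mae_above_baseline(
    predicted_pao2, measured_pao2, unchanged_first, unchanged_second
)
print(f"MAE above baseline: {excess_error:.2f} kPa")
```

A lower value indicates that the oxygenation measure taken from the first sample carries more information about the second.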
Evaluation in ISARIC4C data
Inclusion criteria
All subjects were part of the ISARIC Coronavirus Clinical Characterisation Consortium (ISARIC4C) WHO Clinical Characterisation Protocol UK (CCP-UK), a study in England, Wales, and Scotland prospectively collecting data from patients hospitalised with SARS-CoV-2 infection since the start of the pandemic.
In order to focus our assessment on the subset of patients with hypoxaemic respiratory failure that is potentially modifiable by anti-inflammatory treatment, we repeated all analyses in subjects aged \(20\)-\(75\) who required supplementary oxygen therapy within 3 days of hospital admission, subjects aged \(20\)-\(75\) that were oxygen dependent on the day of admission, and subjects aged \(20\)-\(75\) without criteria for oxygen dependency. All included patients had \({S}_{p}{O}_{2}\) and \({F}_{I}{O}_{2}\) data available. While \({S}_{p}{O}_{2}\) is typically represented as a percentage, for \(S/{F}_{94}\) it is used as a fraction, with values ranging from 0-1.
Estimation of \(S/{F}_{94}\) in observational data
The \(S/F\) ratio was calculated by dividing \({S}_{p}{O}_{2}\) by \({F}_{I}{O}_{2}\) (with both as fractions, taking values between 0 and 1). For this evaluation, \(S/{F}_{94}\) was defined as an opportunistic measurement in which \({S}_{p}{O}_{2}\le 0.94\), or the patient was receiving no supplementary oxygen (\({F}_{I}{O}_{2}=0.21\)).
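The opportunistic definition above can be written as a short filter. The sketch below is illustrative only; the names `sf_ratio` and `sf94` and the input validation are assumptions, not taken from the study code.

```python
import math
from typing import Optional

ROOM_AIR_FIO2 = 0.21   # fraction of oxygen in ambient air
SPO2_THRESHOLD = 0.94  # upper SpO2 limit for a valid S/F94 measurement


def sf_ratio(spo2: float, fio2: float) -> float:
    """S/F ratio with SpO2 and FiO2 both expressed as fractions."""
    if not 0.0 < spo2 <= 1.0:
        raise ValueError("SpO2 must be a fraction in (0, 1]")
    if not (ROOM_AIR_FIO2 <= fio2 <= 1.0 or math.isclose(fio2, ROOM_AIR_FIO2)):
        raise ValueError("FiO2 must be a fraction between 0.21 and 1")
    return spo2 / fio2


def sf94(spo2: float, fio2: float) -> Optional[float]:
    """S/F94: the S/F ratio if SpO2 <= 0.94 or the patient is on room air, else None."""
    if spo2 <= SPO2_THRESHOLD or math.isclose(fio2, ROOM_AIR_FIO2):
        return sf_ratio(spo2, fio2)
    return None  # excluded: the ceiling effect makes this value uninformative


print(sf94(0.99, 0.21))  # room air: retained despite high SpO2
print(sf94(0.99, 1.00))  # high SpO2 on supplemental oxygen: excluded
print(sf94(0.90, 0.40))  # SpO2 at or below threshold: retained
```

Returning `None` rather than a number keeps excluded measurements distinguishable from low values in downstream analysis.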
Importantly, the retrospectively-defined subgroup of patients meeting the \(S/{F}_{94}\) criteria is not representative of all patients since there was an excess of patients who were not receiving respiratory support, with slight excess mortality, in the \(S/{F}_{94}\) group (Supplementary Table 1). This indicates at least two mechanisms of selection bias, acting in opposite directions, and precluding a direct comparison. Firstly, patients who have high blood oxygen levels on relatively little supplementary oxygen are excluded from the \(S/{F}_{94}\) group; by definition these patients have relatively mild disease. Secondly, the group in whom \(S/{F}_{94}\) could be measured includes patients who receive supplemental oxygen, and fail to reach adequate \({S}_{p}{O}_{2}\) values, but are not escalated to a higher level of respiratory support; this is a frail and multimorbid population with very severe disease.
\(S/{F}_{94}\) was calculated at baseline (day 0) and on day 5 and day 8 from study enrolment. There is expected to be differential missingness between \(S/{F}_{94}\) and mortality: \({S}_{p}{O}_{2}\) and \({F}_{I}{O}_{2}\) data are only available for a proportion of cases, whereas outcome data are well-recorded. Patients who died or were discharged on a given day and had a missing value for \(S/{F}_{94}\) were assigned values 0.5 (severe oxygenation defect) and 4.76 (perfect oxygenation), respectively. However, death/discharge was more likely to be recorded than \(S/{F}_{94}\), and this could introduce bias into our analysis. We addressed this by estimating the proportion of patients for whom \(S/{F}_{94}\) measurements were available among those who had not died or been sent home by a given day. We then resampled those who had died or been discharged according to these proportions. For example, if on day 5, 20\(\%\) of those who had not died or been discharged had \(S/{F}_{94}\) measurements available, we randomly resampled 20\(\%\) of those who had died or been discharged by then, assigning \(S/{F}_{94}=0.5\) to those who died, and \(S/{F}_{94}=4.76\) to those who were discharged.
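The resampling step can be sketched as below. This is a simplified illustration under stated assumptions: the function name, the patient identifiers and the use of sampling without replacement with a rounded count are all hypothetical choices, and the study code may differ.

```python
import random

DEATH_VALUE = 0.5       # assigned S/F94 for patients who died
DISCHARGE_VALUE = 4.76  # assigned S/F94 for patients who were discharged


def resample_terminal_outcomes(in_hospital_sf94, died_ids, discharged_ids, rng):
    """Assign boundary values to a subsample of patients who died or were discharged.

    in_hospital_sf94: S/F94 values (None if missing) for patients still in hospital.
    The subsample fraction equals the availability of S/F94 among those patients.
    """
    if not in_hospital_sf94:
        raise ValueError("need at least one patient still in hospital")
    observed = sum(value is not None for value in in_hospital_sf94)
    availability = observed / len(in_hospital_sf94)

    assigned = {}
    for ids, value in ((died_ids, DEATH_VALUE), (discharged_ids, DISCHARGE_VALUE)):
        k = round(availability * len(ids))  # assumption: rounded count, no replacement
        for patient_id in rng.sample(ids, k):
            assigned[patient_id] = value
    return availability, assigned


# Hypothetical day 5 data: 2 of 10 in-hospital patients have a measurement (20%).
in_hospital = [2.1, None, None, 3.0, None, None, None, None, None, None]
died = list(range(100, 110))        # 10 hypothetical patient IDs
discharged = list(range(200, 220))  # 20 hypothetical patient IDs

availability, assigned = resample_terminal_outcomes(
    in_hospital, died, discharged, random.Random(0)
)
print(availability, len(assigned))
```

Matching the sampling fraction to the availability among patients still in hospital keeps deaths and discharges from being over-represented relative to measured values.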
Association between \(S/{F}_{94}\) and 28-day mortality
Two key assumptions underlie the use of \(S/{F}_{94}\) as an intermediate endpoint. Firstly, that pulmonary oxygenation function predicts mortality in COVID-19, and secondly, that \(S/{F}_{94}\) accurately reflects the pulmonary oxygenation function. If either of these assumptions is violated, then a strong relationship between \(S/{F}_{94}\) and subsequent mortality would not be expected.
To evaluate this association, a logistic regression model was developed with 28-day all-cause mortality as the dependent variable and \(S/{F}_{94}\) measured on day 0 and day 5 as two separate covariates. We included both \(S/{F}_{94}\) on day 0 and day 5 due to the strong relationship between \(S/{F}_{94}\) on day 0 and \(S/{F}_{94}\) on days further in the disease trajectory. Linear dependence of log-odds on \(S/{F}_{94}\) measured on day 0 and day 5 was assessed both by visual inspection and using model selection criteria including the Bayesian Information Criterion (BIC) to compare to a restricted splines model. Finally, predicted models were made to assess the absolute change in risk of mortality with a change in \(S/{F}_{94}\).
Sample size calculations
We compared the sample sizes required for a range of different outcome measures (\(S/{F}_{94}\), WHO ordinal scale, sustained improvement at day 28 and 28-day mortality). For the intermediate endpoints, we estimated the treatment effect associated with a \(15\%\) relative reduction in mortality. Below we give brief descriptions of the effect size calculations for the different outcome measures. All calculations assumed a \(1\):\(1\) allocation of participants between treatment and control groups and are based on having \(80\%\) power at 2p\(=0.05\) to detect the stated treatment effect. Details on effect size estimation can be found in the supplementary material.
Quantifying uncertainty
We bootstrapped \(95\%\) confidence intervals for the effect size, and then used these to calculate \(95\%\) confidence intervals for the required sample size, using the fact that the two quantities are monotonically related.
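The idea can be sketched as follows: because required sample size decreases monotonically as the effect size grows, the endpoints of a bootstrap interval for the effect size map directly onto the endpoints of an interval for the sample size, in reverse order. The sketch assumes a percentile bootstrap of a mean effect and a normal-approximation sample size formula; both are illustrative choices, and all data are simulated.

```python
import math
import random
from statistics import NormalDist, mean


def n_per_arm(delta, sd, alpha=0.05, power=0.80):
    """Per-arm size for a two-sample comparison of means (normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * (z * sd / abs(delta)) ** 2)


def bootstrap_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    boot = sorted(mean(rng.choices(values, k=len(values))) for _ in range(n_boot))
    lower = boot[int((1 - level) / 2 * (n_boot - 1))]
    upper = boot[int((1 + level) / 2 * (n_boot - 1))]
    return lower, upper


# Simulated per-subject effect sizes and outcome SD (hypothetical values).
data_rng = random.Random(1)
changes = [data_rng.gauss(0.5, 0.2) for _ in range(200)]
outcome_sd = 1.0

effect_lo, effect_hi = bootstrap_ci(changes)
# Monotone mapping: the smaller effect gives the larger sample size bound.
n_upper = n_per_arm(effect_lo, outcome_sd)
n_lower = n_per_arm(effect_hi, outcome_sd)
print(f"effect CI: ({effect_lo:.3f}, {effect_hi:.3f}); n per arm CI: ({n_lower}, {n_upper})")
```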
Continuous variables (\(S/{F}_{94}\))
We fitted a logistic regression with mortality at day 28 as the dependent variable, and age, sex, \(S/{F}_{94}\) on day 0 (baseline) and day 5 (or day 8) as independent variables. We used this to calculate the predicted probability of mortality, and the change in \(S/{F}_{94}\) associated with a relative reduction in predicted mortality of \(15\%\), for each subject. Finally, we took the mean to find the average change in day 5 \(S/{F}_{94}\) that is associated with a \(15\%\) reduction in mortality across the sample. This was the target treatment effect in the clinical trial. We calculated the sample size required to see this treatment effect with a given level of power using a two-sample t-test with ANCOVA correction for the correlation between \(S/{F}_{94}\) on day 0 and day 5 (ref. 22).
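A common way to express this kind of calculation is to compute the two-sample t-test sample size and multiply it by \(1-{\rho }^{2}\), where \(\rho\) is the correlation between the baseline and follow-up measurement. The sketch below uses that form with a normal approximation; it is an assumption that this matches the cited method in detail, and the inputs shown are hypothetical rather than estimates from ISARIC4C.

```python
import math
from statistics import NormalDist


def n_per_arm_ancova(delta, sd, rho, alpha=0.05, power=0.80):
    """Per-arm sample size for a difference in means, adjusted for a baseline covariate.

    delta: target difference in the outcome between arms
    sd:    standard deviation of the outcome
    rho:   correlation between baseline and follow-up outcome
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    n_ttest = 2 * ((z_alpha + z_beta) * sd / delta) ** 2
    return math.ceil((1 - rho ** 2) * n_ttest)     # ANCOVA variance reduction


# Hypothetical inputs for illustration.
print(n_per_arm_ancova(delta=0.5, sd=1.0, rho=0.0))  # no baseline correlation
print(n_per_arm_ancova(delta=0.5, sd=1.0, rho=0.5))  # moderate correlation
```

The stronger the correlation between day 0 and day 5 values, the greater the saving from adjusting for baseline.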
Ordinal variables (WHO scale)
Values for the WHO ordinal scale were derived using information about oxygen support and mortality. Possible values in hospitalised patients range between 4 and 10 (ref. 2).
WHO scale - absolute value
We fitted a proportional odds model with the WHO ordinal scale as the dependent variable, and age and sex as independent variables. We used this model to estimate the odds ratio associated with a \(15\%\) relative reduction in mortality23.
WHO scale - sustained improvement
We derived binary variables for sustained 1- or 2-level improvement on the WHO scale. To be considered sustained, an improvement had to be maintained until discharge or until day 28. We fitted a logistic regression model with mortality at day 28 as the dependent variable, and age, sex and sustained 1- or 2-level improvement on the WHO scale as independent variables. We used this model to estimate the difference in proportion of people who had a sustained improvement on the WHO ordinal scale that was associated with a \(15\%\) reduction in risk of mortality. We then calculated required sample size for this outcome using a two-sample test for proportions with a continuity correction24. Only patients who had WHO ordinal scale values on at least two separate days were included in this analysis.
Mortality
To compare these alternative outcome measures with a definitive outcome (mortality), we calculated the number of participants needed if 28-day mortality were the outcome measure, using a two-sample test for proportions with continuity correction.
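A two-sample test for proportions with continuity correction, as used for both the sustained-improvement and mortality outcomes, can be sketched as below. The sketch assumes the widely used Fleiss-type correction applied to the normal-approximation sample size with equal group sizes; the source does not give the exact formula, and the mortality figures are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group_proportions(p1, p2, alpha=0.05, power=0.8):
    """Return (uncorrected, continuity-corrected) per-group sample sizes
    for a two-sided comparison of two independent proportions."""
    z = NormalDist()
    za, zb = z.inv_cdf(1 - alpha / 2), z.inv_cdf(power)
    d = abs(p1 - p2)
    p_bar = (p1 + p2) / 2
    # Normal-approximation sample size without continuity correction.
    n0 = (za * math.sqrt(2 * p_bar * (1 - p_bar))
          + zb * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / d ** 2
    # Continuity correction inflates n0, most noticeably for small differences.
    n_cc = n0 / 4 * (1 + math.sqrt(1 + 4 / (n0 * d))) ** 2
    return math.ceil(n0), math.ceil(n_cc)


# Hypothetical control-arm mortality with a 15% relative reduction.
p_control = 0.25
p_treated = 0.85 * p_control
print(n_per_group_proportions(p_control, p_treated))
```

The same function applies to the sustained-improvement outcome by passing the two proportions of patients with sustained improvement instead of the two mortality rates.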
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
Source data are provided for Fig. 1 and Supplementary Figs. 1 and 2. The dataset used and analysed in this study contains clinical data about individuals and is available after a data access request. Details of the request procedure can be found at https://odap.ac.uk/researchers. Data access requests will be reviewed on the basis of scientific merit and validity, the proposed timeline, ethical considerations and the available resources. Access requests can be sent to odap@ed.ac.uk. A reply to a data access request will be provided within six weeks from the date of the request. Depending on the requested data, there may be additional steps before data can be published, such as agreement from all contributors. All data supporting the findings in this manuscript are available in the main text, supplementary material and source data, or from the corresponding author upon request. A synthetically generated dataset, containing the same key properties as the original dataset, is available for sample size calculations at https://isaric4c.net/endpoints. Source data are provided with this paper.
Code availability
The code used for the analyses is available on GitHub at https://github.com/baillielab/SF94.
References
U.S. Food and Drug Administration. COVID-19: Developing Drugs and Biological Products for Treatment or Prevention. Guidance for Industry (2020). FDA-2020-D-1370. https://www.fda.gov/media/167274/download
WHO Working Group on the Clinical Characterisation and Management of COVID-19 infection. A minimal common outcome measure set for COVID-19 clinical research. Lancet Infect. Dis. 20, 192–197 (2020).
Dodd, L. E. et al. Endpoints for randomized controlled clinical trials for COVID-19 treatments. Clin. Trials 17, 472–482 (2020).
Horby, P. et al. Effect of hydroxychloroquine in hospitalized patients with covid-19. N. Engl. J. Med. 383, 2030–2040 (2020).
Beigel, J. H. et al. Remdesivir for the treatment of covid-19 - final report. N. Engl. J. Med. 383, 1813–1826 (2020).
Davoudi-Monfared, E. et al. A Randomized Clinical Trial of the Efficacy and Safety of Interferon β-1a in Treatment of Severe COVID-19. Antimicrob. Agents Chemother. 64, e01061–20 (2020).
Cao, B. et al. A trial of lopinavir-ritonavir in adults hospitalized with severe covid-19. N. Engl. J. Med. 382, 1787–1799 (2020).
Abani, O. et al. Tocilizumab in patients admitted to hospital with COVID-19 (RECOVERY): A randomised, controlled, open-label, platform trial. Lancet 397, 1637–1645 (2021).
Horby, P. et al. Dexamethasone in hospitalized patients with Covid-19: preliminary report. N. Engl. J. Med. https://doi.org/10.1056/NEJMoa2021436 (2020).
Abani, O. et al. Baricitinib in patients admitted to hospital with COVID-19 (RECOVERY): A randomised, controlled, open-label, platform trial and updated meta-analysis. Lancet 400, 359–368 (2022).
Docherty, A. B. et al. Features of 20 133 UK patients in hospital with covid-19 using the ISARIC WHO Clinical Characterisation Protocol: prospective observational cohort study. BMJ 369, m1985 (2020).
Knight, S. R. et al. Risk stratification of patients admitted to hospital with covid-19 using the ISARIC WHO Clinical Characterisation Protocol: Development and validation of the 4C Mortality Score. BMJ 370, m3339 (2020).
Barbaro, R. P. et al. Extracorporeal membrane oxygenation support in COVID-19: An international cohort study of the Extracorporeal Life Support Organization registry. Lancet 396, 1071–1078 (2020).
Kwack, W. G. et al. Evaluation of the SpO2/FiO2 ratio as a predictor of intensive care unit transfers in respiratory ward patients for whom the rapid response system has been activated. PLoS ONE 13, e0201632 (2018).
Brown, S. M. et al. Nonlinear Imputation of Pao2/Fio2 From Spo2/Fio2 Among Patients With Acute Respiratory Distress Syndrome. Chest 150, 307–313 (2016).
Brown, S. M. et al. Nonlinear Imputation of PaO2/FIO2 From SpO2/FIO2 Among Mechanically Ventilated Patients in the ICU: A Prospective, Observational Study. Crit. Care Med. 45, 1317–1324 (2017).
Chang, E. M., Bretherick, A., Drummond, G. B. & Baillie, J. K. Predictive validity of a novel non-invasive estimation of effective shunt fraction in critically ill patients. Intensive Care Med. Exp. 7, 49 (2019).
Buyse, M., Saad, E. D., Burzykowski, T., Regan, M. M. & Sweeney, C. S. Surrogacy beyond prognosis: The importance of “trial-level” surrogacy. Oncologist 27, 266–271 (2022).
Saha, R. et al. Estimating the attributable fraction of mortality from acute respiratory distress syndrome to inform enrichment in future randomised clinical trials. Thorax https://doi.org/10.1136/thorax-2023-220262 (2023).
Carvalho, E. B. de et al. Rationale and limitations of the SpO2/FiO2 as a possible substitute for PaO2/FiO2 in different preclinical and clinical scenarios. Rev. Bras. Ter. Intensiva 34, 185–196 (2022).
Ferguson, N. D. et al. The Berlin definition of ARDS: An expanded rationale, justification, and supplementary material. Intensive Care Med. 38, 1573–1582 (2012).
Borm, G. F., Fransen, J. & Lemmens, W. A. J. G. A simple sample size formula for analysis of covariance in randomized clinical trials. J. Clin. Epidemiol. 60, 1234–1238 (2007).
Harrell, F. E. Jr, with contributions from Charles Dupont and many others. Hmisc: Harrell Miscellaneous. R package version 4.4-2. https://CRAN.R-project.org/package=Hmisc (2020).
Wittes, J. Sample size calculations for randomized controlled trials. Epidemiol. Rev. 24, 39–53 (2002).
Acknowledgements
This work uses data provided by patients and collected by the NHS as part of their care and support. We are extremely grateful for the front-line NHS clinical and research staff and volunteer medical students who collected this data in challenging circumstances, and the generosity of the participants and their families for their individual contributions in these difficult times. We also acknowledge the support of Jeremy J Farrar (Wellcome Trust) and Nahoko Shindo (WHO). For the purpose of open access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission. JKB gratefully acknowledges funding support from a Wellcome Trust Senior Research Fellowship (223164/Z/21/Z), UKRI grants MC_PC_20004, MC_PC_19025, MC_PC_1905, MRNO2995X/1, and MC_PC_20029, Sepsis Research (Fiona Elizabeth Agnew Trust), a BBSRC Institute Strategic Programme Grant to the Roslin Institute (BB/P013732/1, BB/P013759/1), and the UK Intensive Care Society. ISARIC4C work was supported by the National Institute for Health Research (NIHR), the Medical Research Council [MC_PC_19059] and by the NIHR Health Protection Research Unit (HPRU) in Emerging and Zoonotic Infections at University of Liverpool in partnership with Public Health England (PHE), in collaboration with Liverpool School of Tropical Medicine and the University of Oxford [200907], NIHR HPRU in Respiratory Infections at Imperial College London with PHE [200927], Wellcome Trust and Department for International Development [215091/Z/18/Z], and the Bill and Melinda Gates Foundation [OPP1209135], and Liverpool Experimental Cancer Medicine Centre (C18616/A25153), NIHR Biomedical Research Centre at Imperial College London [IS-BRC-1215-20013], EU Platform for European Preparedness Against (Re-) emerging Epidemics (PREPARE) [FP7 project 602525] and NIHR Clinical Research Network for providing infrastructure support for this research.
Author information
Contributions
J.K.B. and P.H. conceived the study. J.K.B., M.G.S. and P.J.M.O. acquired funding. J.K.B., P.W.H., F.M., N.Y., J.D., A.D.B., J.M., J.S.N.-V.-T. and M.G.S. designed the analysis. E.M.H., R.G., E.S., A.B.D., D.H., K.R., N.S. and N.L. provided guidance on methodology and interpretation. M.C.S., S.K., A.B.B., N.S. and J.K.B. did the formal analysis. J.S.B. and S.K. created the website. E.H., A.B.D., G.H.G., N.L., N.S. and J.K.B. supervised the work. M.C.S., S.K. and J.K.B. wrote the original draft of the manuscript. All authors reviewed and gave feedback on the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
JKB and ABD report grants from the UK Department of Health and Social Care (DHSC), during the conduct of the study, and grants from Wellcome Trust. PJMO reports personal fees from consultancies (GlaxoSmithKline, Janssen, Bavarian Nordic, Pfizer, and Cepheid) and from the European Respiratory Society, grants from MRC, MRC Global Challenge Research Fund, the EU, NIHR BRC, MRC–GlaxoSmithKline, Wellcome Trust, NIHR (HPRU in Respiratory Infection), and is an NIHR senior investigator outside the submitted work. PJMO’s role as President of the British Society for Immunology was unpaid, but travel and accommodation at some meetings were provided by the Society. JKB reports grants from MRC. MGS reports grants from DHSC, NIHR UK, MRC, HPRU in Emerging and Zoonotic Infections, and University of Liverpool, during the conduct of the study, and is chair of the scientific advisory board and a minority shareholder at Integrum Scientific, outside the submitted work. JSN-V-T was seconded to the Department of Health and Social Care, England (DHSC), October 2017-March 2022. The views expressed in this manuscript are those of its authors and not necessarily those of DHSC. JSN-V-T reports personal fees and travel and accommodation from AstraZeneca. NS reports grants from Boehringer Ingelheim and Novo Nordisk outside the submitted work. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Tommaso Mauri, David Leaf, Jean-Louis Vincent and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Swets, M.C., Kerr, S., Scott-Brown, J. et al. Evaluation of pragmatic oxygenation measurement as a proxy for Covid-19 severity. Nat Commun 14, 7374 (2023). https://doi.org/10.1038/s41467-023-42205-6
DOI: https://doi.org/10.1038/s41467-023-42205-6