Introduction

Therapeutic research in COVID-19 depends on efficient, accurate assessment of therapeutic candidates in early-stage clinical studies. Efficacy measures should be “clinically meaningful”1 endpoints, such as the WHO ordinal scale2. Intermediate endpoints for early phase trials, or severity measures for observational studies, must be modifiable by therapy and ideally should have a continuous numerical distribution to improve statistical power3. The endpoint should accurately predict the definitive outcome of interest and ideally should also be closely related to the causal pathway to this outcome.

In COVID-19, efficacy measures such as the WHO ordinal scale, duration of hospitalisation, and viral load have been used widely4,5. Both the WHO ordinal scale and various alternative ordinal scales6,7, rely on a complex clinical measure - the level of respiratory support received by a patient - as an indicator of illness severity. Viral load is a valid outcome for antiviral therapy, but it has not been shown to correlate with mortality benefit, and is not directly relevant to the effect of anti-inflammatory treatments8,9,10. In the RECOVERY trial, we identified a need for more powerful intermediate endpoints for early-phase clinical trials.

Impairment of the pulmonary oxygenation function indicates disease progression in COVID-1911, and is strongly predictive of mortality12. Importantly, in COVID-19, failure of pulmonary oxygenation is likely to be mechanistically linked to death: patients at extreme risk of mortality12 have high survival rates if oxygenation is provided by extracorporeal membrane oxygenation (ECMO)13. Pulmonary oxygenation function, together with clinical decision-making and resource availability, determines movement between most of the stages of the WHO Ordinal Scale (WHO scale points 4-9)2. Oxygenation function is a key determinant of efficacy for immunosuppression with corticosteroids in COVID-199. It is likely that pulmonary oxygenation function lies on the causal pathway between the SARS-CoV-2 infection and death for many hospitalised patients.

Peripheral oxygen saturation can be measured easily and non-invasively using a pulse oximeter (formally, arterial oxygen saturation measured by pulse oximetry, rather than direct measurement in blood, is \({S}_{p}{O}_{2}\)). The ratio of \({S}_{a}{O}_{2}\) or \({S}_{p}{O}_{2}\) to inspired fraction of oxygen (\({F}_{I}{O}_{2}\)), known as the \(S/F\) ratio, provides a continuous index of pulmonary oxygenation function which can be calculated without an arterial blood sample. \(S/F\) correlates well with the most widely-used arterial blood-derived measure of oxygenation - \(P/F\) ratio (\({P}_{a}{O}_{2}\)/\({F}_{I}{O}_{2}\))14. \(S/F\) under steady state conditions in humans can range from around \(0.5\) (severe oxygenation defect) to \(4.8\) (perfect oxygenation function). A major limitation of \(S/F\) is the ceiling effect: at high \({S}_{a}{O}_{2}\) values, \({S}_{a}{O}_{2}\) ceases to be dependent on pulmonary oxygenation function, because the blood is close to maximally oxygenated and the relationship between the P/F ratio and the S/F ratio is non-linear15,16. For example, a healthy patient with perfect lungs breathing \(21\%\) oxygen with \({S}_{a}{O}_{2}=0.99\) would have \(S/F=4.7\), but the same patient breathing \(100\%\) oxygen would have \(S/F=0.99\).

In order to improve the accuracy of measurement of lung oxygenation, we propose limiting the ceiling effect in prospective data by protocolising measurement of \({S}_{p}{O}_{2}\) to control high values or in retrospective (opportunistic) analyses by excluding values recorded with \({S}_{p}{O}_{2}\) above a given threshold value. We first evaluated an optimal threshold using both synthetic and real data from arterial blood gas (ABG) samples, predicting that \({S}_{p}{O}_{2}\le 0.94\) provides optimal predictive validity, at a level of induced hypoxia that is broadly acceptable to clinicians.

We defined the \(S/{F}_{94}\) measurement as \(S/F\) measured when \({S}_{a}{O}_{2}\le 0.94\) or \({F}_{I}{O}_{2}=0.21\). In opportunistic data, \(S/{F}_{94}\) can be estimated by excluding \({S}_{p}{O}_{2}\) values above 0.94 unless \({F}_{I}{O}_{2}=0.21\). In prospective, protocolised measurements, \({S}_{a}{O}_{2}\le 0.94\) can be achieved by reducing \({F}_{I}{O}_{2}\) to a minimum of \(=0.21\) (the fraction of oxygen in ambient air). Since many patients receive oxygen through devices for which \({F}_{I}{O}_{2}\) is not accurately quantified (e.g. Hudson mask, nasal cannulae), prospective studies measuring \(S/{F}_{94}\) will require a modification of oxygen delivery devices which, in itself, is expected to improve the accuracy of measurement (Appendix: Protocol).

In order to assess \(S/{F}_{94}\) as an outcome measure, we first used physiological model to evaluate the relationship with a reference standard, the \(P/F\) ratio. Second, we compared the predictive validity of \(S/{F}_{94}\) with several other measures of pulmonary oxygenation function, including the \(S/F\) ratio and the alveolar-arterial difference (A-a). We then used the ISARIC4C dataset to train models for a range of intermediate outcomes, including the WHO ordinal scale and \(S/{F}_{94}\), as predictors of 28-day mortality. We used these models to estimate sample sizes that would be required to see a given treatment effect. Finally, using data from the RECOVERY trial we estimated the expected improvement in the required sample size when using a protocolised, rather than opportunistic, \(S/{F}_{94}\) measurement.

Results

Relationship with the reference standard oxygenation measure (\({{{\boldsymbol{P}}}}{{{\boldsymbol{/}}}}{{{\boldsymbol{F}}}}\))

There is a consistent pattern in both synthetic (Fig. 1) and real (Supplementary Fig. 1) data: if no maximum cut-off value for \({S}_{a}{O}_{2}\) is used, spuriously low \(S/F\) values are seen in patients with good lung function, reflected in high \(P/F\) values (Fig. 1a, Supplementary Fig. 1a). This is due to the ceiling effect - \({S}_{a}{O}_{2}\) cannot rise above 100%. These misleading values are removed by excluding values with \({S}_{a}{O}_{2}\) above \(94\%\) (Fig. 1b, Supplementary Fig. 1b), which improves the correlation with the reference standard for both synthetic (Spearman \(S/F\): 0.40; \(S/{F}_{94}\): 0.85; Fig. 1d) and real data (Spearman \(r\) \(S/F\): 0.82; \(S/{F}_{94}\): 0.97, Supplementary Fig. 1c).

Fig. 1: Comparison of P/E and S/F or S/F94 in synthetic data.
figure 1

a, b Scatterplots of \(P/F\) vs \(S/F\) individual measurements across a range of hypothetical physiological characteristics. Points are coloured according the \({S}_{a}{O}_{2}\) as shown in the colour scale. (a) including all values, showing linear regression of \(S/F\) against \(P/F\) in using different cut-off values for \({S}_{a}{O}_{2}\). Patients breathing air (\({F}_{I}{O}_{2}\) = \(21\%\)) were included in all bins. (b) including only values with \({S}_{p}{O}_{2}\le 0.94\) or \({F}_{I}{O}_{2}=21\%\) (c) Optimisation of cut-off value for \({S}_{a}{O}_{2}\) using predictive validity: the error in the prediction of a future \({P}_{a}{O}_{2}\), based on a previous one (using a pre-existing dataset of ABG results17). Centre line represents median values, box limits represent upper and lower quartiles, whiskers represent minimum and maximum values. (d) change in correlation coefficient (Pearson’s \(\rho \)) as the threshold for inclusion is lowered from \({S}_{a}{O}_{2} < 100\%\) to \({S}_{a}{O}_{2} \, < \, 80\%\).

Predictive validity

In parallel, we assessed the predictive validity of \(S/F\) and \(S/{F}_{94}\). As in our previous work17, we assert that if \(S/{F}_{94}\) is measuring true oxygenation function well, then it should be able to more accurately predict a future event: the \({P}_{a}{O}_{2}\) value in a future arterial blood gas measurement taken from the same patient. We used a pre-existing dataset of unselected ABG result pairs from hospitalised patients, described in detail previously17. We quantified the MAE above baseline in \({P}_{a}{O}_{2}\) to quantify predictive validity, with lower error values indicating better performance (Fig. 1c, Supplementary Fig. 2). Across a range of maximum cut-off values for \({S}_{a}{O}_{2}\), the lowest MAE value was obtained at 94% (Fig. 1c; \(S/F\) MAE \(=4.41\) kPa (IQR: \(2.74\)-\(6.63\) kPa); \(S/{F}_{94}\) MAE = \(3.32\) kPa (IQR: \(1.87\)-\(5.26\) kPa), p(MWU) = \(3.7\times {10}^{-18}\)).

Evaluation in ISARIC4C data

\(39\),\(765\) cases in the ISARIC4C study had \({S}_{p}{O}_{2}\), \({F}_{I}{O}_{2}\) and clinical data available for analysis and met the inclusion criteria (see Methods). Mortality in this population was \(20.8\%\) (Table 1). Since measurement of \(S/{F}_{94}\) was not protocolised in ISARIC4C, measurements were obtained for patients for whom \({S}_{p}{O}_{2}\) happened to be \(\le 0.94\) or who were breathing room air (\({F}_{I}{O}_{2}\) = 0.21), therefore meeting the \(S/{F}_{94}\) definition. The conceptual advantage of \(S/{F}_{94}\) over \(S/F\) is that it offers a closer relationship to the pathophysiological process of interest. This is not expected to be apparent in the distribution of values observed, but rather in the sensitive detection of a real therapeutic effect. For this reason, and because of the risk of selection bias (see Methods), we did not undertake a direct comparison of patients meeting the criteria for \(S/{F}_{94}\) measurement, against patients who do not. Instead, we evaluated \(S/{F}_{94}\) against other commonly used outcome measures.

Table 1 Comparison of outcome measures among 39,765 hospitalised patients aged 20-75, who required supplemental oxygen in the first 3 days in hospital

In order to select the timepoint of \(S/{F}_{94}\), several aspects were taken into account. Firstly, we looked at data availability. Within the ISARIC4C dataset, S/F values were available for the largest numbers of patients on days \(0\), \(2\), \(5\) and \(8\) from study enrolment. Second, among patients who remained in hospital, the distribution of \(S/{F}_{94}\) values moves over the first few days from study enrolment towards a bimodal pattern with high values in survivors, and low values in non-survivors (Fig. 2a). Finally, in order to make a meaningful comparison with the \(S/{F}_{94}\) at the day of enrolment, we preferred timepoints that were at least a few days after enrolment. We therefore chose day 5 as the primary timepoint for comparison. The distribution of measured \(S/{F}_{94}\) values and assigned maximum/ minimum values for those who were discharged/ died can be seen in (Fig. 2b). On day 5, 1077 out of 7,312 (14.7%) known \(S/{F}_{94}\) values were an assigned maximum/minimum value due to death/discharge. On day 8, 1948 out of 6079 (32.0%) known \(S/{F}_{94}\) values were an assigned maximum/minimum value. A sensitivity analysis excluding these assigned values is in the supplementary material.

Fig. 2: Evaluation of S/F94 in observational data.
figure 2

a Smoothed distributions of \(S/{F}_{94}\) values in survivors and non-survivors during the first \(12\) days of the study, not including assigned minimum/maximum values (restricted to \(39\),\(765\) patients aged \(20-75\), oxygen therapy within \(3\) days). b Histogram showing distribution of \(S/{F}_{94}\) values on day 5 as used for subsequent analyses (in purple). Patients discharged home before day 5 are assigned the maximum value (4.78), and patients who died before day 5 are assigned to an arbitrary minimum of 0.5 (in black). c Distribution of \(S/{F}_{94}\) values day \(5\) compared with WHO ordinal scale2 value at the same time point, in patients who met our inclusion criteria (aged \(20-75\), oxygen therapy within \(3\) days). No assigned minimum or maximum values are included in this figure. Hosp = hospitalised, no oxygen support; Ox = Hospitalised, oxygen by mask or nasal prongs; CPAP/HFNO = Hospitalised, oxygen by continuous positive airway pressure; high-flow nasal oxygen or non-invasive ventilation; IMV = Intubation and mechanical ventilation; IMV \(S/F\le 2=\) Mechanical ventilation; \(S/F\le 2\) or vasopressors; MOF = Multi-organ failure & mechanical ventilation & \(S/F\le 2\) & ECMO or renal replacement therapy. d Logistic regression analysis with \(95\%\) confidence interval, using both \(S/{F}_{94}\) on day \(0\) and \(S/{F}_{94}\) on day \(5\) as covariates, showing a clear association between mortality at \(28\) days and \(S/{F}_{94}\) value on day \(5\).

An intermediate clinical outcome should have a strong association with a definitive outcome. Using \(28\)-day mortality as the definitive outcome, and including \(S/{F}_{94}\) values on both day \(0\) and day \(5\) as covariates in a linear regression model, we found a strong inverse association between \(S/{F}_{94}\) on day \(5\) and mortality: an increased risk of mortality at day \(28\) is associated with a lower value of \(S/{F}_{94}\) on day \(5\) (Fig. 2d). The OR for \(28\)-day mortality is \(0.25\) (\(95\%\) confidence interval \(0.23\)-\(0.28\)), meaning that for a 1 unit increase in \(S/{F}_{94}\) on day 5, the odds of \(28\)-day mortality decrease by \(75\%\).

We also compared \(S/{F}_{94}\) with a widely used intermediate outcome, the WHO scale. Since this scale records clinical decisions about therapy that are, in part, determined by the severity of hypoxic lung disease, a close relationship was expected with \(S/{F}_{94}\) (Fig. 2c). The distributions were consistent between patients meeting the inclusion criteria (Fig. 2c) and unselected patients (Supplementary Fig. 5a). The distribution of \(S/{F}_{94}\) values between outcomes at day 28 for patients meeting the inclusion criteria is similar on day 0 and day 5 (Supplementary Fig. 5b and Supplementary Fig. 5c). As expected, when there are no criteria for supplemental oxygen in the first 3 days since admission (unselected patients, Supplementary Fig. 5d and Supplementary Fig. 5e), there is a relative increase of patients with high \(S/{F}_{94}\) values on day 0.

Sample size estimation

Using the observed relationships in ISARIC4C data for eligible patients (see Methods), we quantified effect sizes associated with a \(15\%\) relative risk reduction in mortality for each of the following measures: \(S/{F}_{94}\) at \(5\) and \(8\) days after study enrolment, the WHO ordinal scale at \(5\) and \(8\) days after study enrolment, the proportion of patients who reached a sustained \(1\) or \(2\)-level improvement on the WHO ordinal scale, and a definitive outcome, \(28\)-day mortality. We chose a \(15\%\) relative risk reduction in mortality based on previous power calculations for the RECOVERY trial. We then estimated the sample sizes required to detect these effects with \(80\%\) power at \(2p=0.05\) (2p indicates a two-tailed test).

Some examples of sample size estimations using different inclusion criteria can be found in the supplementary material (Supplementary Table 2 and Supplementary Table 3). We created an online tool, using synthetic data with similar characteristics to the ISARIC4C data (see Methods), to enable users to test any combination of inclusion criteria (age, frailty score and type of respiratory support) and outcome assessment timepoint: https://isaric4c.net/endpoints.

For a \(15\%\) relative reduction in mortality, the required sample size was smallest for \(S/{F}_{94}\) on day \(5\), needing \(722\) patients in each arm (\(1\)\(444\) in total, Table 1). The number of subjects required for \(S/{F}_{94}\) on day \(8\) was higher, with \(1\),\(342\) subjects in each arm (Supplementary Table 4). For the WHO ordinal scale, \(1\),\(666\) participants would be required in each arm on day \(5\), or \(1\),\(168\) on day \(8\) to detect this mortality reduction. Required sample size was larger when 1-level sustained improvement was used as the outcome variable, with \(3\),\(378\) patients in each arm, and \(1\),\(904\) subjects in each arm when using 2-level sustained improvement (Table 1). Errors around the point estimates shown in Table 1 are shown in Fig. 3 for a range of effect sizes.

Fig. 3: Comparison of the number of patients needed, including 95% confidence interval, for the different outcome measures, using treatment effects between 0.85 and 0.70.
figure 3

The bottom line shows predicted sample size required when using a protocolised \(S/{F}_{94}\) measurement, rather than an opportunistic measurement.

Estimated improvement with protocolised measurement of \(S/{F}_{94}\)

We have developed a protocol for measurement of \(S/{F}_{94}\) (Appendix: Protocol). Protocolising measurements is likely to substantially improve the accuracy of measurements of oxygenation function, firstly by ensuring that an oxygen delivery mode is used for which \({F}_{I}{O}_{2}\) can be accurately quantified (e.g. Venturi systems), and secondly by ensuring that measurements are taken at steady state. Protocolised measurement also permits inclusion of all patients, since \({F}_{I}{O}_{2}\) is decreased until \({S}_{p}{O}_{2}\le 0.94\), to a minimum of \({F}_{I}{O}_{2}=0.21\). We sought to estimate the magnitude of this improvement. We did this by fitting a measurement error model relating opportunistic and protocolised \(S/{F}_{94}\) measurements. A description of the estimation of effect size for the protocolised \(S/{F}_{94}\) measurement can be found in the supplementary methods. Based on this effect size estimate, the required sample size for a protocolised measurement of day 5 \(S/{F}_{94}\) would be around \(988\) subjects in total (Fig. 3).

Discussion

In synthetic (Fig. 1) and real (Supplementary Fig. 1) physiological data, we found that \({S}_{a}{O}_{2}\le 0.94\) is a pragmatic cut-off threshold, lying within a safe range, excluding the majority of obviously misleading values caused by the ceiling effect, and optimising predictive validity. Using observational data from the ISARIC4C study, we demonstrate that \(S/{F}_{94}\) fulfills our initial requirements for an intermediate outcome: a continuous outcome measure that is closely related to mortality and can be modified by therapy3. Testing predicted statistical power for a range of effect sizes in observational data, we found that \(S/{F}_{94}\) is more sensitive than other widely-used outcomes. Comparing both the WHO ordinal scale and \(S/{F}_{94}\) to the definitive outcome of mortality at day 28, we found that the same predicted treatment effect can be detected with fewer patients using \(S/{F}_{94}\), even when measurements are not protocolised.

In a clinical trial setting, where both \({S}_{p}{O}_{2}\) and \({F}_{I}{O}_{2}\) measurement can be protocolised, sensitivity is predicted to improve because protocolised measurement are less noisy and are therefore expected to have a stronger relationship with mortality. Using the SD for protocolised \(S/{F}_{94}\) during the RECOVERY trial, together with the assumed error measurement model relating protocolised and opportunistic \(S/{F}_{94}\) measurements, we predict a substantial additional improvement in statistical power using a protocolised measurement.

Our analyses may underestimate the statistical power of mortality, since time-to-event analyses would be used in most circumstances to maximise statistical power. Due to the large proportion of missing data after day 10, it was not possible to carry out survival modelling in our data. Ideally, we would have performed a mediation analysis with treatment effect, to determine the extent to which the treatment effect on mortality is explained by the intermediate endpoint \(S/{F}_{94}\). However, since there is no \(S/{F}_{94}\) data available from clinical studies showing significant treatment effect, it is not possible to perform this analysis.

Some important sources of error exist in the outcome measures we considered. Firstly, \({S}_{p}{O}_{2}\) and \({F}_{I}{O}_{2}\) are both subject to measurement error, particularly in opportunistic data. For example, estimating \({F}_{I}{O}_{2}\) for patients receiving supplemental oxygen via nasal cannula or simple (Hudson) masks is inaccurate, because the \({F}_{I}{O}_{2}\) is profoundly affected by inspiratory flow rate, which varies between patients. This error would be eliminated by protocolised measurement, which mandates the use of devices delivering a fixed \({F}_{I}{O}_{2}\). Secondly, the position of a patient on the ordinal WHO scale is influenced by both availability of resources and the decision by the patient and the clinician whether to escalate the level of care or provide organ support. This may explain the wide range of \(S/{F}_{94}\) values for patients at the same position on the WHO scale.

There are multiple advantages of using \(S/{F}_{94}\) as an intermediate outcome measure in a phase II clinical trial in hospitalised patients. It is an easy, non-invasive measurement, using near-ubiquitous monitoring equipment. In contrast, daily \({P}_{a}{O}_{2}\) measurements (from an arterial blood sample) are time-consuming, require highly skilled staff, and are burdensome for patients unless an indwelling arterial catheter is present (unusual outside of critical care areas). It is likely that the results of recent and ongoing clinical trials suggesting harm from hyperoxia will, in future, mean high \({S}_{a}{O}_{2}\) values a less common finding, particularly in the intensive care unit.

In order to determine the utility of a surrogate outcome in clinical trials, a distinction can be made between “individual level surrogacy” and “trial-level surrogacy”18. If there is an association between the surrogate and the outcome of interest in individual patients, the surrogate works on an individual level. If the effect that a treatment has on the surrogate can be used to predict the causal effect treatment has on the outcome, there is also trial level surrogacy. There are some scenarios, as explained by Buyse and colleagues18, in which there is individual-level surrogacy but no trial level surrogacy, for example due to (known and unknown) confounders, or treatment being dependent on the surrogate (e.g. low \(S/{F}_{94}\) values could lead to additional interventions that influence the outcome, confounding the influence of treatment on outcome). Trial-level surrogacy can be demonstrated with data from (multiple) randomised controlled trials. With the data we have available, we can thus only show individual-level surrogacy and not trial-level surrogacy. Determining whether \(S/{F}_{94}\) is also a trial-level surrogate would be a desirable objective for further studies.

Of the pragmatic endpoints available from routinely collected data, the WHO ordinal scale is the best-performing endpoint. In studies where clinical observations can be obtained, \(S/{F}_{94}\) is a robust measure of pulmonary oxygenation function, and is the best measure to optimise statistical power for comparisons. \(S/{F}_{94}\) is comparable to the P/F ratio as a measure of pulmonary oxygenation, and superior to \({S}_{p}{O}_{2}\)/\({F}_{I}{O}_{2}\) ratio. Where protocolised measurements can be obtained, further improvements in statistical power are expected. \(S/{F}_{94}\) may have utility in clinical studies of other disease processes where pulmonary oxgenation failure contributes to mortality, such as influenza and ARDS19.

In conclusion, \(S/{F}_{94}\) is a powerful and robust intermediate endpoint for clinical studies of COVID-19 and may have broad utility in forms of acute lung injury.

Methods

Ethical approval

All research described in this study complies with all relevant ethical regulations. Ethical approval was given by the South Central-Oxford C Research Ethics Committee in England (13/SC/0149), the Scotland A Research Ethics Committee (20/SS/0028), and the WHO Ethics Review Committee (RPC571 and RPC572, April 2013). In England and Wales, consent was not required for the collection of depersonalised routine healthcare research data. In Scotland, a waiver for consent was given by the Public Benefit and Privacy Panel.

Relationship to the reference standard (\({{{\boldsymbol{P}}}}{{{\boldsymbol{/}}}}{{{\boldsymbol{F}}}}\) ratio)

The \(P/F\) ratio is the oxygenation measure used in diagnostic criteria for acute respiratory failure, and is used in our analysis as the reference standard20. We evaluated the relationship between \(S/F\) and \(P/F\) in two datasets: a synthetic dataset of \(1\),\(529\),\(176\) predictions covering a wide range of possible physiological variation, generated by a mathematical model of oxygen delivery written in Python (available at https://github.com/baillielab/oxygen_delivery) and reported previously17, and \(72\),\(457\) unselected arterial blood gas results from a critically ill population17. Taking \(P/F\) to be our reference standard, we evaluated \(S/F\) at different thresholds in both synthetic and real data.

Predictive validity

We considered the predictive validity of \(S/F\) and \(S/{F}_{94}\) compared to \(P/F\) and two other measures of oxygenation function: the A-a, and effective shunt fraction (ES)17.

Predictive validity quantifies the extent to which a clinical measurement predicts an unseen event. The aim is not to optimise prediction, but to test the extent to which a measurement is describing a real feature of the patient’s illness21. In this case, we contend that a measure that accurately describes pulmonary oxygenation function will accurately predict \({P}_{a}{O}_{2}\) after a change is made to \({F}_{I}{O}_{2}\). Using the same pre-existing dataset of ABG results from critically ill patients as in our previous study17, we used this approach to assess the validity of \(S/F\) and \(S/{F}_{94}\).

Briefly, in pairs of arterial blood gas results taken from the same patient <3 h apart, in which \({F}_{I}{O}_{2}\) was decreased in the later sample, we used various measures of oxygenation (A-a, \(P/F\), ES, \(S/F\)) in the first ABG to predict the \({P}_{a}{O}_{2}\) in the second sample and compared these predicted values with the \({P}_{a}{O}_{2}\) that was measured in the second sample. Predictive validity was quantified by the median absolute error (MAE). A baseline value, showing the difference between ABG results for matched pairs in which \({F}_{I}{O}_{2}\) did not change, is provided to contextualise the MAE results as a reasonable minimum error value. Results are presented as difference in MAE from this baseline. The Mann-Whitney U-test (MWU) was used for the comparison of MAE difference from baseline.

Evaluation in ISARIC4C data

Inclusion criteria

All subjects were part of the ISARIC Coronavirus Clinical Characterisation Consortium (ISARIC4C) WHO Clinical Characterisation Protocol UK (CCP- UK), a study in England, Wales, and Scotland prospectively collecting data from patients hospitalised with SARS-CoV-2 infection since the start of the pandemic.

In order to focus our assessment on the subset of patients with hypoxaemic respiratory failure that is potentially modifiable by anti-inflammatory treatment, we repeated all analyses in subjects aged \(20\)-\(75\) who required supplementary oxygen therapy within 3 days of hospital admission, subjects aged \(20\)-\(75\) that were oxygen dependent on the day of admission, and subjects aged \(20\)-\(75\) without criteria for oxygen dependency. All included patients had \({S}_{p}{O}_{2}\) and \({F}_{I}{O}_{2}\) data available. While \({S}_{p}{O}_{2}\) is typically represented as a percentage, for \(S/{F}_{94}\) it is used as a fraction, with values ranging from 0-1.

Estimation of \(S/{F}_{94}\) in observational data

The \(S/F\) ratio was calculated by dividing \({S}_{p}{O}_{2}\) by \({F}_{I}{O}_{2}\) (with both as fractions, taking values between 0 and 1). For this evaluation, \(S/{F}_{94}\) was defined as an opportunistic measurement in which \({S}_{p}{O}_{2}\le 0.94\), or the patient was receiving no supplementary oxygen (\({F}_{I}{O}_{2}=0.21\)).

Importantly, the retrospectively-defined subgroup of patients meeting the \(S/{F}_{94}\) criteria is not representative of all patients since there was an excess of patients who were not receiving respiratory support, with slight excess mortality, in the \(S/{F}_{94}\) group (Supplementary Table 1). This indicates at least two mechanisms of selection bias, acting in opposite directions, and precluding a direct comparison. Firstly, patients who have high blood oxygen levels on relatively little supplementary oxygen are excluded from the \(S/{F}_{94}\) group; by definition these patients have relatively mild disease. Secondly, the group in whom \(S/{F}_{94}\) could be measured includes patients who receive supplemental oxygen, and fail to reach adequate \({S}_{p}{O}_{2}\) values, but are not escalated to a higher level of respiratory support; this is a frail and multimorbid population with very severe disease.

\(S/{F}_{94}\) was calculated at baseline (day 0) and on day 5 and day 8 from study enrolment. There is expected to be differential missingness between \(S/{F}_{94}\) and mortality: \({S}_{p}{O}_{2}\) and \({F}_{I}{O}_{2}\) data are only available for a proportion of cases, whereas outcome data is well-recorded. Patients who died or were discharged on given day and had a missing value for \(S/{F}_{94}\) were assigned values 0.5 (severe oxygenation defect) and 4.76 (perfect oxygenation), respectively. However, death/discharge was more likely to be recorded than \(S/{F}_{94}\), and this could introduce bias into our analysis. We addressed this by estimating the proportion of patients for whom \(S/{F}_{94}\) measurements were available among those who had not died or been sent home by a given day. We then resampled those who died/discharged according to these proportions. For example, if on day 5, 20\(\%\) of those who had not died or discharged had \(S/{F}_{94}\) measurements available, we randomly resampled 20\(\%\) of those who died/had been discharged by then, assigning \(S/{F}_{94}=0.5\) to those who died, and \(S/{F}_{94}=4.76\) to those who were discharged.

Association between \(S/{F}_{94}\) and 28-day mortality

Two key assumptions underlie the use of \(S/{F}_{94}\) as an intermediate endpoint. Firstly, that pulmonary oxygenation function predicts mortality in COVID-19, and secondly, that \(S/{F}_{94}\) accurately reflects the pulmonary oxygenation function. If either of these assumptions are violated, then a strong relationship between \(S/{F}_{94}\) and subsequent mortality would not be expected.

To evaluate this association, a logistic regression model was developed with 28-day all-cause mortality as the dependent variable and \(S/{F}_{94}\) measured on day 0 and day 5 as two separate covariates. We included both \(S/{F}_{94}\) on day 0 and day 5 due to the strong relationship between \(S/{F}_{94}\) on day 0 and \(S/{F}_{94}\) on days further in the disease trajectory. Linear dependence of log-odds on \(S/{F}_{94}\) measured on day 0 and day 5 was assessed both by visual inspection and using model selection criteria including the Bayesian Information Criterion (BIC) to compare to a restricted splines model. Finally, predicted models were made to assess the absolute change in risk of mortality with a change in \(S/{F}_{94}\).

Sample size calculations

We compared the sample sizes required for a range of different outcomes measures (\(S/{F}_{94}\), WHO ordinal scale, sustained improvement at day 28 and 28-day mortality). For the intermediate endpoints, we estimated the treatment effect associated with a \(15\%\) relative reduction in mortality. Below we give brief descriptions of the effect size calculations for the different outcome measures. All calculations assumed a \(1\):\(1\) allocation of participants between treatment and control groups and are based on having \(80\%\) power at 2p\(=0.05\) to detect the stated treatment effect. Details on effect size estimation can be found in the supplementary material.

Quantifying uncertainty

We bootstrapped \(95\%\) confidence for the effect size, and then used this to calculate \(95\%\) confidence intervals for required sample size using the fact that they are monotonically related.

Continuous variables (\(S/{F}_{94}\))

We fit a logistic regression with mortality at day 28 as the dependent variable, and age, sex, \(S/{F}_{94}\) on day 0 (baseline) and day 5 (or day 8) as independent variables. We used this to calculate the predicted probability of mortality, and the change in \(S/{F}_{94}\) associated with a relative reduction in predicted mortality of \(15\%\), for each subject. Finally, we took the mean to find the average change in day 5 \(S/{F}_{94}\) that is associated with a \(15\%\) reduction in mortality across the sample. This was the target treatment effect in the clinical trial. We calculated the sample size required to see this treatment effect with a given level of power using a two sample t-test with ANCOVA correction for the correlation between \(S/{F}_{94}\) on day 0 and day 522.

Ordinal variables (WHO scale)

Values for the WHO ordinal scale were derived using information about oxygen support and mortality. Possible values in hospitalised patients range between 4 and 102.

WHO scale - absolute value

We fitted a proportional odds model with the WHO ordinal scale as the dependent variable, and age and sex as independent variables. We used this model to estimate the odds ratio associated with a \(15\%\) relative reduction in mortality23.

WHO scale - sustained improvement

We derived binary variables for sustained 1- or 2-level improvement on the WHO scale. To be considered sustained, an improvement had to be maintained until discharge or until day 28. We fitted a logistic regression model with mortality at day 28 as the dependent variable, and age, sex and sustained 1- or 2-level improvement on the WHO scale as independent variables. We used this model to estimate the difference in proportion of people who had a sustained improvement on the WHO ordinal scale that was associated with a \(15\%\) reduction in risk of mortality. We then calculated required sample size for this outcome using a two-sample test for proportions with a continuity correction24. Only patients who had WHO ordinal scale values on at least two separate days were included in this analysis.

Mortality

In order to compare these alternative outcome measures with a definitive outcome (mortality), we calculated the number of participants needed if 28-day mortality was the outcome measure, using a two-sample test for proportions with continuity correction.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.