replying to Marian Quanjel et al. Nature Machine Intelligence https://doi.org/10.1038/s42256-020-00253-3 (2020)

replying to Claire Dupuis et al. Nature Machine Intelligence https://doi.org/10.1038/s42256-020-00252-4 (2020)

replying to Matthew Barish et al. Nature Machine Intelligence https://doi.org/10.1038/s42256-020-00254-2 (2020)

We thank the authors for their interest in our paper (ref. 1), for applying the model to new datasets and for sharing the data. Differences in hospital and laboratory protocols can lead to significant changes in the distributions of blood-sample measurements. In addition, it is possible that genetic heterogeneity between Asian and Caucasian populations also affects blood samples. We are very interested in understanding the differences between the data and would welcome a collaboration between our groups.

To understand the differences in model performance between these datasets and the one in ref. 1, we directly compared the distributions of the three key biomarkers (Figs. 1 and 2). Figure 1 shows the distributions of the three biomarkers across all blood samples, while Fig. 2 separates the blood samples according to patient outcome. Figure 1 makes clear that the distributions from Tongji Hospital (top row) are very different from those of the St Antonius Hospital (Nieuwegein, the Netherlands) (AH, second row), French Outcomerea (FO, third row) and Northwell Health (US) (NH, bottom row) datasets. We performed pairwise two-sample Kolmogorov–Smirnov tests between Tongji Hospital and each of AH, FO and NH for each of the three biomarkers. For all three biomarkers, the distributions at Tongji differ significantly from those at the other hospitals.
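
As an illustration, such pairwise comparisons can be reproduced with a standard two-sample Kolmogorov–Smirnov test. The following is a minimal Python sketch (not the code used for our analysis); the biomarker arrays are synthetic placeholders and, in practice, would be replaced by the measured values from each dataset:

# Minimal sketch: pairwise two-sample Kolmogorov-Smirnov tests between the
# Tongji distribution and each external dataset, for one biomarker at a time.
import numpy as np
from scipy.stats import ks_2samp

def ks_vs_tongji(tongji_values, external_datasets):
    """Compare the Tongji biomarker values against each external dataset."""
    results = {}
    for name, values in external_datasets.items():
        statistic, p_value = ks_2samp(tongji_values, values)
        results[name] = (statistic, p_value)
    return results

# Illustrative usage with synthetic, LDH-like values (placeholders, not real patient data).
rng = np.random.default_rng(0)
tongji_ldh = rng.normal(600, 300, size=350)
external_ldh = {
    "AH": rng.normal(350, 150, size=300),
    "FO": rng.normal(400, 180, size=250),
    "NH": rng.normal(450, 200, size=500),
}
for name, (statistic, p_value) in ks_vs_tongji(tongji_ldh, external_ldh).items():
    print(f"Tongji vs {name}: KS statistic = {statistic:.3f}, p = {p_value:.2e}")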

Fig. 1: Distribution of all three biomarkers for the different datasets.

From left to right: distributions of lactate dehydrogenase (LDH/LD), high-sensitivity C-reactive protein (hs-CRP) or C-reactive protein (CRP) and percent lymphocytes in blood samples from the different datasets: Tongji Hospital (last 10 days), AH (last 10 days), FO (all data) and NH (all data).

Fig. 2: Fitted distributions of the three biomarkers on the different datasets, separated by outcome.

a–c, Distributions of LDH/LD (a), hs-CRP/CRP (b) and percent lymphocytes (c) in blood samples from the Tongji Hospital (training, test), AH, FO and NH datasets.

In Fig. 2, all three biomarkers in the data from Tongji Hospital (training and external test data combined) show a clear separation between survival and death. This is not the case for the AH, FO and NH datasets, where the distributions for surviving and deceased patients overlap considerably, making it difficult to predict a patient’s outcome.

The reasons for these changes in distributions are unknown and require further investigation.

One possible explanation for the differences in the distributions in Figs. 1 and 2 is differences in hospital protocols; in particular, the discharge protocols appear to be very different. Table 1 summarizes the last blood samples taken before outcome. On average, surviving patients from AH, FO and NH were discharged with all three biomarkers considerably outside normal ranges (for LDH, ~80–250 U l−1 (ref. 2); for CRP, <10 mg l−1 (ref. 3); for lymphocytes (%), ~20–40% (ref. 4))5,6,7. In other words, surviving patients from AH, FO and NH appear to have been released earlier than those at Tongji Hospital. Hence, patients with relatively high values of LDH are assigned a survival outcome. It is possible that, had these patients remained in hospital longer, the distributions of their blood samples within 10 days of outcome would have been closer to those of Tongji Hospital.

Table 1 LDH, CRP and percent lymphocytes results

All hospitals in China follow a strict discharge protocol set by the China National Health Commission (ref. 8):

  • the patient’s temperature has remained normal (<37.3 °C) for more than three days

  • respiratory symptoms have been relieved

  • COVID-19 nucleic acid in respiratory tract specimens has tested negative twice in a row (sampling interval of at least 24 h)

  • the chest image shows absorption in the lungs

It would be interesting to compare the discharge protocols for patients in the AH, FO and NH datasets with those of Tongji Hospital.

A second explanation may relate to the different laboratory protocols used in the hospitals. For example, Tongji Hospital uses the ‘lactate dehydrogenase acc. to IFCC ver.2’ kit made by Roche Diagnostics GmbH to measure LDH. It would be interesting to review the literature and compare protocols between hospitals. In addition, Tongji Hospital measures hs-CRP, while some hospitals measure CRP; as shown in ref. 9, these two measurements are not equivalent. Finally, it would be interesting to compare haemolysis rates and the details of the laboratory protocols more generally.

Third, as mentioned by the authors, LDH expression appears to show substantial genetic heterogeneity between Asian and Caucasian populations (refs. 10,11). Assuming that the NH hospitals serve an ethnically diverse patient population, further data separated by ethnicity may provide new clues.

Fourth, different hospital treatments or baseline characteristics of patients can influence outcomes.

Fifth, mortality in intensive care and in non-critical care settings has been dropping by 2–5% every week since April 2020 (ref. 12). This could create discrepancies between the data used in ref. 1 and the AH, FO and NH datasets. One solution would be to retrain the model as new data become available (see below).

Sixth, it has been reported in refs. 13,14 that there are at least two lineages of the SARS-CoV-2 virus. As yet, the implications of these evolutionary changes for disease aetiology remain unclear. It is possible that patients show different values of these biomarkers because they were infected with different strains. In particular, viral genomes from China, Europe and the United States appear to fall into distinct clusters.

Seventh, patient selection for the FO dataset did not follow the complete patient selection process used in ref. 1 and hence does not serve as an unbiased validation of the model in ref. 1. According to ref. 7, the following patients were excluded from the FO dataset: patients with ‘very low LDH and CRP serum levels and high lymphocyte counts (these patients have good outcomes)’ and ‘some of the most severely ill patients with high CRP and LDH serum levels and low lymphocyte counts, who are not admitted to ICU because of therapeutic limitation (these patients have the worst outcomes)’. In essence, patients that would have been correctly classified by our model were removed, leaving only a selection of intermediate patients that are harder to classify, thus reducing the overall accuracy of our model. This is confirmed by their statement, ‘Thus, it is not surprising that the predictive rule of Yan et al. was not accurate in our cohort’ (ref. 7).

Model retraining

Recall that the model in ref. 1 was trained only on the last samples taken, although it can then be applied to other blood samples, including those taken at hospital admission (see below; ref. 15). For the AH dataset, we retrained the model using data within 10 days of outcome, because there was no information on which samples were the last before outcome (only one sample per patient is available in the AH data). The retrained model followed exactly the same single-tree XGBoost method specified in ref. 1: maximum depth equal to 3, learning rate equal to 0.1, number of tree estimators set to 1, regularization parameter α set to 0, and ‘subsample’ and ‘colsample_bytree’ both set to 1. We achieved an average accuracy of 0.83 (0.76, 0.82, 0.90, 0.82, 0.84) using fivefold cross-validation. The fivefold cross-validation was necessary because no further data were available to test the model. Moreover, we could not test the model on admission data or over time, as this information was not available.
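
For reference, the retraining set-up described above corresponds to the following minimal Python sketch (not our exact code); the feature matrix and outcome labels are synthetic placeholders standing in for the AH blood samples within 10 days of outcome:

# Minimal sketch of the single-tree XGBoost retraining with fivefold cross-validation.
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Single-tree configuration with the hyperparameters stated above (and in ref. 1).
model = XGBClassifier(
    max_depth=3,
    learning_rate=0.1,
    n_estimators=1,       # a single tree
    reg_alpha=0,
    subsample=1,
    colsample_bytree=1,
)

# Placeholder feature matrix (LDH, hs-CRP, lymphocyte %) and outcomes (0 = survival, 1 = death);
# in practice these come from the AH dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = rng.integers(0, 2, size=300)

scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("fivefold accuracies:", np.round(scores, 2), "mean:", round(scores.mean(), 2))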

Retraining was not possible for the FO dataset due to the use of a patient selection process different from that of ref. 1.

For the NH dataset, we retrained our decision tree model with the same single-tree XGBoost method, using the last blood samples (exactly as in ref. 1). We achieved a training accuracy of 0.78. Moreover, the retrained model achieved an accuracy of 0.72 using only the first blood sample (admission samples), showing that the retrained model is useful for triaging patients upon admission. Finally, the performance of the mortality arm improved considerably: 75% (65%) for the retrained model versus 41% (40%) for the original model in ref. 1, using the last (first) blood samples. We could not test the model over time as in ref. 1, because the dates of the blood samples were not available.

Further validation

Since the publication of ref. 1, we have obtained further data beyond those from Tongji Hospital, including data from two additional hospitals in China. We applied the decision tree in ref. 1 to new patient data from Jinyintan Hospital in Wuhan and No. 3 People’s Hospital in Shenzhen (ref. 16). The datasets from Jinyintan and Shenzhen include all patients with COVID-19 for whom values for all three biomarkers were available, up to 31 March 2020 and 13 April 2020, respectively. The results are presented in Fig. 3. Overall, both hospitals show performance similar to that of Tongji, with accuracies of 94% and 90%, respectively. This demonstrates that exactly the same model can predict the mortality of individual patients more than 10 days in advance, with more than 90% accuracy, in different centres in China.
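
For readers who wish to reproduce this kind of check, applying the decision tree amounts to a few threshold comparisons per blood sample. The Python sketch below follows the LDH → hs-CRP → lymphocyte structure of the tree in ref. 1, but the cut-off values shown are illustrative placeholders rather than authoritative values; the published cut-offs in ref. 1 should be substituted before any real use:

# Minimal sketch of applying a fixed three-feature decision tree to one blood sample
# (not the authors' code). The thresholds below are illustrative placeholders.
LDH_CUTOFF = 365.0        # U/l (placeholder)
CRP_CUTOFF = 41.2         # mg/l (placeholder)
LYMPHOCYTE_CUTOFF = 14.7  # percent (placeholder)

def predict_outcome(ldh, crp, lymphocyte_pct):
    """Apply the three-feature decision rule to one blood sample."""
    if ldh >= LDH_CUTOFF:
        return "death"
    if crp < CRP_CUTOFF:
        return "survival"
    return "survival" if lymphocyte_pct > LYMPHOCYTE_CUTOFF else "death"

# Example with one hypothetical blood sample.
print(predict_outcome(ldh=280.0, crp=55.0, lymphocyte_pct=10.5))  # prints 'death'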

Fig. 3: Results with datasets from two further hospitals in China.

a,b, Performance of the discovered clinical route on the external test dataset from Jinyintan Hospital (a) and No. 3 People’s Hospital Shenzhen (b).

Another independent study (the WHO COVID-19 database; ref. 17) identified the same top three biomarkers (LDH, lymphocytes, CRP) using data from five Chinese centres (Guizhou Provincial People’s Hospital, Affiliated Hospital of Zunyi Medical University, Jiangjunshan Hospital of Guizhou Province, Zhongnan Hospital of Wuhan University and the Radiology Quality Control Center database of Hunan Province). In addition, many other publications have independently identified similar biomarkers: for example, lymphocytes were identified as a risk factor in refs. 18,19,20,21,22, CRP in refs. 19,20,21,23,24 and LDH in refs. 18,19,24,25. This further validates the importance of these three biomarkers.

Admission samples

The comment in ref. 26 suggested presenting the performance of the model from ref. 1 on blood samples taken at admission. Indeed, this analysis should have been included in ref. 1; it is now shown in Fig. 4.

Fig. 4: Results using first blood samples taken at admission from patients at Tongji Hospital.

Performance of the discovered clinical route on the external test dataset from Tongji Hospital using patients’ first blood samples taken at admission.

The overall accuracy at admission is 88%, with a survival (death) accuracy of 98.8% (48%). More importantly, of the 110 patients, the model would have stratified 85 as low risk at admission (with only one incorrect) and 25 as high risk (of whom 12 died). Hence, as expected, the model is more conservative with high-risk patients. Overall, 85 of the 110 patients (77%) would have been classified as low risk at admission, relieving pressure on hospital resources. This shows that the model provides useful triage information at admission.
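
One consistent reading of the per-arm figures above is to take each arm's accuracy as the fraction of correct calls within that predicted risk group; a brief Python check under that assumption:

# Sanity check of the per-arm figures quoted above, reading each arm's accuracy as
# the fraction of correct calls within that predicted risk group.
low_risk_correct = 84 / 85    # 85 stratified as low risk, one incorrect -> ~98.8%
high_risk_correct = 12 / 25   # 25 stratified as high risk, 12 died -> 48%
print(f"low-risk arm: {low_risk_correct:.1%}, high-risk arm: {high_risk_correct:.1%}")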

Discussion

As stated in the last paragraph of the discussion in ref. 1, the model was developed and tested, with high accuracy, on data from a single hospital in Wuhan, China. All statements in the paper are based on data from Tongji Hospital, as this was the only data we had available at the time of publication. Since the publication of ref. 1, we have further validated the model on data from two additional hospitals in China.

Reference 1 and the comments in relation to AH, FO and NH have opened interesting discussions and research questions that we hope to pursue together. Moreover, we call on hospitals around the world to share their data, and we welcome opportunities to collaborate. For example, the comments in relation to FO list a number of very interesting extensions to the model, and we would welcome collaborations to tackle them.

Finally, it is clear that, at any given time, we do not know how many days remain until the outcome. Nevertheless, fig. 3d,e in ref. 1 shows that the accuracy of prediction improves as new blood samples become available. This remains true even when the date of outcome is unknown (that is, in a practical clinical situation). Note that, even 18 days before the outcome, the overall cumulative prediction accuracy is still above 90%, and at admission it is 88%.

Summary

We tested the model with three new datasets from St Antonius Hospital in Nieuwegein, the Netherlands (referred to as AH), French Outcomerea (FO) and Northwell Health, United States (NH). The key messages are as follows:

  • We retrained our model with data from AH, and the overall cross-validation accuracy increased from 53% to 83%.

  • Patient selection for FO followed very different criteria from ref. 1 and hence it does not serve as unbiased validation of the model in ref. 1.

  • We retrained our model for NH; the accuracy on the last blood samples increased from 50% to 78%, and the testing accuracy on the first blood samples (admission) increased from 48% to 72%.

  • The same model as in ref. 1 (without retraining) was applied to new data from Jinyintan Hospital in Wuhan and No. 3 People’s Hospital in Shenzhen, with accuracies of 94% and 90%, respectively.

In some cases, no retraining is needed, while in others, retraining significantly improves model performance.