Machine Learning for COVID-19 needs global collaboration and data-sharing

The COVID-19 pandemic poses a historic challenge to society. The profusion of data calls for machine learning to improve and accelerate COVID-19 diagnosis, prognosis and treatment. However, a global and open approach is necessary to avoid pitfalls in these applications.

On 31 December 2019, the first cases of a viral pneumonia of unknown aetiology were reported in the city of Wuhan, China. In the following weeks, the Chinese authorities and the World Health Organization (WHO) announced the discovery of a novel coronavirus and its associated disease: SARS-CoV-2 and COVID-19, respectively. On 21 April 2020, the number of cases of COVID-19 exceeded 2.4 million and the death toll exceeded 170,000 worldwide1. The outbreak of COVID-19 represents a major and urgent threat to global health. While the unprecedented speed at which COVID-19 has spread is partly rooted in our increasingly globalized society, the global sharing of scientific data also offers a promising tool to fight the disease. In the past four months, more than 12,400 articles have been published2 and scientific data collected from thousands of patients have been released3. The majority of these studies follow the standard scientific method: that is, they investigate a few hypotheses at a time on a controlled sample. While undeniably successful, this standard method suffers from two well-known limitations, both critical in a pandemic: (1) it requires considerable expertise and human input and (2) it only considers a handful of hypotheses at a time. Machine learning (ML) has been used to meet these challenges in various pathologies4,5, including infectious diseases6. Herein, we describe two areas where ML could supplement standard statistical methods in the COVID-19 pandemic, discuss the practical challenges that such an ML approach entails, and advocate for global collaboration and data sharing.

ML to alleviate the workload of medical experts

While standard statistical methods can provide the first results necessary in an emergency, they often require considerable human resources, which are precisely what is lacking in such a context. Health systems quickly found themselves overwhelmed, and the potential for data analysis, particularly in clinical research, was limited by the amount of work required. ML techniques can decrease the time required to produce analyses by automating them, and allow artificial intelligence practitioners to support clinicians. For example, medical imaging studies show that chest computed tomography (CT) scans can be used to detect COVID-19 lesions7,8. However, such studies typically require each scan to be reviewed by a trained radiologist, who could otherwise be working on the front lines. ML may alleviate this task: recent supervised classifiers trained on a large dataset of 400,000 chest X-rays achieved a mean area under the receiver operating characteristic curve (true positive rate plotted against false positive rate) of 94% for the diagnosis of 14 distinct lung pathologies9. Furthermore, preliminary studies based on a few hundred chest CT scans suggest that COVID-19 can be automatically diagnosed with ML10. However, the use of ML on medical images to diagnose COVID-19 or predict its course currently remains limited to relatively small cohorts. These studies thus poorly control for the numerous confounds (for example, age, corpulence) that the algorithms may detect from chest images. A promising strategy is to pre-train ML models on larger datasets of similar images, so that they learn features common to such images, which can then be reused to facilitate training on COVID-19 images. This strategy has been used repeatedly in computer vision in recent years to achieve impressive results in tasks with few labelled examples11.
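To make the metric quoted above concrete, the following is a minimal sketch of how a mean area under the ROC curve could be computed across several pathologies. It relies on the fact that the area equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The function names (roc_auc, mean_auc) and the toy labels and scores are illustrative assumptions, not the evaluation code or data of the cited study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, computed by pairwise comparison.

    Equals the probability that a randomly chosen positive case gets a
    higher score than a randomly chosen negative case (ties count 0.5).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def mean_auc(label_columns, score_columns):
    """Average the per-pathology AUCs (one list of labels/scores per pathology)."""
    aucs = [roc_auc(y, s) for y, s in zip(label_columns, score_columns)]
    return sum(aucs) / len(aucs)


# Hypothetical toy data: two pathologies, four patients each.
labels = [[1, 0, 1, 0], [0, 0, 1, 1]]
scores = [[0.9, 0.2, 0.6, 0.7], [0.1, 0.4, 0.35, 0.8]]
print(mean_auc(labels, scores))  # prints 0.75
```

An area of 1.0 means every positive case is ranked above every negative one, while 0.5 corresponds to chance-level ranking.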

ML to accelerate the screening of treatments

Standard methods only consider a handful of hypotheses at a time. For example, among more than 1,200 clinical trials that have been registered to identify treatments for COVID-19, the majority focus on a single drug or a couple of drugs, hand-selected on rationales of varying relevance12. ML can broaden such a screening and selection process by simultaneously considering several potential antiviral agents, relying on DNA sequences and/or protein structure, including potential drug binding sites of SARS-CoV-2, to predict interactions between drugs and the virus, and thus shortlisting promising candidate treatments13,14. ML has been used in other infectious diseases in a similar fashion15: for example, a deep neural network was successfully trained to screen the activity of more than 100 million molecules on Escherichia coli16. In the same way, a large spectrum of vaccine candidates could be screened based on their potential to elicit an effective immune response, for example, by presenting the spike protein S that follows a SARS-CoV-2 infection17. Nonetheless, these potentially fruitful avenues should not hide the challenges of therapeutic research based on ML. First, ML cannot accelerate basic biology, and even the prediction of protein folding remains a remarkably difficult problem18. In the case of vaccines, there is therefore a necessary waiting period. Second, a major ethical concern is the temptation to bypass proper clinical trials: working with very small cohorts, not using an adequate design, or omitting inclusion and exclusion criteria have already been reported in the recent hydroxychloroquine-based treatment research19. This risk could dramatically increase with ML algorithms. Indeed, algorithms such as deep neural networks are ‘general approximators’: they can be trained to fit any objective on a dataset by, for example, memorizing the diagnosis of every patient. ML algorithms can only be evaluated conclusively by assessing their ability to accurately predict an independent test set, an approach that necessitates large datasets and a priori inclusion and exclusion criteria.
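The memorization risk described above can be illustrated with a small sketch. It assumes a deliberately extreme setting: synthetic 'patients' whose labels are random and unrelated to their features, and a nearest-neighbour rule standing in for any model flexible enough to memorize its training data. All names (make_cohort, nearest_neighbour_predict, accuracy) and cohort sizes are hypothetical choices made for illustration only.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def make_cohort(n, n_features=5):
    """Synthetic cohort whose labels are unrelated to the features."""
    xs = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n)]
    ys = [rng.randint(0, 1) for _ in range(n)]
    return xs, ys


def nearest_neighbour_predict(train_x, train_y, x):
    """Return the label of the closest training point (pure memorization)."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    return train_y[best]


def accuracy(train_x, train_y, xs, ys):
    """Fraction of cases in (xs, ys) predicted correctly from the training set."""
    hits = sum(
        nearest_neighbour_predict(train_x, train_y, x) == y
        for x, y in zip(xs, ys)
    )
    return hits / len(ys)


train_x, train_y = make_cohort(200)
test_x, test_y = make_cohort(200)  # independent cohort, never seen in training

train_acc = accuracy(train_x, train_y, train_x, train_y)
test_acc = accuracy(train_x, train_y, test_x, test_y)
print(f"accuracy on training cohort: {train_acc:.2f}")
print(f"accuracy on independent test cohort: {test_acc:.2f}")
```

Because each training case is its own nearest neighbour, accuracy on the training cohort is perfect even though there is nothing to learn; only the independent cohort reveals performance near chance.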

A major need for data sharing

While standard statistical analyses are well suited to many clinical and epidemiological challenges, ML is essential to accelerate the analysis of complex and large datasets such as large genomic or medical imaging datasets. Overall, ML thus promises to supplement rather than supersede the standard methods used for diagnosis, prognosis and treatment. However, two major challenges currently limit the potential impact of ML. First, ML algorithms are notoriously difficult to interpret. While visualization tools may highlight the combination of variables that led an algorithm to make a particular prediction, healthcare professionals must be aware that, like humans, ML can easily be affected by systematic biases (for example, scanning device, patient’s age and so on). Special pedagogical efforts must thus be made both in scientific reports and in the clinic to maintain a healthy scepticism when it comes to ML findings. Second, the lack of large public repositories of healthcare, clinical, imaging and genetic data leads each institution to develop its own analytical pipeline locally on its own small dataset, which significantly limits the generalizability of the results. While this issue is not specific to ML, the ability of modern algorithms to encompass heterogeneous datasets should drive us to both (1) share the anonymized raw data used in each clinical study, and (2) favour the development of large cohorts. The International Severe Acute Respiratory and emerging Infection Consortium (ISARIC) initiative aims to provide a large and shared clinical database on COVID-19 patients20. Other institutions have signed data-sharing agreements to ensure that data are shared widely and rapidly21,22 and can inform new hypotheses, but this is still done in a piecemeal fashion, making it difficult to make the most of the data generated daily during the pandemic.

Not only will the quality of the standard and ML models directly depend on the size, quality and representativeness of such databases, but these databases will also be critical to supporting effective interventions across different countries and types of healthcare facilities6. Open sharing of clinical databases requires significant care to properly manage regulatory and data privacy issues. Rapidly resolving these issues can be particularly challenging during a pandemic, when many public institutions are not operating normally. However, until we meet these challenges, ML may not keep its promise to help fight the virus.


The COVID-19 outbreak is not the first pandemic and is unlikely to be the last. For the first time, however, our societies have the means to provide a coordinated, evidence-based, fair and global public-health response. While the efficiency of this response may partly depend on ML, it depends even more crucially on our ability to set up global collaborations and data-sharing agreements that can accelerate the discovery and validation of promising interventions.


References

1. Dong, E., Du, H. & Gardner, L. Lancet Infect. Dis. 20, 533–534 (2020).
2. Dimensions COVID-19 publications, data sets, clinical trials. Figshare (2020).
3. Wu, Z. & McGoogan, J. M. JAMA 323, 1239–1242 (2020).
4. Claassen, J. et al. N. Engl. J. Med. 380, 2497–2505 (2019).
5. Sitt, J. D. et al. Brain 137, 2258–2270 (2014).
6. Peiffer-Smadja, N. et al. Clin. Microbiol. Infect. (2019).
7. Ai, T. et al. Radiology (2020).
8. Chen, Z. et al. Eur. J. Radiol. 126, 108972 (2020).
9. Pham, H. H., Le, T. T., Tran, D. Q., Ngo, D. T. & Nguyen, H. Q. Preprint at (2019).
10. Zheng, C. et al. Preprint at (2020).
11. Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. Preprint at (2020).
12. Belhadi, D. et al. Preprint at (2020).
13. Liu, X. & Wang, X.-J. J. Genet. Genom. 47, 119–121 (2020).
14. Computational predictions of protein structures associated with COVID-19. Deepmind (2020).
15. Peiffer-Smadja, N. et al. Clin. Microbiol. Infect. (2020).
16. Stokes, J. M. et al. Cell 180, 688–702e13 (2020).
17. Weiskopf, D. et al. Preprint at (2020).
18. Senior, A. W. et al. Nature 577, 706–710 (2020).
19. Gautret, P. et al. Int. J. Antimicrob. Agents (2020).
20. COVID-19 Clinical Research Coalition. Lancet 395, 1322–1325 (2020).
21. Sharing research data and findings relevant to the novel coronavirus (COVID-19) outbreak. Wellcome Trust (2020).
22. Open-access data and computational resources to address COVID-19. National Institutes of Health (2020).

Author information



Corresponding author

Correspondence to Nathan Peiffer-Smadja.

Cite this article

Peiffer-Smadja, N., Maatoug, R., Lescure, FX. et al. Machine Learning for COVID-19 needs global collaboration and data-sharing. Nat Mach Intell 2, 293–294 (2020).
