Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans

Machine learning methods offer great promise for fast and accurate detection and prognostication of COVID-19 from standard-of-care chest radiographs (CXR) and computed tomography (CT) images. Many articles have been published in 2020 describing new machine learning-based models for both of these tasks, but it is unclear which are of potential clinical utility. In this systematic review, we search EMBASE via OVID, MEDLINE via PubMed, bioRxiv, medRxiv and arXiv for published papers and preprints uploaded from January 1, 2020 to October 3, 2020 which describe new machine learning models for the diagnosis or prognosis of COVID-19 from CXR or CT images. Our search identified 2,212 studies, of which 415 were included after initial screening and, after quality screening, 61 studies were included in this systematic review. Our review finds that none of the models identified are of potential clinical use due to methodological flaws and/or underlying biases. This is a major weakness, given the urgency with which validated COVID-19 models are needed. To address this, we give detailed recommendations which, if followed, would address these issues and lead to higher-quality model development and well-documented manuscripts.


Introduction
In December 2019, a novel coronavirus was first recognised in Wuhan, China 1 . On January 30, 2020, as infection rates and deaths across China soared and the first death outside China was recorded, the WHO described the then-unnamed disease as a Public Health Emergency of International Concern 2 . The disease was officially named Coronavirus disease 2019 (COVID-19) by February 12, 2020 3 , and was declared a pandemic on March 11, 2020 4 . Since its first description in late 2019, COVID-19 has spread across the globe, causing massive societal disruption and stretching our ability to deliver effective healthcare. This strain was driven by a lack of knowledge about the virus's behaviour, together with the absence of an effective vaccine and anti-viral therapies.
Although reverse transcription polymerase chain reaction (RT-PCR) is the test of choice for diagnosing COVID-19, imaging can complement its use to achieve greater diagnostic certainty or even be a surrogate in some countries where RT-PCR is not readily available. In some cases, CXR abnormalities are visible in patients who initially had a negative RT-PCR test 5 and several studies have shown that chest CT has a higher sensitivity for COVID-19 than RT-PCR, and could be considered as a primary tool for diagnosis [6][7][8][9] . In response to the pandemic, researchers have rushed to develop models using artificial intelligence (AI), in particular machine learning, to support clinicians.
Given recent developments in the application of machine learning models to medical imaging problems 10,11 , there is fantastic promise for applying machine learning methods to COVID-19 radiological imaging for improving the accuracy of diagnosis, compared to the gold-standard RT-PCR, whilst also providing valuable insight for prognostication of patient outcomes. These models have the potential to exploit the large amount of multi-modal data collected from patients and could, if successful, transform detection, diagnosis, and triage of patients with suspected COVID-19. Of greatest potential utility is a model which can not only distinguish COVID-19 from non-COVID-19 patients but also discern alternative types of pneumonia such as those of bacterial or other viral aetiologies. With no standardisation, AI algorithms for COVID-19 have been developed with a very broad range of applications, data collection procedures and performance assessment metrics. Perhaps as a result, none are currently ready to be deployed clinically. Reasons for this include: (i) the bias in small data sets; (ii) the variability of large internationally-sourced data sets; (iii) the poor integration of multi-stream data, particularly imaging data; (iv) the difficulty of the task of prognostication, and (v) the necessity for clinicians and data analysts to work side-by-side to ensure the developed AI algorithms are clinically relevant and implementable into routine clinical care. Since the pandemic began in early 2020, researchers have answered the 'call to arms' and numerous machine learning models for diagnosis and prognosis of COVID-19 using radiological imaging have been developed and hundreds of manuscripts have been written. In this paper we reviewed the entire literature of machine learning methods as applied to chest CT and CXR for the diagnosis and prognosis of COVID-19. As this is a rapidly developing field, we reviewed both published and preprint works to ensure maximal coverage of the literature.
While earlier reviews provided a broad analysis of predictive models for COVID-19 diagnosis and prognosis 12-15 , this review highlights the unique challenges researchers face when developing classical machine learning and deep learning models using imaging data. This review builds on the approach of Wynants et al. 12 : we assess the risk of bias in the papers considered, going further by incorporating a quality screening stage to ensure that only papers with sufficiently documented methodologies are reviewed in the greatest detail. We also focus our review on the systematic methodological flaws in the current machine learning literature for COVID-19 diagnosis and prognosis models using imaging data. Furthermore, we give detailed recommendations in five domains: (i) considerations when collating COVID-19 imaging datasets that are to be made public; (ii) methodological considerations for algorithm developers; (iii) specific issues about reproducibility of the results in the literature; (iv) considerations for authors to ensure sufficient documentation of methodologies in manuscripts, and (v) considerations for reviewers performing peer review of manuscripts. This review has been performed, and informed, by both clinicians and algorithm developers, with our recommendations aimed at ensuring the most clinically relevant questions are addressed appropriately, whilst maintaining standards of practice to help researchers develop useful models and report reliable results even in the midst of a pandemic.

Results
Study selection: Our initial search highlighted 2,212 papers that satisfied our search criteria; after removing duplicates we retained 2,150 papers and, of these, 415 papers had abstracts or titles deemed relevant to the review question, i.e. they introduce machine learning methods for COVID-19 diagnosis or prognosis using radiological imaging. Full-text screening retained 319 papers, of which, after quality review, 61 were included for discussion in this review (see Figure 1). Of these, 37 were deep learning papers, 22 were traditional machine learning papers and 2 were hybrid papers (using both approaches). The two hybrid papers both failed the CLAIM check but passed the RQS threshold.

Quality screening failures:
Deep learning papers. There were 254/319 papers which described deep learning-based models and 215 of these were excluded from the detailed review (including one hybrid paper). We find that 110 papers (51%) fail at least three of our identified mandatory criteria from the CLAIM checklist (Supp. Mat. A7), with 23% failing two and 26% failing just one. In the rejected papers, the three most common reasons for failing the quality check are insufficient documentation of: (1) how the final model was selected, in 61% (132); (2) the method of pre-processing of the images, in 58% (125), and (3) the details of the training approach (e.g. the optimizer, the loss function, the learning rate), in 49% (105).
Traditional machine learning papers. There are 68 papers which describe traditional machine learning methods and 44 of these were excluded from the review, i.e. the RQS is less than 6 or the datasets used are not specified in the paper. There are only two papers which have an RQS ≥ 6 but which fail to disclose the datasets used in the analysis. Of these excluded papers, the two factors leading to the lowest RQS scores are the omission of: (1) feature reduction techniques, in 52% of papers (23), and (2) model validation, in 61% of papers (27).
Full details can be found in Supp. Mat. A8.

Remaining papers for detailed analysis:
Deep learning papers. There are six non-mandatory CLAIM criteria not satisfied in at least half of the 37 papers, including: (1) 29 do not complete any external validation; (2) 30 do not perform any robustness or sensitivity analysis of their model; (3) 26 do not report the demographics of their data partitions; (4) 25 do not report the statistical tests used to assess significance of results or determine confidence intervals, and (5) 23 do not report confidence intervals for the performance.
Traditional machine learning papers. Of the 24 papers, including the two hybrid papers, none use longitudinal imaging, perform a prospective study for validation, or standardise image acquisition by using either a phantom study or a public protocol. Only six papers describe performing external validation and only four papers report the calibration statistics (the level of agreement between predicted risks and those observed) and associated statistical significance for the model predictions. The full RQS scores are in Supp. Mat. A9.

Datasets considered:
Public datasets were used extensively in the literature appearing in 32/61 papers (see Supp. Mat. B2 for list of public datasets, three papers use both public and private data). Private data is used in 32/61 papers with 21 using data from mainland China, two using data from France and the remainder using data from Iran, the USA, Belgium, Brazil, Hong Kong and the Netherlands.
Diagnostic models using CT scans and traditional machine learning methods. Eight papers employed traditional machine learning methods for COVID-19 diagnosis using hand-engineered features 43,60-65 or CNN-extracted features 49 . Four papers 49,62,63,65 incorporate clinical features alongside those obtained from the CT images. All papers using hand-engineered features employed feature reduction, using between 4 and 39 features in their final models. For final classification, five papers used logistic regression 43,61-64 , one used a random forest 60 , one a multilayer perceptron 49 and one compared many different machine learning classifiers to determine the best 65 . Accuracies ranged from 0·76 to 0·98 43,49,60-62 . As before, we caution against direct comparison. The traditional machine learning model in the hybrid paper 43 had an accuracy 0·05 lower than that of their deep learning model.
Prognostic models for COVID-19 using CT and CXR images: Eighteen papers developed models for the prognosis of patients with COVID-19 54,66-82 , fourteen using CT and four using CXR. These models were developed for predicting the severity of outcomes, including: death or the need for ventilation 80,81 , the need for ICU admission 66,75,79-81 , progression to acute respiratory distress syndrome 82 , the length of hospital stay 54,83 , the likelihood of conversion to severe disease 67,68,77 and the extent of lung infection 78 . Most papers used models based on a multivariate Cox proportional hazards model 54,80,81 , logistic regression 68,75-77,82,83 , linear regression 77,78 or a random forest 79,83 , or compared a wide variety of machine learning models such as tree-based methods, support vector machines, neural networks and nearest-neighbour clustering 66,67 .
Risks of bias: Following the PROBAST guidance, the risk of bias was assessed for all 61 papers in four domains: participants, predictors, outcomes and analysis; the results are shown in Table 1. We find that 54/61 papers had a high risk of bias in at least one domain with the others unclear in at least one domain.
Predictors. For models where the features have been extracted using deep learning models, the predictors are unknown and abstract imaging features. Therefore, for these papers (38/61), we cannot judge biases in the predictors. For 19 papers, the risk of bias is recorded as low due to the use of pre-defined hand-engineered features. For the remaining 4 papers, a high risk of bias is recorded due to the predictors being assessed with knowledge of the associated outcome.
Outcomes. The risk of bias in the outcome variable was found to be low for 24/61 papers, unclear for 26/61 and high for 11/61. To evaluate the bias in the outcome, we took different approaches for papers using private datasets and public datasets (three papers use a mixture).
For the 32 papers that use public datasets, the outcome was assigned by the originators of the dataset and not by the papers' authors. Papers using a public dataset generally have an unclear risk of bias (27/32) as they have used the outcome directly sourced from the dataset originator.
For the 32 papers that use private datasets, the COVID-19 diagnosis is established by a positive RT-PCR or antibody test in 23/32, which therefore have a low risk of bias. The other papers have a high (7/32) or unclear (2/32) risk of bias due to inconsistent diagnosis of COVID-19 18,43 , an unclear definition of a control group 66,68 , ground truths being assigned using the images themselves 29,57,63,74 , use of an unestablished reference to define the outcome 83 , or combining public and private datasets 44,69,85 .
Analysis. Only ten papers have a low risk of bias for their analysis. The high risk of bias in most papers is principally due to a small sample size of COVID-19 patients (leading to highly imbalanced datasets), use of only a single internal holdout set for validating their algorithm (rather than cross-validation or bootstrapping) and a lack of appropriate evaluation of the performance metrics (e.g. no discussion of calibration/discrimination) 21-23,25,26,47,51,55,67,82 . One paper with a high risk of bias 35 claims external validation on a dataset 16 , not realising that this already includes both of the datasets 17,86 that were used to train the algorithm.
Data analysis: There are two approaches for validating the performance of an algorithm, namely internal and external validation. For internal validation, the test data are from the same source as the development data; for external validation, they are from different sources. Including both internal and external validation gives more insight into the generalisability of the algorithm. We find that 47/61 papers consider internal validation only, with 13/61 using external validation 25,35,44,45,54,57,66,69,70,72,75,80,81 . Twelve of these used truly external test datasets and one tested on the same data the algorithm was trained on 35 .

In Table 2 we give the performance metrics quoted in each paper. Ten papers use cross-validation to evaluate model performance 24,38,39,50,52,60,68,77,79,83 , one uses both cross-validation and an external test set 44 , one quotes correlation metrics 78 and one has an unclear validation method 20 . The other papers all have an internal holdout or external test set with sensitivity and specificity derived from the test data using an unquoted operating point (with the exception of one paper 19 , which quotes an operating point of 0·5). One would expect the operating point to be chosen based on the algorithm's performance on the validation data used to tune and select the final algorithm. However, the ROC curves and AUC values are given for the internal holdout or external test data independent of the validation data.

In Figure 2, we show the quantity of data (split by class) used in the training cohort of 32 diagnosis models. We exclude many studies 21,23,25,26,28,32,35,38,46,48,61,74,87 because it was unclear how many images were used. If a paper only stated the number of patients (and not the number of images), we assumed that there was only one image per patient. We see that 20/32 papers have a reasonable balance between classes (the exceptions being 20,27,29,33,34,36,39,40,43,54,64,65 ). However, the majority of datasets are quite small, with 19/32 papers using fewer than 2,000 datapoints for development (the exceptions being 20,29,30,33,34,36,39,44,51,56,57,60,84 ). Only seven papers used a dataset with more than 2,000 datapoints that was also balanced between COVID-19 positive cases and the other classes 30,44,51,56,57,60,84 .

Figure 3 shows the number of images of each class used in the holdout/test cohorts. We find that 6/32 papers had an imbalanced testing dataset 20,27,36,39,40,64 and only 6/32 papers tested on more than 1,000 images 20,30,39,44,57,84 . Only 4/32 had both a large and balanced testing dataset 30,44,57,84 .

Discussion
Our systematic review highlights the extensive efforts of the international community to tackle the COVID-19 pandemic using machine learning. These early studies show promise for diagnosis and prognostication of pneumonia secondary to COVID-19. However, we have also found that current reports suffer from a high prevalence of deficiencies in methodology and reporting, with none of the reviewed literature reaching the threshold of robustness and reproducibility essential to support utilisation in clinical practice. Many studies are hampered by issues with poor quality data, poor application of machine learning methodology, poor reproducibility, and biases in study design. The current paper complements the work of Wynants et al. who have published a living systematic review 12 on publications and preprints of studies describing multivariable models for screening of COVID-19 infections in the general population, differential-diagnosis of COVID-19 infection in symptomatic patients, and prognostication in patients with confirmed COVID-19 infection. While Wynants et al. reviewed multivariable models with any type of clinical input data, the present review focuses specifically on machine learning based diagnostic and prognostic models using medical imaging. Furthermore, this systematic review employed specialised quality metrics for the assessment of radiomics and deep learning-based diagnostic models in radiology. This is also in contrast to previous studies that have assessed AI algorithms in COVID-19 13,14 . Limitations of the current literature most frequently reflect either a limitation of the dataset used in the model or methodological mistakes repeated in many studies that likely lead to overly optimistic performance evaluations.
Datasets: Many papers gave little attention to establishing the original source of the images (Supp. Mat. B2). When considering papers that use public data, readers should be aware of the following:
• Duplication and quality issues. There is no restriction on contributors uploading COVID-19 images to many of the public repositories 17,88-91 . There is a high likelihood of duplication of images across these sources and no assurance that the cases included in these datasets are confirmed COVID-19 cases (authors take a great leap in assuming this is true), so great care must be taken when combining datasets from different public repositories. Also, most of the images have been pre-processed and compressed into non-DICOM formats, leading to a loss in quality and a lack of consistency/comparability.
• Source issues. Papers using the widely shared pneumonia dataset 86 as a non-COVID-19 comparison class commonly fail to mention that it consists of paediatric patients aged between one and five. Developing a model using adult COVID-19 patients and very young pneumonia patients is likely to overperform because it is merely detecting children vs. adults. This dataset is also erroneously referred to as the Mooney dataset in many papers (being the Kermany dataset deployed on Kaggle 92 ). It is also important to consider the sources of each image class, for example whether images for different diagnoses come from different repositories. Maguolo et al. 93 demonstrate that, even with the lung region excluded entirely, a classifier can identify the source dataset of an image (e.g. Cohen 17 or Kermany 86 ) with an AUC between 0·9210 and 0·9997 and can 'diagnose' COVID-19 with an AUC of 0·68.
• Frankenstein datasets. The issues of duplication and source become compounded when public 'Frankenstein' datasets are used, that is, datasets assembled from other datasets and redistributed under a new name. For instance, one dataset 92 combines several other datasets 17,89,94 without realising that one of the component datasets 94 already contains another component 89 . This repackaging of datasets, although pragmatic, inevitably leads to problems with algorithms being trained and tested on identical or overlapping datasets whilst believing them to be from distinct sources.
• Implicit biases in the source data. Images uploaded to a public repository and those extracted from publications 94 are likely to have implicit biases due to the contribution source. For example, it is likely that more interesting, unusual or severe cases of COVID-19 appear in publications.
Methodology: All proposed models suffer from a high or unclear risk of bias in at least one domain. There are several methodological issues driven by the urgency in responding to the COVID-19 crisis and subtler sources of bias due to poor application of machine learning.
The urgency of the pandemic led to many studies using datasets that contain obvious biases or are not representative of the target population, e.g. paediatric patients. Before evaluating a model, it is crucial that authors report the demographic statistics for their datasets, including age and sex distributions. Diagnostic studies commonly compare their models' performance to that of RT-PCR. However, as the ground-truth labels are often determined by RT-PCR, there is no way to measure whether a model outperforms RT-PCR from accuracy, sensitivity, or specificity metrics alone. Ideally, models should aim to match clinicians using all available clinical and radiomic data, or to aid them in decision making.
Many papers utilise transfer learning in developing their model, on the assumption that it provides an inherent performance benefit. However, it is unclear whether transfer learning offers a significant performance benefit, owing to the overparametrisation of the models 44,61 . Many publications trained using standard input resolutions, such as 224-by-224 or 256-by-256 pixels, which are typical for ImageNet classification, suggesting that the pre-trained model, rather than clinical judgement, dictated the image rescaling used.
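To make this concrete, below is a minimal sketch of the transfer-learning pattern seen in many of the reviewed papers (an illustration of the common approach, not a model from any particular study; the backbone choice, class count and hyperparameters are hypothetical, and the torchvision weights API assumes a recent version of the library). Note how the 224-by-224 input size and normalisation statistics are inherited from the ImageNet pre-training rather than chosen on clinical grounds.

```python
# Minimal sketch of the transfer-learning pattern common in the reviewed papers
# (illustrative only; backbone, class count and hyperparameters are hypothetical).
import torch
import torch.nn as nn
from torchvision import models, transforms

# The 224x224 resolution and ImageNet normalisation statistics are inherited
# from the pre-trained backbone, not chosen for clinical reasons; a CXR is
# often several thousand pixels wide, so substantial detail is discarded here.
preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),   # replicate the CXR to 3 channels
    transforms.Resize((224, 224)),                 # input size dictated by ImageNet
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Load an ImageNet-pre-trained backbone and replace the classification head
# with a two-class output (COVID-19 vs non-COVID-19).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)

# Optionally freeze the backbone and fine-tune only the new head.
for name, param in model.named_parameters():
    if not name.startswith("fc."):
        param.requires_grad = False

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
```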
Recommendations: Based on the systematic issues we encountered in the literature, we offer recommendations in five distinct areas: (i) the data used for model development and common pitfalls; (ii) the evaluation of trained models; (iii) reproducibility; (iv) documentation in manuscripts, and (v) the peer review process. Our recommendations in areas (iii) and (iv) are largely informed by the 258 papers that did not pass our initial quality check, while areas (i), (ii) and (v) follow from our analysis of the 61 papers receiving our full review.
Recommendations for data. Firstly, we advise caution over the use of public repositories, which can lead to significant risks of bias due to the source issues and 'Frankenstein' datasets discussed above. Furthermore, authors should aim to match demographics across cohorts, an often neglected but significant potential source of bias; this can be impossible with public datasets that do not include demographic information, and including paediatric images 86 in the COVID-19 context introduces a strong bias.
Using a public dataset alone, without additional new data, can lead to community-wide overfitting on this dataset. Even if each individual study observes sufficient precautions to avoid overfitting, a community focused on outperforming benchmarks on a single public dataset encourages it. Many public datasets containing images taken from preprints receive these images in low-resolution or compressed formats (e.g. JPEG and PNG), rather than their original DICOM format. This loss of resolution is a serious concern, particularly for traditional machine learning models, if it is not uniform across classes, and the lack of DICOM metadata prevents exploration of model dependence on image acquisition parameters (e.g. scanner manufacturer, slice thickness, etc.).
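As an illustration of what is lost when images are shared only as JPEG/PNG, the hedged sketch below uses pydicom to pull a few acquisition parameters from DICOM headers so that model performance could later be stratified by scanner or protocol (the directory path and the selection of tags are hypothetical; not every tag is present in every study).

```python
# Sketch: extracting acquisition metadata from DICOM files so that model
# performance can be stratified by scanner/protocol. Such checks become
# impossible once images are redistributed only as JPEG/PNG.
from pathlib import Path
import csv
import pydicom

TAGS = ["Manufacturer", "ManufacturerModelName", "Modality",
        "ViewPosition", "SliceThickness", "KVP", "PatientAge", "PatientSex"]

def collect_metadata(dicom_dir: str, out_csv: str) -> None:
    rows = []
    for path in Path(dicom_dir).rglob("*.dcm"):
        ds = pydicom.dcmread(path, stop_before_pixels=True)  # header only
        row = {"file": str(path)}
        for tag in TAGS:
            row[tag] = str(ds.get(tag, ""))  # empty string if the tag is absent
        rows.append(row)
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["file"] + TAGS)
        writer.writeheader()
        writer.writerows(rows)

# Hypothetical usage:
# collect_metadata("data/covid_ct_dicom", "acquisition_metadata.csv")
```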
Regarding CXRs, researchers should be aware that algorithms might associate more severe disease not with CXR imaging features but with the view used to acquire the image. For example, in sick, immobile patients an anteroposterior (AP) CXR view is used for practicality rather than the standard posteroanterior (PA) projection. Overrepresentation of severe disease is a problem not only from the machine learning perspective but also in terms of clinical utility, since the most useful algorithms are those that can diagnose disease at an early stage 95 . The timing between imaging and RT-PCR tests was also largely undocumented, which has implications for the validity of the ground truth used; it is also important to recognise that a negative RT-PCR test does not necessarily mean that a patient does not have COVID-19. We encourage authors to evaluate their algorithms on datasets from the pre-COVID-19 era, as performed by 96 , to validate any claims that the algorithm is isolating COVID-19-specific imaging features. It is common for non-COVID-19 diagnoses (for example, non-COVID-19 pneumonia) to be determined from imaging alone; however, in many cases these images are the only predictors of the developed model, and using predictors to inform outcomes leads to optimistic performance estimates.
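A minimal sketch of two such sanity checks is given below, under stated assumptions (the dataframe columns and threshold are hypothetical): scoring the model on a pre-COVID-19-era cohort, where every positive call is by construction a false positive, and stratifying performance by CXR view to reveal whether the model is keying on AP versus PA projection rather than disease.

```python
# Sketch of two sanity checks for a trained CXR classifier (column names are
# hypothetical): (1) the positive-call rate on a pre-COVID-19-era cohort, where
# any COVID-19 call is necessarily a false positive, and (2) AUC stratified by
# CXR view (AP vs PA), to expose dependence on projection rather than disease.
import pandas as pd
from sklearn.metrics import roc_auc_score

def pre_covid_false_positive_rate(df: pd.DataFrame, threshold: float = 0.5) -> float:
    """df: one row per pre-2020 image with a 'covid_probability' column."""
    return float((df["covid_probability"] >= threshold).mean())

def auc_by_view(df: pd.DataFrame) -> pd.Series:
    """df: one row per image with 'label' (1 = COVID-19), 'covid_probability'
    and 'view' ('AP' or 'PA') columns; each view group must contain both classes."""
    return df.groupby("view").apply(
        lambda g: roc_auc_score(g["label"], g["covid_probability"])
    )

# Hypothetical usage:
# pre_covid = pd.read_csv("predictions_pre2020_cohort.csv")
# print("Pre-COVID positive-call rate:", pre_covid_false_positive_rate(pre_covid))
# test = pd.read_csv("predictions_test_cohort.csv")
# print(auc_by_view(test))
```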

Recommendations for evaluation.
We emphasise the importance of using a well-curated external validation dataset of appropriate size in order to assess generalizability to other cohorts. Any useful model for diagnosis or prognostication must be robust enough to give reliable results for any sample from the target population rather than just on the sampled population. Calibration statistics should be calculated for the developed models to inform predictive error and decision curve analysis 97 performed for assessing clinical utility. It is important for authors to state how they ensured that images from the same patient were not included in the different dataset partitions, for example by describing patient-level splits. This is relevant both for approaches that treat each 2D image or 3D volume as a single sample and for those that process 3D volumes as independent 2D slices, and whenever datasets contain multiple images from the same patient. When reporting results, it is important to include confidence intervals to reflect the uncertainty in the estimate, especially when training models on the small sample sizes commonly seen with COVID-19 data. Moreover, we stress the importance of not only reporting results, but also demonstrating model interpretability with methods such as saliency maps, which is a necessary consideration for adoption into clinical practice. We remind authors that it is inappropriate to claim that a model outperforms RT-PCR (or any other test) when that test was used to define the ground truth. Instead, authors should aim for models to either improve the performance and efficiency of clinicians, or, even better, to aid clinicians by providing interpretable predictions. Examples of interpretability techniques include: (i) informing the clinician of which features in the data most influenced the prediction of the model; (ii) linking the prognostic features to the underlying biology; (iii) overlaying an activation/saliency map on the image to indicate the region of the image which influenced the model's prediction, and (iv) identifying patients who had a similar clinical pathway.
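The sketch below illustrates two of these recommendations under stated assumptions (hypothetical arrays with one row per image and a patient identifier per row): a patient-level train/test split so that no patient contributes images to both partitions, and a percentile bootstrap confidence interval for the test AUC.

```python
# Sketch: patient-level data splitting and a bootstrap confidence interval
# for AUC (arrays and sizes are hypothetical; one row per image).
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_images = 1000
X = rng.normal(size=(n_images, 32))                 # image-level features
y = rng.integers(0, 2, size=n_images)               # 1 = COVID-19 positive
patient_ids = rng.integers(0, 400, size=n_images)   # several images per patient

# Patient-level split: every image from a given patient falls in exactly one
# partition, preventing leakage between training and test sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))
assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])

# ... train a model on X[train_idx], y[train_idx]; here we fake test scores.
test_scores = rng.random(size=len(test_idx))

def bootstrap_auc_ci(y_true, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC, resampling images with replacement."""
    boot_rng = np.random.default_rng(seed)
    aucs = []
    n = len(y_true)
    for _ in range(n_boot):
        idx = boot_rng.integers(0, n, size=n)
        if len(np.unique(y_true[idx])) < 2:          # need both classes present
            continue
        aucs.append(roc_auc_score(y_true[idx], scores[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return roc_auc_score(y_true, scores), (lo, hi)

auc, (ci_lo, ci_hi) = bootstrap_auc_ci(y[test_idx], test_scores)
print(f"Test AUC {auc:.3f} (95% CI {ci_lo:.3f}-{ci_hi:.3f})")
```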
Most papers derive their performance metrics from the test data alone, with an unstated operating point used to calculate sensitivity and specificity. Clinical judgement should be used to identify the desired sensitivity or specificity of the model, and the operating point should be derived from the development data. The sensitivity and specificity at this operating point should then be reported separately for the validation and test data, so that any difference between them is apparent. Using an operating point of 0·5 and reporting only the test sensitivity and specificity fails to convey the reliability of the threshold. This is a key aspect of generalisability, and omitting it would see an FDA 510(k) submission rejected.
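A minimal sketch of this recommendation, assuming a hypothetical binary classifier with predicted probabilities for separate validation and test cohorts: the threshold is chosen on the validation data to meet a clinically specified sensitivity and then held fixed when reporting test performance.

```python
# Sketch: choose the operating point on the validation set to meet a clinically
# specified sensitivity, then report test sensitivity/specificity at that fixed
# threshold (arrays are hypothetical model outputs and labels).
import numpy as np
from sklearn.metrics import roc_curve

def threshold_for_sensitivity(y_val, p_val, target_sensitivity=0.90):
    """Pick the highest threshold whose validation sensitivity meets the target."""
    fpr, tpr, thresholds = roc_curve(y_val, p_val)
    eligible = thresholds[tpr >= target_sensitivity]
    return eligible.max() if eligible.size else thresholds.min()

def sensitivity_specificity(y_true, p, threshold):
    pred = p >= threshold
    tp = np.sum(pred & (y_true == 1))
    fn = np.sum(~pred & (y_true == 1))
    tn = np.sum(~pred & (y_true == 0))
    fp = np.sum(pred & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical usage, reporting validation and test performance separately:
# thr = threshold_for_sensitivity(y_val, p_val, target_sensitivity=0.90)
# print("Validation:", sensitivity_specificity(y_val, p_val, thr))
# print("Test:      ", sensitivity_specificity(y_test, p_test, thr))
```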

Recommendations for replicability.
A possible ambiguity arises due to updating of publicly available datasets or code. Therefore, we recommend that a cached version of the public dataset be saved, or the date/version quoted, and specific versions of data or code be appropriately referenced. (Git commit ids or tags can be helpful for this purpose to reference a specific version on GitHub, for example.) We acknowledge that although perfect replication is potentially not possible, details such as the seeds used for randomness and the actual partitions of the dataset for training, validation and testing would form very useful supplementary materials.
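A brief sketch of the kind of provenance record that makes replication feasible is shown below (the dataset path, file names and the use of PyTorch are assumptions; the point is simply to store seeds, an immutable dataset fingerprint and the exact code version alongside the results).

```python
# Sketch: record the information needed to replicate an experiment --
# random seeds, a fingerprint of the exact data used, and the code version.
# Paths and the choice of PyTorch are hypothetical.
import hashlib, json, random, subprocess
from pathlib import Path
import numpy as np
import torch

SEED = 2020
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

def dataset_fingerprint(data_dir: str) -> str:
    """SHA-256 over the sorted file names and contents of the dataset snapshot."""
    h = hashlib.sha256()
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            h.update(path.name.encode())
            h.update(path.read_bytes())
    return h.hexdigest()

provenance = {
    "seed": SEED,
    "dataset_sha256": dataset_fingerprint("data/covid_cxr_snapshot"),
    "git_commit": subprocess.check_output(
        ["git", "rev-parse", "HEAD"]).decode().strip(),
    "splits_file": "splits/train_val_test_patient_ids.json",  # exact partitions
}
Path("provenance.json").write_text(json.dumps(provenance, indent=2))
```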

Recommendations for authors.
For authors, we recommend assessing their paper against appropriate established frameworks, such as RQS, CLAIM, TRIPOD, PROBAST and QUADAS [98][99][100][101][102] . By far the most common point leading to exclusion was failure to state the data pre-processing techniques in sufficient detail. As a minimum, we expected papers to state any image resizing, cropping and normalisation used prior to model input, and with this small addition many more papers would have passed through the quality review stage. Other commonly missed points include details of the training (such as number of epochs and stopping criteria), robustness or sensitivity analysis, and the demographic or clinical characteristics of patients in each partition.

Recommendations for reviewers.
For reviewers, we also recommend the use of the checklists 98-102 in order to better identify common weaknesses in the reporting of methodology. The most common issue in the papers we reviewed was the use of biased datasets and/or methodologies. For non-public datasets, it may be difficult for reviewers to assess possible biases if an insufficiently detailed description is given by the authors. We strongly encourage reviewers to ask for clarification from the authors if there is any doubt about bias in the model being considered. Finally, we suggest using reviewers from a combination of both medical and machine learning backgrounds, as they can judge the clinical and technical aspects in different ways.
Challenges and opportunities: Models developed for diagnosis and prognostication from radiological imaging data are limited by the quality of their training data. While many public datasets exist for researchers to train deep learning models for these purposes, we have determined that these datasets are not large enough, or of suitable quality, to train reliable models, and all studies using publicly available datasets exhibit a high or unclear risk of bias. However, the size and quality of these datasets can be continuously improved if researchers world-wide submit their data for public review. Because of the uncertain quality of many COVID-19 datasets, it is likely more beneficial to the research community to establish a database which has a systematic review of submitted data than it is to immediately release data of questionable quality as a public database.
The intricate link of any AI algorithm for detection, diagnosis or prognosis of COVID-19 infections to a clear clinical need is essential for successful translation. As such, complementary computational and clinical expertise, in conjunction with high quality healthcare data, are required for the development of AI algorithms. Meaningful evaluation of an algorithm's performance is most likely to occur in a prospective clinical setting. Like the need for collaborative development of AI algorithms, the complementary perspectives of experts in machine learning and academic medicine were critical in conducting this systematic review.
Limitations: Due to the rapid development of diagnostic and prognostic AI algorithms for COVID-19, several new preprints had been released by the time we finalised our analyses; these are not included in this study.
Our study has limitations in terms of methodological quality assessment and exclusion. Several high-quality papers published in high-impact journals, including Radiology, Cell and IEEE Transactions on Medical Imaging, were excluded due to a lack of documentation of the proposed algorithmic approaches. As the AI algorithms are the core of the diagnosis and prognosis of COVID-19, we only included works that are reproducible. Furthermore, we acknowledge that the CLAIM requirements are harder to fulfil than the RQS ones, and the paper quality check is therefore not fully comparable between the two. We underline that several excluded papers were preprint versions and may pass the systematic evaluation in a future revision.
In our PROBAST assessment, for the "were there a reasonable number of participants?" question of the analysis domain, we required a model to be trained on at least 20 events-per-variable for the size of the dataset to score a low risk of bias 101 . However, events-per-variable may not be a useful metric for determining whether a deep learning model will overfit. Despite their gross over-parameterisation, deep learning models generalise well in a variety of tasks, and it is difficult to determine a priori whether a model will overfit given the number of training examples 103 . A model trained on fewer than 500 COVID-19-positive images was deemed to have a high risk of bias for this question, and a model trained on more than 2,000 COVID-19-positive images qualified as low risk. However, in determining the overall risk of bias for the analysis domain we factor in nine PROBAST questions, so it is possible for a paper using fewer than 500 images to achieve at best an unclear overall risk of bias for its analysis. Similarly, papers with over 2,000 images can still have an overall high risk of bias for their analysis if they do not account for other sources of bias.
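As a worked illustration of the events-per-variable criterion (the 10-predictor count below is a hypothetical example, not drawn from any reviewed paper):

```latex
% Events-per-variable (EPV) for a binary outcome model:
\[
  \mathrm{EPV} \;=\;
  \frac{\text{number of outcome events (e.g. COVID-19-positive cases)}}
       {\text{number of candidate predictors}} .
\]
% For example, a logistic model with 10 candidate predictors would need at least
% 20 x 10 = 200 COVID-19-positive cases to meet the EPV >= 20 threshold used here.
```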

Conclusions
This is the first systematic review of the machine learning literature using CT and CXR imaging for COVID-19 diagnosis and prognosis that emphasises the quality of the applied methodologies and the reproducibility of the methods. We found that no paper in the current literature has all of: (i) a sufficiently documented manuscript describing a reproducible method; (ii) a method which follows best practice for developing a machine learning model, and (iii) sufficient external validation to justify the wider applicability of the method. We give detailed recommendations for data curators, machine learning researchers, manuscript authors and reviewers to ensure that the highest quality methods are developed, and that they are reproducible and free from biases in either the underlying data or the model development.
Despite the huge efforts of researchers to develop machine learning models for COVID-19 diagnosis and prognosis, we find methodological flaws and significant biases throughout the literature, leading to highly optimistic reported performance. In their current reported form, none of the machine learning models included in this review are likely candidates for clinical translation for the diagnosis/prognosis of COVID-19. Higher quality datasets, manuscripts with sufficient documentation to be reproducible and external validation are required to increase the likelihood of models being taken forward and integrated into future clinical trials to establish independent technical and clinical validation as well as cost-effectiveness.

Methods
The methods for performing this systematic review are registered with PROSPERO [CRD42020188887] and were agreed by all authors before the start of the review process, to avoid bias.
Search strategy and selection criteria: We have followed the PRISMA checklist 104 and include this in Supp. Mat. C. We performed our search to identify published and unpublished works using the arXiv and the "Living Evidence on COVID-19" database 105 , a collation of all COVID-19 related papers from EMBASE via OVID, MEDLINE via PubMed, bioRxiv and medRxiv. The databases were searched from January 1, 2020 through to October 3, 2020. The full search strategy is detailed in the appendix. The initial cut-off is chosen to specifically include all early COVID-19 research, given that the World Health Organisation was only informed of the "pneumonia of unknown cause" on December 31, 2019 106 . An initial search was performed on May 28, 2020, with updated searches performed on June 24, 2020, August 14, 2020, August 15, 2020 and October 3, 2020 to identify any relevant new papers published in the intervening period. Since many of the papers identified are preprints, some of them were updated or published between these dates; in such cases, we reviewed either the preprint as it stood at the later search date or the published version. Some papers were identified as duplicates, either by ourselves or by Covidence 107 ; in these instances we ensured that the latest version of the paper was reviewed. We used a three-stage process to determine which papers would be included in this review. During the course of the review, one author (A. A.-R.) submitted a paper 108 which was in scope for this review; however, we excluded it due to the potential conflict of interest.
Title and abstract screening. In the first stage, a team of ten reviewers assessed papers for eligibility, screening the titles and abstracts to ensure relevance. Each paper was assessed by two reviewers independently and conflicts were resolved by consensus of the ten reviewers (see Supp. Mat. A4).
Full-text screening. In the second stage, the full text of each paper was screened by two reviewers independently to ensure that the paper was eligible for inclusion with conflicts resolved by consensus of the ten reviewers.
Quality review. In the third stage, we considered the quality of the documentation of methodologies in the papers.
Note that exclusion at this stage is not a judgement on the quality or impact of a paper or algorithm, merely that the methodology is not documented with enough detail to allow the results to be reliably reproduced.
At this point we separated machine learning methods into deep learning methods and non-deep learning methods (we refer to these as traditional machine learning methods). The traditional machine learning papers were scored using the Radiomic Quality Score (RQS) of Lambin et al. 98 , while the deep learning papers were assessed against the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) of Mongan et al. 99 . The ten reviewers were assigned to five teams of two: four of the ten reviewers have a clinical background and were paired with non-clinicians in four of the five teams to ensure a breadth of experience when reviewing these papers. Within each team, the two reviewers independently assessed each paper against the appropriate quality measure. Where papers contained both deep learning and traditional machine learning methodologies, these were assessed using both CLAIM and RQS. Conflicts were resolved by a third reviewer.
To restrict consideration to only those papers with the highest quality documentation of methodology, we excluded papers that did not fulfil particular CLAIM or RQS requirements. For the deep learning papers evaluated using the CLAIM checklist, we selected eight checklist items deemed mandatory to allow reproduction of the paper's method and results. For the traditional machine learning papers, evaluated using the RQS, we used a threshold of 6 points out of 36 for inclusion in the review along with some basic restrictions, such as detail of the data source and how subsets were selected. The rationale for these CLAIM and RQS restrictions is given in Supp. Mat. A7. If a paper was assessed using both CLAIM and RQS then it only needed to pass one of the quality checks to be included.
In a number of cases, various details of pre-processing, model configuration or training setup were not discussed in the paper, even though they could be inferred from a referenced online code repository (typically GitHub). In these cases, we have assessed the papers purely on the content in the paper, as it is important to be able to reproduce the method and results independently of the authors' code.

Risk of bias in individual studies:
We use the Prediction model Risk Of Bias Assessment Tool (PROBAST) of Wolff et al. 101 to assess bias in the datasets, predictors and model analysis in each paper. The papers that passed the quality assessment stage were split amongst three teams of two reviewers to complete the PROBAST review. Within each team, the two reviewers independently scored the risk of bias for each paper and then resolved conflicts by consensus; any remaining conflicts were resolved by a third reviewer.
Data analysis: The papers were allocated amongst five teams of two reviewers. These reviewers independently extracted the following information: (i) whether the paper described a diagnosis or prognosis model; (ii) the data used to construct the model; (iii) whether predictive features were used for the model construction; (iv) the sample sizes used for the development and holdout cohorts (along with the number of COVID-19 positive cases); (v) the type of validation performed; (vi) the best performance quoted in the paper for the validation cohort (whether internal, external or both), and (vii) whether the code for training the model and the trained model were publicly available. Any conflicts were initially resolved by team discussions and remaining conflicts were resolved by a third reviewer.
Role of the funding source: The funders of the study had no role in the study design, data collection, data analysis, data interpretation, or writing of the manuscript. All authors had full access to all the data in the study and had final responsibility for the decision to submit for publication.

[Figure caption fragment] … et al. 84 and Zhang et al. 30 are excluded from the figure as they use significantly more testing data (14,182 and 5,869 images respectively) than other papers; there are also a large number of images (1,237) in the testing dataset in Wang et al. 57 which are unidentified in the paper (we include these in the unspecified COVID-19 negative class).