Introduction

For decades, computer-aided diagnosis (CAD) algorithms have made use of computer-extracted tumour characteristics for improved disease detection and diagnosis, treatment planning, and follow-up1, with some particularly notable developments in breast and lung cancer screening2,3. More recently, radiomics, involving high-throughput computer-extracted quantitative characterization of healthy or pathological structures and processes as captured by in vivo medical imaging, has emerged as an extension of CAD4. Similar to other ‘omics’ technologies, the extraction of such large quantities of information from images obtained during standard clinical workflows enables extensive tumour characterization and facilitates assessments of both within-tumour and between-tumour heterogeneity and longitudinal changes1. Interest in both CAD and radiomics (two terms that are occasionally used interchangeably) has increased substantially within the past two decades; a PubMed search for “(computer-aided diagnosis) OR CAD OR radiomic OR radiomics AND (cancer OR tumor OR tumour)” yields over 44,000 publications since 1967, over 85% of which are from 2005 onwards (Fig. 1).

Fig. 1: Number of publications per year since 1967.

A PubMed search of “(computer-aided diagnosis) OR CAD OR radiomic OR radiomics AND (cancer OR tumor OR tumour)” was performed. The number of items published each year is presented as of 20 September 2022.

Similar to CAD, radiomics can assist with clinical decision-making. Radiomic features, namely measurements extracted from medical images (currently usually CT, MRI or digital radiography), are combined with data on clinical characteristics and from other omics analyses to detect disease, predict the likelihood of death, disease progression and/or recurrence by a specific time point, evaluate response to therapy or identify an appropriate course of treatment. The ultimate goal of radiomic analyses should be the development of a test, defined by the FDA–NIH Biomarker Working Group as a system comprising materials for measurement, procedures for measurement, and methods or criteria for interpretation5, that can be used to guide medical decision-making as in disease diagnosis and management.

Despite a dramatic increase in research output over the past two decades (Fig. 1), the vast majority of radiomic studies have not yet led to clinically useful tests. Across all medical indications, 343 artificial intelligence and machine learning-based tests currently have FDA clearance, only a small proportion of which are based on radiomics6. This lack of clinical translation might be attributable to several factors. Most radiomic studies assess correlations between certain radiomic features and a biological or clinical end point of interest; therefore, the added value of the radiomic test (such as improved clinical performance or reduced invasiveness) is often neglected, as is clinical utility, namely that acting upon the information provided leads to a favourable benefit–risk balance for the patient. Additionally, as established in the statistical and machine learning literature, analyses of high-throughput data, such as those obtained using radiomics, are fraught with potential issues, including insufficient data for development and validation and improper application of statistical methodology for the specific purpose of the test. Furthermore, different studies have used widely varying protocols for image acquisition and feature extraction. Several studies have shown the effects of differences in data acquisition, image reconstruction and image post-processing on downstream analyses; different software platforms or even different versions of the same software can produce widely varying results regarding the strength and direction of the associations between features and outcomes7.

Existing guidelines on the acquisition and analysis of radiomic data include a radiomic quality score to evaluate the completeness and appropriateness of such an analysis8, computational procedures for commonly used types of features9, and protocols for image acquisition, feature extraction and statistical analysis10,11. However, radiomics would also benefit from a roadmap for the entire process of translating radiomic data into clinically useful tools for guiding clinical care, encompassing not only recommendations for image acquisition and processing, feature extraction, and statistical analysis but also aspects such as test lockdown and demonstrating clinical utility. Such a roadmap has yet to be published for radiomics, although similar criteria and guidelines have been compiled for other omics technologies12.

Herein, we present a 16-point list of criteria for the translation of radiomics into clinically useful tests. These criteria (Box 1) were developed by radiologists, physicists and statisticians with extensive experience with radiomics and other omics technologies, and are based on analogous recommendations developed for other omics technologies12. These criteria are also adapted to accommodate issues that are unique to radiomics, such as vendor-driven changes in imaging technology and software and the dynamic nature of certain models, and are intended to help researchers to navigate the translation process and catalyse an increase in the number of clinically useful radiomic tests.

Clinical application

Prior to any formal development and validation, the intended clinical use of the radiomic test and the target population should be established (criterion 1). The use of the test in clinical care should be expected to guide disease assessment and management decisions in a way that leads to a favourable benefit–risk tradeoff and offers advantages over other tests designed to serve the target population in the same role (criterion 2). The intended clinical use will have important implications for the subsequent stages of development and validation, including which features to extract from the imaging data, the optimal imaging time points and the design of the clinical trial to directly assess the performance of the test in its intended role.

Criterion 1: intended role and target population

Radiomics is often used for either screening or cancer diagnosis. For example, MRI radiomics is useful for the diagnosis of breast abnormalities13 and CT radiomics for the detection of lesions in various organs, including lungs, brain and prostate14. The use of radiomics in prognostication, namely predicting the clinical outcomes of patients undergoing standard therapy, is an area of increasing research interest15; for example, CT-based radiomics might be a useful method of predicting the outcomes of patients with head and neck squamous cell carcinomas or non-small-cell lung cancer receiving standard-of-care therapies16. Radiomic tests can also be used for treatment selection, namely as assays designed to indicate benefit, or lack thereof, from a specific class of therapies; for example, a model of oestrogen receptor expression based on tumour size, shape and entropy features on dynamic contrast-enhanced MRI (DCE-MRI) has been developed to inform treatment selection for patients with breast cancer17. Radiomic tests might also be used to assess response to treatment and monitor disease status18,19,20.

Roles in which radiomic tests could serve have been summarized comprehensively elsewhere21. In certain scenarios, the same radiomic test can have more than one role; for example, the aforementioned model of oestrogen receptor expression might also be useful for prognostication17. However, ‘off label’ use of radiomic tests, namely application in a role other than the one for which the test has been shown to be clinically useful, is discouraged. The criteria for clinical performance depend strongly on the intended role (see criterion 14) as is typical in the regulatory clearance and approval processes applied to both new drugs and medical devices. Diagnostic radiomic tests should have an adequate level of accuracy in detecting disease. Prognostic radiomic tests should have an adequate ability to predict death, disease recurrence or progression depending on the intended role of the test. Tests designed for therapy selection should also be sufficient to predict outcomes, such as death or disease progression, in patients receiving the therapy of interest. If the goal is to guide the choice between one treatment and a designated alternative approach, the outcomes of patients receiving each therapy need to be studied. However, if the predictive goal is merely to identify those patients who are most likely to respond to a particular therapy, then the test should have adequate ability to predict either a response or a level of expression of an established predictive biomarker sufficient to indicate a response to the treatment of interest. The translation process outlined in this Review should therefore be applied for each role in which a specific radiomic test is likely to be useful.

Aspects of the target population to specify include those pertaining to disease characteristics (such as primary tumour types and grades, disease stage, molecular subtypes, risk groups and receptor expression status) and treatment history. A radiomic test might also be useful in multiple target populations; the test based on the model described by Aerts et al.16, for example, might be useful for predicting the outcomes of patients with head and neck cancer or non-small-cell lung cancer receiving standard-of-care therapies. However, researchers are encouraged not to assume, without appropriate evidence, that the utility of a radiomic test extends across target populations because the technical performance of the imaging device and feature extraction software and the clinical performance of the test might not be consistent across different populations.

Criterion 2: patient benefit from use of the test in clinical care

The benefit of using a radiomic test should be clearly specified in the context of available treatments for the target population and access to other tests serving similar roles. A radiomic test might be used to stratify patients to optimize the choice of therapy for each individual, thus sparing patients from receiving ineffective or unnecessary treatments. A predictive test designed to guide treatment selection might differentiate between patients who are likely to derive clinical benefit (such as a longer median progression-free survival (PFS) or overall survival duration) from a specific therapy or class of therapies and those who are not. A prognostic test could identify patients with particularly poor outcomes on standard-of-care therapy who might consider a more intensive regimen; however, such a test will probably only be useful if a suitable, alternative treatment is available22. Moreover, a radiomic test might help to direct clinical management in a way that reduces treatment-related toxicities, including financial toxicities; prognostic tests might also identify patients whose outcomes on standard well-tolerated regimens are so good that they need not consider additional highly aggressive or toxic treatments or might consider treatment de-escalation.

The decision to use a radiomic test over other tests addressing the same clinical problem should be supported by a compelling reason. The radiomic test could have superior clinical performance to a standard test serving in the same role. The radiomic test might be able to identify underlying characteristics that cannot be detected as easily using other means; for example, assessing intratumour and intertumour heterogeneity of oestrogen receptor expression might be much less difficult when using radiomic tests compared with immunohistochemistry assays. Alternatively, the radiomic test might have a similar level of clinical performance but reduced invasiveness (such as biopsy avoidance), a reduced financial burden, greater convenience, or a reduction of one or more associated risks (potential harms, discomforts or exposures inherent to the testing procedure).

Imaging and feature extraction

Standard operating procedures for imaging, including protocols for the administration of contrast or imaging agents, specifications for image acquisition, procedures for image processing, and the timing of the scans should be in place (criterion 3) as should those for feature extraction, including a list of quantities to compute from the imaging data, segmentation algorithms, and computational algorithms and software to compute these quantities (criterion 4). The resulting feature measurements should also have been shown to have adequate technical validity (criterion 5). In most cases, this would entail each feature exhibiting strong repeatability and reproducibility or, if feasible, robust agreement with a standardized reference measurement of the underlying characteristic. A procedure to correct feature measurements for technical artefacts (the effects of factors such as imaging centre, device, operator or device-calibration settings on the distribution of the feature measurements) should also have been developed (criterion 6).

Criterion 3: standard operating procedures for image acquisition and processing

Image acquisition parameters should be specified in order to optimize image quality (for example, by keeping imaging noise to an acceptably low level or ensuring that the spatial, contrast and/or temporal resolution is adequate) and should be standardized to maximize reproducibility across imaging centres, devices and operators. Numerous studies have demonstrated the strong dependency of the resulting feature measurements on the imaging protocol23,24. Standard operating procedures for image acquisition could be based on established imaging guidelines such as those provided by the American College of Radiology25, the Society of Nuclear Medicine and Molecular Imaging26, the European Association of Nuclear Medicine27 or the Quantitative Imaging Biomarker Alliance28.

Image acquisition protocols will depend on the intended use as well as on the imaging modality and features that will be extracted. If the radiomic test is intended for diagnosis, spatial resolution will be an important consideration29. In theory, tests involving the analysis of morphological features will depend more on spatial resolution30, whereas kinetic features, such as those derived from fast DCE-MRI, will depend more on temporal resolution31. In practice, perfect standardization is infeasible as is optimization of the protocol with respect to all the features to be extracted. The optimal resolution in DCE-MRI for breast cancer diagnosis often involves a compromise between spatial and temporal parameters to obtain measurements of morphological and kinetic features with adequate technical validity32. Furthermore, image acquisition protocols, particularly those applied to standard-of-care imaging approaches, are often determined in an ad hoc manner.

The time points at which patients undergo imaging should also be specific to the intended use. Radiomic tests intended for treatment selection will involve scans obtained prior to intervention. Those intended for response assessment will involve scans obtained not only prior to the intervention but also at specified time points during and after therapy (often termed ‘delta-radiomics’33). The timing of response assessment can vary substantially; for example, radiomic tests designed to measure early metabolic response could involve imaging at baseline and then at a certain number of days to weeks following the initiation of treatment34 whereas assessment of the effects of certain classes of therapies, such as anti-vascular agents, might occur in a timescale of hours to days31.

Standard operating procedures should include processes designed to normalize the intensity values of images obtained from different patients and from the same patient. Normalization techniques include image resampling with filtering35, normalizing voxel intensity values relative to a histogram or global and local intensities on a reference image36,37, or harmonizing across different scans obtained from different populations or acquisition sites38. For certain features (such as second-order textural features), discretization through methods such as grey-level resampling and histogram binning is also needed11,39. Although the grey-level and standardized uptake value discretization methods used vary from centre to centre, these values can be normalized relative to a reference set of measurements40,41. Alternatively, standardized image preprocessing methods can be applied42. A comprehensive summary of imaging harmonization methods is provided elsewhere43.
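To make these preprocessing steps concrete, the following minimal Python sketch illustrates intensity normalization within a region of interest followed by fixed-bin-number grey-level discretization. It is illustrative only: the z-score approach, the use of a binary mask and the bin count of 32 are assumptions for this example rather than recommendations drawn from any specific guideline.

```python
import numpy as np

def normalize_intensities(volume: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Z-score normalize voxel intensities using the region of interest defined by `mask`."""
    roi = volume[mask > 0]
    return (volume - roi.mean()) / roi.std()

def discretize(volume: np.ndarray, mask: np.ndarray, n_bins: int = 32) -> np.ndarray:
    """Resample intensities within the ROI range to a fixed number of grey levels."""
    roi = volume[mask > 0]
    edges = np.linspace(roi.min(), roi.max(), n_bins + 1)
    # np.digitize against the interior edges yields bins 0..n_bins-1; shift to 1..n_bins
    return np.clip(np.digitize(volume, edges[1:-1]) + 1, 1, n_bins)
```

Whether a fixed bin number or a fixed bin width is used is itself a choice that affects downstream texture features and should therefore be recorded in the standard operating procedure.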

Criterion 4: standard operating procedures for feature extraction

Prior to formal test development, a list of quantities that will be extracted from the imaging data should be established. Traditionally, radiomic features are human-engineered and are extracted through delineation of the tumour from surrounding tissues using manual, semi-automated or fully automated segmentation44,45,46 followed by application of pre-specified computational procedures to the voxel data within the region of interest10. Human-engineered features include those quantifying size (tumour dimension), shape (3D geometry), morphology (margin characteristics), enhancement texture (the extent of heterogeneity within the texture of the tumour and/or contrast uptake), quantifications of kinetic curves (shape of the curve and quantifications of the physiological process of uptake and washout of the contrast agent) and enhancement-variance kinetics (such as the time course of spatial variance of enhancement within the tumour)47,48,49,50.

Extraction of such features will typically involve conversion and harmonization of the imaging data (criterion 6), post-processing (such as interpolation to cubic voxels, denoising, and correction of intensity and partial volume effects), image segmentation, region-of-interest extraction, and feature computation9. Existing guidelines and recommendations can serve as a starting point for the development of a standard operating procedure for feature extraction but will often require adaptation to suit both the target population and the imaging modality51.
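As a simple illustration of the final computation step, the sketch below derives a handful of human-engineered features from the voxels inside a segmented region of interest. This is a hypothetical Python example; the feature definitions are generic first-order quantities rather than those of any particular feature library or guideline.

```python
import numpy as np

def first_order_features(volume: np.ndarray, mask: np.ndarray,
                         voxel_volume_mm3: float = 1.0, n_bins: int = 32) -> dict:
    """Compute a few generic size and histogram features from a segmented ROI."""
    roi = volume[mask > 0].astype(float)
    hist, _ = np.histogram(roi, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return {
        "volume_mm3": roi.size * voxel_volume_mm3,                      # size
        "mean_intensity": roi.mean(),                                   # first-order statistic
        "skewness": ((roi - roi.mean()) ** 3).mean() / roi.std() ** 3,  # histogram asymmetry
        "entropy": -(p * np.log2(p)).sum(),                             # intensity heterogeneity
    }
```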

Alternatively, features of interest can be computer learned, namely extracted by direct application of computer algorithms to voxel data without the need for human intervention such as those computed using deep learning networks52,53. In this approach, a deep learning network can be applied to the voxel-level data and the last layer of the underlying convolutional neural network is taken as a set of features, similar to those used by Li et al. to predict IDH1 mutation status in patients with low-grade gliomas54. An illustration of the differences between such features and human-engineered ones is provided in Fig. 2. Computer-learned features have been considered in conjunction with operator-dependent features55 or even as a replacement. Such features are often less transparent in their computation and less interpretable; nonetheless, they might capture information that human-engineered features cannot, often resulting in more reproducible feature extraction and models with improved performance54. Fully automated extraction of such features enables the processing and computation of larger volumes of data with reductions in the variability of test output values owing to the elimination of human error during processes such as manual delineation and segmentation52.
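A minimal sketch of this idea is shown below, using an untrained ResNet-18 backbone purely as a placeholder; in practice the architecture, any pretraining and the input preprocessing would all form part of the locked-down procedure, and the single 2D slice input is an assumption made here for brevity.

```python
import torch
import torchvision.models as models

backbone = models.resnet18(weights=None)  # placeholder network; not a validated radiomic model
backbone.fc = torch.nn.Identity()         # drop the classification head; keep pooled features
backbone.eval()

roi_slice = torch.rand(1, 1, 224, 224)    # hypothetical normalized tumour ROI slice
x = roi_slice.repeat(1, 3, 1, 1)          # replicate to the 3 channels the backbone expects
with torch.no_grad():
    features = backbone(x)                # 512-dimensional computer-learned feature vector
print(features.shape)                     # torch.Size([1, 512])
```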

Fig. 2: Types of radiomic analysis.

a, Analyses using human-engineered features. Different types of features (such as histogram, shape or texture) are extracted from the images according to a pre-specified computational procedure. Variable selection techniques are used to identify which of these features are important in diagnosing a medical condition. The values of these selected variables are combined into a model to produce a diagnosis. b, Analyses using machine learning and artificial intelligence algorithms. The voxel-level data are fed into a convolutional neural network consisting of multiple hidden layers whose output is used to produce a diagnosis.

Criterion 5: technical validity of the feature measurements

Adequate technical validity typically entails assessing the repeatability and reproducibility of the feature measurements. Repeatability describes the precision of measurements when specific imaging and feature extraction standard operating procedures are applied multiple times to the same patient at the same centre by the same operators within a short period of time. Reproducibility describes the precision of repeat measurements when factors such as imaging centre and operator are allowed to vary56,57,58. Study designs and the statistical methodology for studies assessing repeatability and reproducibility have been summarized in detail elsewhere59. Strong technical validity is important for model development and the establishment of the clinical utility of a radiomic test given that poor feature reproducibility, as mentioned previously, can produce widely varying results regarding the strength and direction of the association between features and outcomes7 and result in models with insufficient levels of performance60.

Ideally, repeatability and reproducibility would be assessed using clinical data. In such clinical studies, patients undergo repeat scans with the feature extraction standard operating procedure then applied to each image. Such studies have been conducted61,62, although they are often difficult in practice as patients can be reluctant to participate owing to a lack of direct benefit, the inconvenience of undergoing multiple scans and, with certain techniques, additional exposure to contrast agents or ionizing radiation. An alternative approach involves different operators extracting features from the same set of images, possibly at different centres; however, this approach, although also feasible as a retrospective method, only enables the assessment of variability attributed to the feature extraction process51,56,58.

As an alternative approach, some components of technical validation can be conducted using in vitro or in silico phantoms, simulated digital reference images or synthetic data such as those produced by generative adversarial network systems40,63. However, conclusions on technical validity based on data obtained using phantoms and digital reference images are likely to be overly optimistic given that these objects cannot fully capture the complexity of actual patients. Several authors have provided recommendations on the minimum technical validity requirements of phantoms and digital reference images59,64.

Technical validity can also be assessed using the level of agreement between feature measurements and certain comparator quantities (for example, with a measurement of the underlying biological characteristic according to an independent in vitro assay), bias (the mean difference between the measurement and the true value of the characteristic being measured), and the linearity of the relationship between the feature measurement and the true value59. However, assessing agreement is often not possible for computer-learned features owing to a lack of an appropriate biological correlate. Assessing bias or the relationship between the measurement and the true value of the feature being measured is generally only possible with phantoms and digital reference images.

Repeatability and reproducibility can be used as screening criteria to immediately eliminate features with poor technical validity from further consideration for inclusion in the model. Filtering out features in this manner has been shown to improve the level of power in settings with large numbers of features, of which only a small proportion are associated with the outcome of interest65. Such filtering must be done solely on the basis of technical validity and must not use outcome data that will also be used to assess performance of the model under development (criterion 9).
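As a sketch of how such screening might be implemented, the intraclass correlation coefficient ICC(2,1) can be estimated for each feature from a test-retest study and used to discard poorly repeatable features before any outcome data are examined. The design (one subjects-by-repeats matrix per feature) and the threshold of 0.8 below are illustrative assumptions, not prescriptive values.

```python
import numpy as np

def icc_2_1(measurements: np.ndarray) -> float:
    """Two-way random-effects, single-measurement ICC for an (n subjects) x (k repeats) matrix."""
    n, k = measurements.shape
    grand = measurements.mean()
    row_means = measurements.mean(axis=1, keepdims=True)   # per-subject means
    col_means = measurements.mean(axis=0, keepdims=True)   # per-repeat (or per-operator) means
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)
    mse = ((measurements - row_means - col_means + grand) ** 2).sum() / ((n - 1) * (k - 1))
    return float((msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n))

def repeatable_features(repeat_data: dict, threshold: float = 0.8) -> list:
    """Retain only features whose test-retest ICC meets the pre-specified threshold."""
    return [name for name, m in repeat_data.items() if icc_2_1(np.asarray(m)) >= threshold]
```

Crucially, no outcome data enter this filtering step, which keeps it compatible with the separation required for later model validation.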

Technical validity criteria are much less well developed for computer-learned features such as those described by Li et al.54; their methodology produced 16,384 dimensional descriptors arranged in 128 × 64 × 2 arrays, for which applying the technical validity assessment methods described above is clearly not feasible. Regardless of the type of feature used, researchers are encouraged to assess the technical validity of the output of any radiomic models based on these features (criterion 12).

Criterion 6: feature measurement correction for technical artefacts

Technical artefacts, namely the effects of factors related to variables such as imaging centre, operator and/or device configurations on the distributions of the feature measurements, can potentially confound the results of subsequent radiomic analyses. For example, a feature with no association with survival might seem to predict outcome if patients who undergo imaging in one location have substantially better outcomes than those undergoing imaging in another centre and if the median feature measurement differs between the two sites owing to variations in image acquisition and processing. Thus, procedures designed to correct the variations in feature measurements created by such factors should be established prior to the development and validation of a radiomic model.

In addition to the image normalization methods described previously (criterion 3), the feature measurements themselves can also be standardized following extraction. These measurements can be normalized relative to a reference set of measurements66 or according to a harmonization model67,68, similar to the approaches used in other omics settings. Features strongly associated with variation from these technical artefacts might then be removed from consideration before model construction69,70.
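A simplified location–scale adjustment of this kind is sketched below: a hypothetical Python example that standardizes each feature within each imaging site and then restores the pooled location and scale. It is not a full empirical-Bayes harmonization method such as ComBat, only an illustration of the principle.

```python
import numpy as np
import pandas as pd

def site_adjust(features: pd.DataFrame, site: pd.Series) -> pd.DataFrame:
    """Remove site-level shifts in mean and spread from each feature column."""
    pooled_mean, pooled_std = features.mean(), features.std(ddof=0)
    adjusted = features.copy()
    for s in site.unique():
        idx = site == s
        site_mean, site_std = features.loc[idx].mean(), features.loc[idx].std(ddof=0)
        adjusted.loc[idx] = (features.loc[idx] - site_mean) / site_std
    return adjusted * pooled_std + pooled_mean
```

Outcome data should play no part in such an adjustment, so that the correction cannot leak information into subsequent model development and validation.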

Model development and validation

Patient-level data, including images, outcomes, standard clinical variables, measurements of in vitro biomarkers and other relevant data, should be obtained from the target population; these data can be obtained prospectively or retrospectively from already completed studies, imaging repositories or health-care databases (criterion 7). A radiomic model should be developed using appropriate statistical or machine learning techniques incorporating safeguards designed to avoid overfitting (criterion 8). The performance of a model in predicting an end point of interest must be shown to be adequately robust using proper model validation techniques (criterion 9). By the end of its development, all aspects of the radiomic test, including the feature preprocessing steps, mechanisms of imputing missing data, the underlying computational procedures, any cut points in the feature measurements themselves and/or the model outputs, must be fully specified (criterion 10). Each possible output value of the test is then linked to an unambiguous interpretation with regard to clinical care (criterion 11) and the reproducibility of these outputs should be shown to be sufficiently strong (criterion 12). Processes designed to address drift in the performance of the radiomic test, which refers to changes arising from factors such as the evolution of image acquisition and processing protocols and feature extraction procedures over time, software upgrades and obsolescence, and replacement of devices with newer models, should be established, including monitoring processes and procedures to perform further technical validation and model adjustment as necessary (criterion 13).

Criterion 7: imaging, outcome and other relevant data from the target population

Data on the performance of radiomic analyses can be acquired prospectively, most often as part of a clearly stated secondary objective in a phase II or phase III trial involving the target population, with standard operating procedures for image acquisition and processing at the desired time points and a feature extraction protocol, guided by the points described previously, written into the protocol. Alternatively, data can be acquired retrospectively from imaging data repositories, health-care databases, or datasets from completed clinical trials, subject to inclusion and/or exclusion criteria involving image acquisition and processing protocols, image quality, and the availability of images acquired at the relevant time points. For example, The Cancer Genome Atlas Breast Imaging Research Group identified patients from The Cancer Imaging Archive repository71,72 for whom gene expression analysis and pretreatment standard-of-care breast MRIs obtained with 1.5 Tesla GE Medical Systems devices were available17,18,73. Any clinical data to be obtained should be matched with the images via unique patient ID numbers.

Sample sizes should be determined according to factors such as the number of events (patients with disease versus without, or observed number of deaths), the type of model to be fitted to the data, the expected strength of the relationship between the features and the outcome, the desired standard error of the performance metric, the variance of the model outputs and their concordance with observed event probabilities74,75,76. Logistic and Cox regression models constructed using data from too few patients often have lower performance relative to models constructed using larger sample sizes60. Deep learning classifiers can require data from thousands of patients per class owing to their complexity (in preprint77); however, dataset sizes can be reduced with the use of transfer learning through feature extraction or fine-tuning methods78. Smaller numbers of patients can be used for model fitting if the relationship between the features and the outcome is particularly strong. Notwithstanding, sample sizes are often constrained by the amount of data available from the completed studies, image repositories or databases from which they were acquired or, if the radiomic study is a secondary objective of a clinical trial assessing a therapeutic intervention, by the number of patients required to meet the primary objective, which will often be much smaller than what is needed for the radiomic analysis.
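As a crude illustration of the first consideration (the number of events), the sketch below applies the common rule of thumb of roughly ten events per candidate predictor to gauge how many features a regression model could plausibly support. This heuristic is an assumption for illustration only; formal sample size calculations, such as those in the references cited above, account for additional factors and are preferable.

```python
def max_candidate_features(n_patients: int, event_rate: float, events_per_variable: int = 10) -> int:
    """Rough upper bound on candidate predictors under an events-per-variable heuristic."""
    events = min(n_patients * event_rate, n_patients * (1 - event_rate))  # size of the limiting class
    return int(events // events_per_variable)

# For example, 400 patients with a 25% event rate support roughly 10 candidate features.
print(max_candidate_features(n_patients=400, event_rate=0.25))
```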

Ideally, prospective studies should involve multiple centres and retrospectively acquired data should be obtained from multiple studies or repositories and then combined. Using multiple imaging centres, as opposed to a single one, not only facilitates more rapid accrual of data and accumulation of a sufficient number of patients for reliable statistical modelling and validation but can also result in the acquisition of data from a broader population. However, this approach comes with the risk of introducing technical artefacts into the data that will need to be corrected prior to model development and validation (criterion 6).

Criterion 8: development of the radiomic model with guards against overfitting

The range of model-fitting techniques proposed in the statistical and machine learning literature has been described in detail elsewhere79. The literature suggests that no single model-fitting technique is uniformly superior to any other80 although, regardless of the approach used, care should always be taken to avoid overfitting, that is, fitting an overly complex model to noise in the data and thus producing a model that predicts poorly when applied to completely new data. Overfitting risk is high when using more complex models, such as those based on neural networks81 or non-parametric regression, as opposed to simpler ones such as those based on logistic or Cox regression. These simpler models have also been shown to often perform as well as, if not better than, their more complex counterparts, especially when the number of variables is large and the underlying relationship between the radiomic features and the end point is neither strong nor complex45,60,82,83.

Inclusion of too many features in the model, which can be viewed as another form of model complexity, is another common cause of overfitting. Models based on high-dimensional data, such as those typically encountered in radiomic settings, are particularly prone to this issue. Eliminating any features with subpar levels of technical validity (such as poor reproducibility) or those associated with batch processing before any formal model development takes place (criteria 5 and 6) might reduce the likelihood of overfitting as will the use of variable selection techniques. Several authors have described common variable selection techniques in greater detail elsewhere79,84. Of note, many of these techniques require the selection of a tuning parameter controlling the stringency of the inclusion criteria for variables in the model (such as a P value below which univariate associations between individual features and the outcome must lie to be included, the number of variables to be included, or regularization parameters in LASSO regression techniques85). The optimal tuning parameter value is typically identified using the data (see criterion 9).
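The sketch below illustrates one such approach: an L1-penalized (LASSO) logistic regression in which the regularization strength is chosen by internal cross-validation, so that only a subset of candidate features retains non-zero coefficients. This is a hypothetical scikit-learn example; the simulated data, penalty type and number of folds are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                               # 200 patients, 50 candidate features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=10, cv=5, scoring="roc_auc"),
)
model.fit(X, y)
n_selected = int((model[-1].coef_ != 0).sum())
print(f"{n_selected} of 50 candidate features retained by the penalty")
```

Because the tuning parameter is chosen on the same data used to fit the model, the performance of the resulting model still needs to be estimated on data held out from this entire procedure (criterion 9).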

Criterion 9: model validation

Once a model has been developed, with mitigation against possible overfitting, the model should then be shown to be capable of predicting an end point of interest, be it a clinical event or state or a biological characteristic, with a sufficient level of accuracy. Robust model performance does not necessarily imply usefulness in guiding medical decision-making; for example, as mentioned previously, a radiomic test with a high level of diagnostic accuracy or a robust ability to predict treatment response or an end point of interest will not be clinically useful if the improvement in clinical performance is not substantial enough to justify its use over standard-of-care diagnostic workups. The broad principles described in this subsection, as well as those regarding lockdown, clinical validity and clinical utility in subsequent sections, apply to both more traditional human-engineered features and computer-learned features.

The area under the receiver operating characteristic curve86 (AUC) of the model outputs or their sensitivity and specificity can be used to quantify the ability of the model to discriminate patients with a specific health condition from those without. A metric related to the AUC is the c-index87, which quantifies the ability of the model to predict survival (the probability that among two randomly chosen patients, the one with the higher model output has the shorter survival time). Additionally, assessments of model performance should include calibration, namely the concordance between the predicted and observed probabilities of an event of interest88,89,90. Calibration curves, namely plots of the observed frequencies of the event versus predicted probabilities, are also used to examine whether the model predictions are consistently either too high or too low91. As emphasized during the discussion of criterion 1, the most appropriate performance metric will depend on the intended use of the radiomic test.
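For illustration, these metrics can be computed as in the sketch below. The tooling is an assumption (scikit-learn for the AUC and calibration curve, the lifelines package for the c-index), the data are simulated, and the negated output is passed to the c-index so that higher model outputs correspond to shorter predicted survival.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.calibration import calibration_curve
from lifelines.utils import concordance_index

rng = np.random.default_rng(1)
p_hat = rng.uniform(size=300)                                 # model-predicted event probabilities
y = rng.binomial(1, p_hat)                                    # simulated observed events

auc = roc_auc_score(y, p_hat)                                 # discrimination
obs_freq, pred_prob = calibration_curve(y, p_hat, n_bins=10)  # points for a calibration curve
surv_time = rng.exponential(scale=1.0 / (0.2 + p_hat))        # simulated survival times
c_index = concordance_index(surv_time, -p_hat)                # higher output -> shorter survival
```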

Ideally, model validation should be accomplished by applying the newly developed model, without any alterations to any aspect, to a completely external dataset that was not used in any part of the model development process. External data should be acquired from patients in the target population from whom imaging data were obtained under similar imaging, processing and feature extraction protocols to the data used in model development. Variations in imaging centre, operating personnel, scan acquisition date, and certain methods of imaging and feature extraction (such as device and software version) between the training and validation datasets might be permitted to enable evaluation of the robustness of the model to variability in these factors.

However, adequate external validation is not always performed, primarily owing to the logistical challenges associated with accessing data from an independent cohort. In our experience, the performance of the model is often assessed through internal validation, namely the use of a single dataset for both model development and evaluation. Internal validation involves carefully splitting or subsampling the data to avoid overlap with the data used to develop the model (the training set) and those used to evaluate the performance of the model (the validation set). Internal validation can provide reasonable estimates of the predictive accuracy of the radiomic model, although results obtained in this way might not necessarily be generalizable to completely new data. If model development and internal validation were performed on data that were obtained using obsolete image acquisition and processing protocols or that involved a cohort that was not completely representative of the entire target population (such as patients from a location at which a disproportionate percentage had a poor prognosis), then the results will reflect performance in this setting; performance might be diminished in other settings such as those with updated image acquisition and processing protocols12.

Internal validation methods include split-sample validation92, cross-validation93 or bootstrap validation94; these various techniques have been summarized in detail elsewhere79,95. Cross-validation is usually preferable to split-sample validation when only small sample sizes are available; the latter produces estimates of model performance that are often pessimistically biased (that is, estimates of model performance that are substantially lower than those obtained from external validation) when sample sizes are of about 200 or fewer individuals60,96.

Appropriate internal validation requires the maintenance of strict separation of data used to specify any aspect of the model from those used to evaluate its performance. Any violation of this strict separation results in overly optimistic estimates of the performance97,98. In this regard, full resubstitution, in which the entire dataset is used for both development and validation of the same model, provides the most egregious example. Partial cross-validation, in which the entire dataset is used to select features based on their significant univariate association with outcome followed by cross-validation of the model using only this restricted feature set, is another variant of this inappropriate approach to validation. In a comprehensive review of internal validation approaches, data from simulation studies are presented indicating that, even in a scenario in which the variables have no relationship with an outcome, inappropriate internal validation techniques can still produce an AUC estimate of 0.7–0.8 (ref.97).

The selection of tuning parameters during model development (criterion 8) is yet another stage at which problems in model validation can arise. Often, for each candidate from a list of tuning parameter values, the model is fitted using the training set and then applied to the validation set to obtain a performance metric estimate. The tuning parameter value associated with the optimal performance metric estimate is then identified and this metric estimate is then reported. However, in this approach, some aspects of the model development (the identification of the tuning parameter) took place on data used to estimate the performance metric. Such approaches can lead to biased estimates of the performance metric98. Appropriate validation techniques for use when tuning parameter selection is also involved include a three-way split of the data into training, validation and test sets (the training and validation sets are used to identify the tuning parameter and fully specify the model, which is then applied to the test set to obtain a performance metric estimate)92 or nested cross-validation98.
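A minimal nested cross-validation sketch is shown below, a hypothetical scikit-learn example with simulated, uninformative features: the inner loop chooses the penalty strength while the outer loop estimates performance on folds never used for tuning or feature selection, so the estimated AUC stays close to 0.5 rather than being inflated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 100))                 # 100 candidate features, none truly informative
y = rng.binomial(1, 0.5, size=150)

inner = make_pipeline(                          # inner CV (inside LogisticRegressionCV) tunes the penalty
    StandardScaler(),
    LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=10, cv=5, scoring="roc_auc"),
)
outer_auc = cross_val_score(inner, X, y, cv=5, scoring="roc_auc")  # outer CV estimates performance
print(round(outer_auc.mean(), 2))               # expected to be near 0.5 for uninformative features
```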

Criterion 10: radiomic test lockdown

Once the model has been developed and shown to have reasonable predictive accuracy, all components of the test, as described in the Introduction of this Review, should be locked down. In radiomics, procedures for measurement will include both standard operating procedures for image acquisition and processing (criterion 3) as well as those for feature extraction (criterion 4) and calculation of model output. Outputs are then associated with specific clinical interpretations (criterion 11).

All computational aspects of the model (for example, the mathematical expression, including regression coefficients, weightings, cutoffs and any other parameters) should be locked down to the greatest extent possible. In situations in which concise model descriptions are not feasible, such as for those based on deep learning, the underlying computational algorithm and software platform should be closed to further changes and any crucial components, such as the random number generator seeds used to generate the model or the output, should be fixed. Interpretations of the inputs of the model (for example, the variables included in a logistic or Cox regression model involving human-engineered features) are often of interest to researchers as they can provide insights into the degree of importance of each feature in predicting an outcome. For computationally derived model inputs, such as features obtained using deep learning algorithms, methods to aid interpretability include visualizing the latent space discovered through the learning process, post hoc highlighting of the regions of the input images that the model labelled as important and visualization of features from different filters in the convolutional neural network99.

The locked-down model could still be affected by any remaining biases inherent to the data on which it was fitted and validated (such as technical artefacts and distributions of radiomic feature values and outcomes that differ substantially from those of the target population). Allowing the model to evolve over time as new data become available will alleviate some of these effects (criterion 13).

Criterion 11: interpretation of test outputs

Models based on techniques such as support vector machines will produce outputs consisting of discrete categories78, each of which can be linked to a specific clinical interpretation and decision. However, models constructed via most other techniques will produce a quantitative output such as the predicted probability of a specific event of interest. Binning these continuous outputs into a limited number of discrete categories might be desired for the purposes of interpretation and clinical decision-making. For example, a test output value that falls below a prescribed cutoff value might indicate a good prognosis and that additional treatment will not be needed and/or that the likelihood of a response to a treatment is high. Alternatively, a test output value above a prescribed cutoff could indicate a high risk of mortality and that the patient might survive longer on an alternative regimen.

Sometimes, these cutoffs are set arbitrarily to specific quantiles, such as the median, in order to define high-risk versus low-risk groups; however, this approach ignores associations with clinical outcomes. Cutoff optimization and comparisons of the outcomes of patients in each category defined by the cutoffs should be done on separate datasets so as not to violate the principle of separation of data used for model development from those used for validation. When cutoff optimization and outcome comparisons are done using the same data (for example, by applying various cutoffs to a dataset, computing the log-rank test P values of the resulting groups and choosing the cutoffs associated with the lowest P values), the risk of a type I error is increased100. To ensure the test can be applied to one patient at a time, cutoff values should be specified as absolute values rather than as percentiles that would need to be recalculated on the availability of new data.

Analytical approaches that consider the consequences of specific treatment decisions based on the test output have also been proposed as a method for cutoff selection. These methods aim to balance the risks (adverse consequences) of incorrect test results against the benefits (positive consequences) of correct test results. The risk–benefit balance can then be compared to that of the standard-of-care approach for the specific clinical indication or any other competing tests or to the use of no test at all. Such approaches include the decision curve analysis method101. This methodology has been applied to a radiomic study involving features obtained from preoperative CT images in conjunction with images from intraoperative frozen sections and clinical data to differentiate invasive lung adenocarcinomas from preinvasive lesions or minimally invasive adenocarcinomas102.
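The core quantity in decision curve analysis is the net benefit at a given probability threshold, sketched below in a generic Python illustration of the published formula; the simulated data and threshold grid are arbitrary assumptions.

```python
import numpy as np

def net_benefit(y: np.ndarray, p_hat: np.ndarray, threshold: float) -> float:
    """Net benefit of acting when the predicted probability meets or exceeds `threshold`."""
    act = p_hat >= threshold
    n = len(y)
    true_pos = np.sum(act & (y == 1)) / n
    false_pos = np.sum(act & (y == 0)) / n
    return true_pos - false_pos * threshold / (1 - threshold)

rng = np.random.default_rng(3)
p_hat = rng.uniform(size=500)                  # hypothetical test outputs (event probabilities)
y = rng.binomial(1, p_hat)                     # simulated outcomes
thresholds = np.linspace(0.05, 0.50, 10)
curve = [net_benefit(y, p_hat, t) for t in thresholds]  # compare with 'treat all' and 'treat none'
```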

Criterion 12: test output reproducibility

The reproducibility of the test outputs should be shown to be sufficiently robust to ensure that the radiomic test will produce similar results regardless of where it is performed or by whom. One approach involves having patients undergo repeat scans using an established standard operating procedure without interventions in between. Multiple operators, possibly at different imaging centres, would then independently apply the feature extraction standard operating procedure to the repeat scans, after which the model or algorithm underlying the test is applied to the resulting images and feature data. Reproducibility metrics for individual features can also be considered at this stage.

This assessment of reproducibility encompasses variability potentially owing to all aspects of image acquisition and processing, feature extraction, and application of the model; however, this approach is rarely feasible in practice for reasons that include a lack of availability of repeat imaging in many scenarios and the unwillingness of many patients to undergo multiple scans within a short space of time. Alternatively, both the feature extraction process and the model can be applied repeatedly to the same set of images, possibly by different operators at different locations. This approach can be applied to retrospectively acquired data but can only produce an assessment of reproducibility that encompasses variability owing to feature extraction and application of the model (and not factors that influence raw data acquisition). If estimates of the repeatability and reproducibility of individual features are known, error propagation models and simulation approaches can be used to estimate the reproducibility of the test output60.
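The sketch below illustrates the last of these approaches with a hypothetical example: a locked-down logistic model with assumed coefficients, and per-feature measurement error standard deviations presumed to come from a test-retest study, propagated by Monte Carlo simulation to approximate the spread of the test output for a single patient.

```python
import numpy as np

def output_variability(feature_means, feature_sds, coefs, intercept, n_rep=10_000, seed=0):
    """Propagate per-feature measurement error through a fixed logistic model."""
    rng = np.random.default_rng(seed)
    draws = rng.normal(feature_means, feature_sds, size=(n_rep, len(feature_means)))
    outputs = 1.0 / (1.0 + np.exp(-(intercept + draws @ np.asarray(coefs))))
    return outputs.mean(), outputs.std()

# Hypothetical patient: two features with repeatability SDs of 0.15 and 0.30
mean_out, sd_out = output_variability([1.2, -0.4], [0.15, 0.30], coefs=[0.8, -1.1], intercept=-0.2)
```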

Criterion 13: processes to address data and radiomic test drift

The computational procedures underlying most radiomic tests are likely to evolve over time. Imaging hardware and computational software are likely to be upgraded. Furthermore, the model itself could change after fitting to new data102. Monitoring for such changes in a way that enables their effects to be assessed should be in place. Certain changes might also require a return to previous steps in model development and validation. Changes not related to drift, such as application of the test in a different patient population or indication or the addition of new features, should necessitate a return to model development and validation and might also require the re-establishment of standard operating procedures for feature extraction with re-assessments of the technical validity of individual features.

Assessments of technical and clinical validity and clinical utility (criteria 14 and 15) should be performed periodically for tests for which the underlying computational procedure is expected to evolve over time. Changes to the standard operating procedures for image acquisition and processing or upgrades to the feature extraction software should be followed by assessments of the level of agreement between feature measurements obtained under the previous and the new versions. Researchers should proceed with the new versions of the standard operating procedures and software platforms based on the degree of agreement; however, empirical guidelines on what constitutes a sufficiently strong level of agreement are not available and are probably dependent on both the feature itself and the context. If this agreement in feature measurements is inadequate, the level of concordance between the test outputs computed using the two versions can be assessed (for example, by demonstrating that the mean squared difference of the outputs from the two versions is lower than some meaningful threshold). High concordance between the test outputs indicates that the two versions produce similar results and that the new one could therefore safely replace the previous one, although poor concordance might also reflect the superior clinical performance of the new version. In some scenarios, the model might need to be refitted; such changes can alter the significance and sometimes even the direction of the association of the features with an outcome of interest7.
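Simple agreement summaries of this kind are sketched below in a generic Python example; the mean squared difference and Lin's concordance correlation coefficient are two of many possible choices, and what counts as acceptable agreement remains context dependent.

```python
import numpy as np

def mean_squared_difference(old: np.ndarray, new: np.ndarray) -> float:
    """Average squared disagreement between outputs from the previous and new versions."""
    return float(np.mean((old - new) ** 2))

def concordance_ccc(old: np.ndarray, new: np.ndarray) -> float:
    """Lin's concordance correlation coefficient between the two sets of test outputs."""
    cov = np.cov(old, new, ddof=0)[0, 1]
    return float(2 * cov / (old.var() + new.var() + (old.mean() - new.mean()) ** 2))
```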

Justifying use in clinical care

Robust performance of the underlying model in predicting an end point of interest does not automatically mean that the test will be clinically useful (meaning that acting upon the results of the radiomic test leads to patient benefit via improved outcomes or quality of life, reduction in toxicity, invasiveness, risk of complications, or financial burden, or the avoidance of ineffective or unnecessary treatments). After the radiomic model has been validated and the test has been locked down, its clinical validity, namely the ability of the outputs to provide information regarding the presence or absence of a condition or the risk of an event of interest103 (for example, sensitivity, specificity, or positive and negative predictive values in detecting disease or the proportion of patients classified as low-risk who remain progression free at 5 years), should be assessed in the context of its intended use and clinical setting (criterion 14). The clinical utility of the test should then be assessed using a prospective study or an appropriately designed prospective–retrospective study, in which the performance of the test in its intended clinical setting is directly assessed (criterion 15) and the risk–benefit balance for the patient when acting upon the results of the radiomic test is shown to be sufficiently favourable to justify use in clinical care (criterion 16). Note that such scenarios often do not reflect the standalone performance of the radiomic model but rather how the test influences the end user (for example, the clinician) when making clinical decisions with and without the test results, an approach often used in assessments of CAD systems104,105.

Criterion 14: clinical validity of the test

Clinical validation goes beyond model validation (criterion 9) in that the former involves the evaluation of model performance with greater specificity to the clinical setting and intended use. For example, model validation of a prognostic radiomic test might involve showing that the level of concordance between overall survival and model outputs is above some pre-specified and meaningful threshold. Clinical validation, meanwhile, might involve demonstrating that patients who have been classified in a low-risk category have a very high (>90%) 5-year PFS on a well-tolerated standard therapy regimen whereas those in other risk categories have substantially worse outcomes. Such a finding would suggest that patients in the low-risk category might consider forgoing additional highly invasive or toxic treatments. Alternatively, showing clinical validity of a prognostic radiomic test might entail demonstrating that the association between test output and clinical outcome remains statistically significant even after adjusting for standard clinical or pathological variables with known prognostic value. The robustness of such a finding to the effects of potential confounders, such as variations in the operator of the feature extraction or the imaging centre in which the extraction and test were performed, should also be established. Different approaches have been summarized in detail elsewhere21.

The radiomic test should be fully locked down and the data used to determine clinical validity should be independent from any data used in model development and validation. Such data could come from prospective clinical trials. For example, to estimate the 5-year PFS of patients with low-risk disease according to the radiomic test, such a cohort could receive standard-of-care therapy and comparisons of the outcomes of the different risk groups could be made after 5 years of follow-up monitoring. Alternatively, data might also be acquired retrospectively from completed clinical trials or imaging data repositories such as The Cancer Imaging Archive71 or the sequestered commons from Medical Imaging and Data Resource Center106, from which testing data can be drawn based on the clinical question and population of interest. Again, this approach assumes that imaging data for a sufficient number of patients from the target population were acquired using protocols similar to the previously established standard operating procedures (criterion 3).

Criterion 15: direct evaluation of performance of the test in its clinical use

The optimal design, end points and statistical analyses to assess the benefits of using a radiomic test to guide clinical disease management differ widely depending on the intended use of the test21. For example, for a radiomic test expected to outperform an in vitro prognostic assay currently in widespread use, patients whose treatment decisions were based on the radiomic test should be shown to have substantially improved outcomes compared to those of patients for whom clinical care was dictated by the in vitro assay.

Prospective studies have numerous desirable qualities, including enabling researchers to have full control over the features to measure, image acquisition and processing, the study design, and sample size. However, such studies are likely to be time consuming and costly, particularly for disease settings with already favourable outcomes that require a large sample size and/or lengthy follow-up duration to observe sufficient events (such as death, disease recurrence or progression) for adequate statistical power. Prospective–retrospective studies can reduce or even eliminate many of the delays and costs associated with image acquisition and follow-up assessments107. For prospective–retrospective studies, data from standard-of-care images, clinical outcomes and other data, such as standard clinical variables, are acquired from patients in completed clinical trials that satisfy the appropriate inclusion and/or exclusion criteria regarding the patient population, treatment approach, image acquisition and processing specifications, and availability of the necessary images. Both the feature extraction and the test are applied prospectively. Similar to a prospective study, the radiomic test, the statistical analysis plan, sample size, level of power, and the inclusion and exclusion criteria should be fully specified in a protocol before the initiation of a prospective–retrospective study. Criteria for establishing clinical utility through prospective–retrospective studies for other omics approaches have already been published12,107. These criteria include the stipulation that two such studies must produce similar results, an approach that can also be adapted for radiomic tests. In silico clinical trials using patient-specific models to develop a simulated cohort might provide an alternative approach108, although these simulated patients might not entirely reflect the complexities of real-life patients.

Criterion 16: benefit versus risk balance from use of the radiomic test

The benefit–risk balance associated with use of a radiomic test will encompass not only the risks and benefits associated with performing the test but also those associated with the clinical decisions directed by the test results. If the intended use of a test is to choose a therapy that provides superior clinical outcomes compared with other available options, then the improvement in clinical outcome should not only be statistically significant but also large enough in magnitude to justify use of the radiomic test. Alternatively, a favourable benefit–risk balance might emerge when use of the radiomic test leads to non-inferior outcomes while being associated with reduced risks, including those inherent in the standard testing procedure, or if the toxicities of unnecessary or ineffective treatment can be avoided. For example, even if the radiomic test leads to treatment decisions that are similar to those based on standard diagnostic workups, the former might nevertheless have clinical utility if the information it provides enables patients to undergo fewer subsequent scans or biopsies while still leading to similar outcomes.

Finally, a radiomic test does not have clinical utility if it separates patients into groups for which the outcomes are statistically different but the recommended clinical management would be the same. Even if one patient group has inferior outcomes on standard therapy, the separation will only be clinically useful if an alternative treatment that is more effective for that group is available.

Conclusions

The 16 recommended criteria provided herein aim to guide the translation of radiomic tests into clinically useful tools and are expected to be relevant across a range of imaging modalities and scenarios. Many of these recommendations share common themes with other published guidelines for radiomics; adherence to these recommendations addresses many components of the radiomic quality score4, for example.

The statistical considerations regarding model development and validation and the design of studies for the assessment of clinical utility have numerous parallels with those for in vitro tests12,109. Several components of our recommendations are based on these sources. However, some important and consequential differences specific to radiomics also merit consideration. Radiomic approaches increasingly utilize multiple machine learning and deep learning methods, which introduces new issues regarding standard operating procedures for feature extraction, test lockdown, machine learning interpretability, correlations with biology, regulatory considerations and assessments of analytical validity. These criteria are likely to further evolve in the future as researchers become aware of additional issues and as more radiomic models become locked down, validated and evaluated for clinical utility. We emphasize that these recommendations pertain to the conduct and analysis of radiomic studies and are not intended as reporting guidelines for radiomic and CAD studies in the vein of REMARK for tumour prognostic studies110 or other reporting guidelines catalogued by the EQUATOR project111. However, some of these recommendations are expected to serve as the basis of such radiomic-specific reporting guidelines.

Radiomics is increasingly likely to involve full machine learning-based image analysis such as deep learning-based features or the application of artificial intelligence and machine learning algorithms directly to voxel-level data. Such a transformation, as mentioned before, is expected to eliminate much of the variability created by human error and improve model performance in many scenarios, although it will also benefit from integration with clinical information to better personalize the test result to each patient. For example, this type of test might be used not only to detect cancer but also to do so in the presence of additional comorbidities (for example, examining a renal finding in the presence of diabetes mellitus, chronic inflammatory processes and/or hypertension). The increased availability of different types of data should facilitate these types of improvements.