Imaging biomarkers (IBs) are integral to the routine management of patients with cancer. IBs used daily in oncology include clinical TNM stage, objective response and left ventricular ejection fraction. Other CT, MRI, PET and ultrasonography biomarkers are used extensively in cancer research and drug development. New IBs need to be established either as useful tools for testing research hypotheses in clinical trials and research studies, or as clinical decision-making tools for use in healthcare, by crossing 'translational gaps' through validation and qualification. Important differences exist between IBs and biospecimen-derived biomarkers and, therefore, the development of IBs requires a tailored 'roadmap'. Recognizing this need, Cancer Research UK (CRUK) and the European Organisation for Research and Treatment of Cancer (EORTC) assembled experts to review, debate and summarize the challenges of IB validation and qualification. This consensus group has produced 14 key recommendations for accelerating the clinical translation of IBs, which highlight the role of parallel (rather than sequential) tracks of technical (assay) validation, biological/clinical validation and assessment of cost-effectiveness; the need for IB standardization and accreditation systems; the need to continually revisit IB precision; an alternative framework for biological/clinical validation of IBs; and the essential requirements for multicentre studies to qualify IBs for clinical use.
A biomarker is a “defined characteristic that is measured as an indicator of normal biological processes, pathogenic processes or responses to an exposure or intervention, including therapeutic interventions” (Refs 1,2). The current FDA–NIH Biomarker Working Group definition — adopted in this consensus statement — states explicitly that “molecular, histologic, radiographic or physiologic characteristics are examples of biomarkers” (Ref. 2). This approach seeks to clarify inconsistency in terminology, because some previous definitions have restricted the scope of biomarkers to describing biological molecules. Such narrow definitions regard values obtained from imaging and other techniques as measurements of an underlying biomarker, rather than being biomarkers themselves. However, the current FDA–NIH definition takes a broader view. Earlier definitions also stated that biomarkers should have 'putative' diagnostic or prognostic use3, although this requirement is no longer specified by FDA/NIH. Biomarkers must be measured, but can be numerical (quantitative) or categorical (either a quantitative value or qualitative data; for definitions, see Supplementary information S1 (table)).
The use of both imaging biomarkers (IBs) and biospecimen-derived biomarkers is widespread in oncology. In healthcare settings, biomarker uses include screening for disease; diagnosing and staging cancer; targeting surgical and radiotherapy treatments; guiding patient stratification; and predicting and monitoring therapeutic efficacy, and/or toxicity4. In research, biomarkers guide the development of investigational drugs as they progress along the pharmacological audit trail5, in which they can indicate the presence of drug targets, target inhibition, biochemical pathway modulation or pathophysiological alteration by investigational drugs; drug therapeutic efficacy in specific groups of patients; and tracking of drug resistance6. The use of biomarkers has led to the identification of potentially successful drugs early in the developmental pipeline, thereby accelerating market approval for some therapies and enabling drug developers to reduce overall costs by identifying ineffective or toxic compounds at the earliest opportunity5.
Despite some biomarkers being used extensively and others showing great potential7,8, a surprisingly limited number of biomarkers guide clinical decisions9,10,11. Some putative cancer biomarkers are not adopted because they do not measure a relevant biological feature nor enable disease diagnosis or outcome prediction. In such cases, the biomarker is appropriately devalidated12. Many other promising biomarkers, however, are neither devalidated nor qualified for use in research or healthcare settings and, instead, are confined in the academic literature without real application owing to a lack of efficient and effective strategies for biomarker translation13.
All biomarkers, including IBs, must cross two 'translational gaps' before they can be used to guide clinical decisions14,15 (Fig. 1). Biomarkers that can reliably be used to test medical hypotheses cross the first gap becoming useful 'medical research tools'; if the biomarker crosses the second gap then it becomes a 'clinical decision- making tool'. Some biomarkers that have only crossed the first translational gap are nevertheless highly useful in the development of therapies5,13.
Several publications have described strategies for developing and evaluating cancer biomarkers, focusing mainly on biospecimen-derived biomarkers — that is, those derived from patient tissue or biofluids4,16,17,18,19,20,21. The processes of initial discovery, validation and qualification share many similarities for IBs and biospecimen-derived biomarkers; however, substantial inherent differences exist between both biomarker types (Supplementary information S2 (table)). The FDA and NIH have recognized this distinction and have outlined specific recommendations for image acquisition and analysis in IB development22,23. Questions of how IB acquisition and analysis should be standardized, and how terminology should be harmonized have been addressed by numerous academic, clinical, industrial and regulatory groups. These groups include the FDA2,24, the US National Cancer Institute (NCI) through the Quantitative Imaging Network (QIN)25 and the Cancer Imaging Program phase I and II Imaging trials initiative26, the Quantitative Imaging Biomarkers Alliance (QIBA)27,28, the American College of Radiology Imaging Network (ACRIN)29, the European Society of Radiology (ESR)30, the European Organisation for Research and Treatment of Cancer (EORTC) through the QuIC-ConCePT consortium13,31, the European Association of Nuclear Medicine (EANM)32, the International Society for Magnetic Resonance in Medicine (ISMRM)33 and Cancer Research UK (CRUK)34. Their efforts have produced consensus guidelines for the acquisition and analysis of several IBs33,34,35,36,37. Many of these organizations also have highlighted the need for a detailed validation and qualification roadmap to improve IB translation38,39. Recognizing this need, representatives from CRUK and the EORTC, together with other assembled experts in radiology, cancer imaging sciences, oncology, biomarker development, biostatistics and health economics, have formulated an Imaging Biomarker Roadmap for Cancer Studies (Fig. 2). In this Consensus Statement, we outline this roadmap and identify specific considerations for IB validation and qualification, providing illustrative examples of IBs in various stages of development. From this framework, we provide 14 recommendations to accelerate the successful clinical translation of effective IBs, as well as the devalidation40,41 of IBs that lack utility.
Current uses of imaging biomarkers
An IB is a measurement derived from one or more medical images. Many IBs are used routinely in healthcare (Table 1). IBs provide readily available, cost-effective, non-invasive tools for screening, detecting tumours and serial monitoring of patients, including assessments of response to therapy and identification of therapeutic complications. IBs can enable tracking of a particular tumour repeatedly over time, can map the spatial heterogeneity within tumours, and can evaluate multiple different lesions independently within an individual.
Applications for IBs include the American College of Radiology Breast Imaging-Reporting and Data System (ACR BI-RADS) mammographic breast morphology42; clinical tumour, node, metastasis (TNM) stage43 (Box 1); objective response as defined in RECIST version 1.0 (Ref. 44) or 1.1 (Ref. 45) criteria (Box 2); bone mineral density T-score measured by dual energy X-ray absorptiometry46; and left ventricular ejection fraction47 (LVEF; Box 3). These IBs underpin patient care worldwide, although they are subject to continuing research to improve their performance. In addition, IBs could achieve companion diagnostic status, as seen with the IB [99mTc]-etarfolatide folate receptor positivity48 (FR+; Box 4), demonstrating that imaging can be necessary to guide the choice of therapy for individual patients.
IBs frequently add value in cancer research (Table 1). The use of IBs can enable the measurement of patient response to treatment before a survival benefit is observed, which can subsequently lead to early regulatory approval of new drugs49 (Box 2). IBs can indicate the presence of drug targets and target inhibition, for example by proving receptor occupancy50 (Supplementary information S3 (box)). IBs have the unique potential to provide serial non-invasive mapping of tumour status during treatment. For example, absolute values of 18F-FDG-PET maximum standardized uptake value (SUVmax) at baseline, or changes in this value observed early during treatment have been used for proof-of-mechanism and proof-of-principle51 in drug development (Box 5), or to demonstrate nonspecific responses to treatment52 (Box 5). Dynamic contrast-enhanced (DCE) CT or MRI-derived Ktrans (Ref. 53) (Box 6) and dynamic contrast- enhanced ultrasonography area under the curve (AUC) values (Supplementary information S4 (box)) have been used as IBs of pharmacodynamic (PD) changes and response to treatment54. The use of IBs has led to improved margins of radiotherapy dose delivery55 (Box 5) and surgical margins56.
Many more IBs used to measure tumour anatomy, morphology, pathophysiology, metabolism or molecular profiles are being developed in order to study the hallmarks of cancer57. Between 2004–2014, approximately 10,000 studies reported new or established IBs (Supplementary information S5 (box)), including IBs derived from new modalities (such as photoacoustic imaging58) and new techniques (such as MRI dynamic nuclear polarization59; Supplementary information S6 (box)), and new analytical approaches (such as radiomic profiling of tumours to extract multiple features60,61; Supplementary information S7 (box)), contributing to this ever-increasing number of IBs.
IBs have four key attributes. First, they are a subset of all biomarkers. Second, they can be quantitative or qualitative. Quantitative IBs (measured on an interval or ratio scale28) are used in patient care, but other measurements that fall outside this definition (for example, the ACR BI-RADS category, clinical TNM stage, or objective response) are categorical measurements and are also important IBs. In this Consensus Statement, we have deliberately included all image measurements (quantitative or categorical) that satisfy the definition of an IB, according to the current FDA–NIH Biomarker Working Group definition2. Third, IBs are derived from imaging modalities, techniques or signals, but are distinct entities; for example, the change in median apparent diffusion coefficient (ADC) is a biomarker, and is distinct from the modality (MRI), the technique (diffusion-weighted imaging) or the measured signal (free induction decay) required to generate the IB (Supplementary information S8 (figure)). Fourth, one imaging measurement can support multiple distinct IBs. For example, 18F-FDG-PET has improved disease staging in patients with non-small-cell lung cancer, lymphoma or melanoma by facilitating the identification of nodal and distant metastases62. In this case, the IB is clinical TNM stage, which is defined using CT and PET data rather than using CT data alone (Box 1). Similarly, systems such as PERCIST (for all solid tumours)63 and the Deauville five point system (for lymphoma)64 aim to improve the assessment of response criteria by incorporating PET data, but produce an ordered categorical IB — namely objective response (Box 2). This approach is conceptually different from quantifying absolute values of 18F-FDG-PET data to derive putative cut-off points subsequently used to identify patients with poor prognosis65 or evidence of specific pathway modulation in a clinical trial of an investigational new drug66,67 (Box 5).
Many quantitative image measurements comprise continuous data, which must then be categorized to facilitate clinical decision-making. Clinicians often decide between two or more alternative treatment options for each patient, informed by whether a biomarker value is above or below a cut-off point. For example, cancer therapy-related cardiac dysfunction has been defined as LVEF reduction of ≥10 percentage points to a value below normal (53% for adults; Box 3). Clear guidelines detailing how scintigraphy or 2D echocardiography should be performed have been established for this IB of toxicity for use in healthcare47. In distinction, other healthcare-related IBs (such as clinical TNM stage and objective response) are measured as ordered categorical variables. In this case, the boundary between several categories is defined by cut-off points; alternatively, categories can be combined to create a single cut-off point (for example, to select, continue or stop therapy).
In research applications, data are often interpreted as a continuous rather than used in a categorized way. For example, in early phase trials, continuous variable PD biomarkers (including percentage change in tumour size68, 18F-FDG-PET SUVmax (Ref. 52), ultrasonography AUC69 and MRI median Ktrans (Ref. 53) or ADC70 (Boxes 5,6; Supplementary information S6 (box)) can indicate antitumour activity of therapeutic agents used at different doses and time points. These studies have led to the demonstration of proof-of-concept and proof-of-mechanism, the definition of pharmacokinetic (PK)–PD relationships and informed on dose selection for novel therapeutic approaches53.
Specific considerations for IBs
IBs and biospecimen-derived biomarkers differ in several important aspects, limiting the relevance to IBs of previous roadmaps (designed for biospecimen-derived biomarkers13,71; Supplementary information S2 (table)). The performance of imaging devices of different makes and models installed in different clinical centres can vary considerably. These devices are designed, approved, maintained, and operated to provide images38 that diagnostic radiologists, nuclear medicine physicians and other clinicians interpret, often with little need to quantify the data obtained. Innovation is largely driven by competition to improve image quality and the user interface; vendors and purchasers often have only secondary interest in how improvements in image quality will affect quantification or standardization of IBs. The measurement of many IBs requires the administration of tracers or contrast agents (for example, diagnostic drugs, usually investigational or off-label), the development and availability of which are uncertain. Many steps in IB validation require a new prospective clinical trial; whenever investigational tracers or contrast agents are involved, the high burden of regulation can cause significant delays in this process.
Many IBs do not purport, even in principle, to measure any underlying analyte, making traditional approaches to assay validation seen with biospecimen-derived biomarkers problematic. For example, dilution linearity and reagent stability — important performance characteristics for biospecimen-derived biomarker assessment12 — cannot be assessed for biomarkers based on analysing medical images, such as texture-based IBs. The measurement validity of such IBs depends substantially on events that occur while the patient is coupled to the imaging device. Thus, the role of central reading laboratories in data processing is minor in comparison with biospecimen-derived biomarkers4.
The imaging biomarker roadmap
Biomarkers cross the two translational gaps by passing through a series of domains (discovery, or domain 1; validation, or domain 2; and qualification with ongoing technical validation, or domain 3) that address different research questions. This process follows two parallel and complementary tracks: in one track, technical (assay) performance is examined by addressing whether the biomarker is measurable precisely and accurately, and whether it is widely available in all geographical territories. In the other track, biological and clinical performances are examined by addressing whether the biomarker can be used to measure a relevant aspect of biology or predict clinical outcome. In reality, no biomarker is perfectly validated. Instead, strategies must be defined to identify, mitigate and quantify the uncertainty, risk and cost associated with any given biomarker in making research or clinical decisions72,73,74. The nature of these activities and the sequence in which they are combined constitutes a 'biomarker roadmap'.
In biospecimen-derived biomarker roadmaps, technical (assay) validation commonly occurs early in order to produce a 'locked-down biomarker' — that is, the data acquisition and analysis pathways used to measure the biomarker are fixed. In many cases, this stage is followed rapidly by biological and clinical validation because the locked-down biomarker enables widespread evaluation across multiple sites4,18. For IBs, several key aspects of technical validation (such as multicentre reproducibility) must be addressed at a relatively late stage, unlike biospecimen-derived biomarkers. Studies might address both technical and biological validation concurrently, but progress down each track might be quite independent. As evidence for validation accumulates, a third track of cost-effectiveness must be considered, because IBs must not only demonstrate an association with health benefits, but also demonstrate 'value for money' when compared with the use of clinical information alone or with alternative biofluid-based in vitro diagnostics75.
Imaging biomarker discovery — domain 1
Most biospecimen-derived biomarkers are molecular features found in the genome, transcriptome, proteome or metabolome that can be chosen rationally to address unmet needs in cancer medicine. This selection approach has also inspired the rational design of several novel targeted nuclear medicine or optical imaging tracers76. Many of the most valuable IBs, however, have had a convoluted genesis through unanticipated discoveries in the physical sciences (chemistry, computing, engineering, mathematics or physics) being matched to unmet medical needs after initial discovery, and then developed further.
Technical validation — domains 2 and 3
Complete technical validation is achieved when an IB measurement can be performed in any geographical location, whenever needed, and give comparable data. Technical validation does not address whether the IB measures underlying biology, or relates to clinical outcome and/or utility, but it can have a major influence on subsequent attempts at qualification because IBs with limited availability or poor reproducibility cannot sensibly undergo further clinical evaluation in large multicentre trials.
Precision. Repeatability and reproducibility are related measures of assay precision. Repeatability refers to measurements performed multiple times in the same subject (in vitro and/or in vivo) using the same equipment, software and operators over a short timeframe, whereas reproducibility refers to measurements performed using different equipment, different software or operators, or at different sites and times77, either in the same or in different subjects.
Different intended applications require different levels of evidence for precision3. For example, single- centre repeatability might be sufficient to validate a PD or monitoring IB in an early phase clinical trial restricted to that site53. Conversely, screening, diagnostic, prognostic and predictive IBs with putative use in whole populations require evidence of reproducibility across multiple expert and non-expert centres before they can be considered technically valid. Such multicentre validation requires complex and costly studies13.
Repeatability analysis should be performed early in IB development. Repeatability estimated from studies using rodents can seem unfavourable because, in some models, tumours can grow considerably within a few hours78, but useful data can be gained in slow-growing rodent models. Definitive repeatability analysis is best assessed in studies with humans, and should be performed at each centre that evaluates the IB anew, because reliance on historical or literature values is a source of error. Test performance is known to vary between sites, and is influenced by scanner performance, and organ site studied (for example, precision is better in brain lesions79 compared with liver or bowel lesions80,81,82). Physiological variations between individuals (depending on factors such as caffeine, nicotine, alcohol or concomitant medication) can also alter IB values83.
Multicentre studies involve different research institutions that usually utilize devices supplied by different vendors. These devices are broadly equivalent for clinical radiology purposes, although they often have important hidden differences that affect IB acquisition and analysis (Supplementary information S2 (table)). These factors do not preclude multicentre technical assessment of IB precision, but values might not be as reproducible as those derived in few-centre or single-centre studies84 (Box 6; Supplementary information S4 (box)).
Unfortunately, despite the scientific benefits associated with double baseline studies, the time and financial cost of performing such studies often deters investigators and funders from incorporating repeatability or reproducibility evaluations into study protocols. For example, between 2002–2012, only 12 of 86 phase I/II studies of antivascular agents that incorporated DCE-MRI biomarkers, such as the volume transfer co-efficient Ktrans, measured test–retest performance53 (Box 6). Multisite reproducibility studies of IBs are very rare.
Bias. Technical bias describes the systematic difference between measurements of a parameter and its real value85. Few in-human IB studies report bias because real values might not be possible to ascertain in a clinical setting. For some IBs, bias can be estimated by comparison with reference phantoms, as is the case when validating biomarkers on the basis of CT Hounsfield units86, and MRI longitudinal (R1) and transverse (R2) relaxation rates87. Imaging phantoms, however, seldom fully represent technical performance in animals or humans. For other IBs, such as the effective transverse relaxation rate (R2*), diffusion anisotropy, or Ktrans (all measurable by MRI), appropriate calibration phantoms are not available or poorly represent living tissue.
Availability. IB assessments must be feasible, safe, and well-tolerated in the target population, and must have regulatory and ethical approval for use in humans. Differing regulatory approvals (for example, different gadolinium-based contrast agents have been approved by regulatory agencies in Europe, North America and Japan13) and commercial pressures (for example, ferucarbotran iron oxide nanoparticles are no longer commercially available in Europe or North America) can substantially affect availability. With regard to PET, some short half-life tracers (such as 15O-H2O) are highly informative, but require the presence of on-site cyclotron facilities, making worldwide translation impossible with currently available technologies, for economic reasons. Finally, specialist techniques in all modalities require advanced radiographer and technician support, which might not be available locally.
Biological/clinical validation — domain 2
The terms 'biological validation', 'clinical validation' and 'clinical utility' describe the stepwise linking of biomarkers to tumour biology, outcome variables and value in guiding decision-making, respectively. Clinical validation requires demonstration that the biomarker merely relates to a clinical variable and, therefore, is less difficult to establish than clinical utility (but also less meaningful)86. Clinical utility is achieved when the biomarker leads to net improvement of health outcomes or provides information useful for diagnosis, treatment, management, or prevention of a disease10,88.
The REMARK guidelines89 provide a framework for the assessment of clinical utility and validation. Typically, prospective testing of IBs in clinical populations is required in adequately powered studies with follow-up times of 3–5 years to provide outcome data associated with the IB. For example, in patients with colorectal cancer, parameters such as circumferential resection margin56, depth of tumour spread, and extramural vascular invasion90, are assessed by MRI before surgery, have been validated as preoperative prognostic and predictive IBs, and are currently used to stratify patients into treatment groups. Similarly, ongoing work is evaluating the role of tumour CT, MRI and PET 'radiomic signatures' (Ref. 91) (Supplementary information S7 (box)) or 18F-FDG SUVmax baseline values65 as prognostic IBs. In these examples, IB data are derived from routine clinical images, making the process of establishing clinical utility much faster than is typically possible for IBs derived from techniques not yet established in routine healthcare61.
IBs that have demonstrated clinical validity can have important roles in drug development. For example, RECIST 1.1-defined objective response is a widely used phase II clinical trial end point that has proved useful for preliminary screening of new therapeutic agents45. More-rigorous criteria must be met to validate an IB as a trial-level surrogate end point to completely replace a well-established clinical trial end point. The Prentice criteria92 describe idealized conditions for demonstration of surrogacy, which unfortunately are rarely achieved and would typically require extremely large datasets93. Approaches for surrogate validation more pragmatic than the Prentice criteria involve the application of meta-analysis methods to a collection of large or moderate-sized datasets that measure both the putative surrogate biomarker and the definitive clinical end point94. These meta-analytic approaches often require some degree of IB standardization in order to combine datasets in a meaningful manner. Some IBs have been well-standardized (such as the RECIST 1.1 objective response), whereas others (for example, DCE-MRI-derived Ktrans53; Box 6) have not. Large prospective multicentre studies relating IBs to an outcome can only be initiated when exhaustive technical validation has established multicentre IB precision and accuracy. Clinical utility (and sometimes, validation) necessarily happens late in the IB roadmap; therefore, alternative validation strategies must be sought for many novel IBs. Biological validation can be approached by accumulating a platform of evidence linking the IB to meaningful biological features, and response to therapies with well-studied mechanisms of action71. These principles — outlined by Bradford Hill93,95 (Supplementary information S9 (box)) — establish evidence for IB performance on the basis of scientific coherence, specificity, strength of association, effect gradient, temporality and consistency.
If the Bradford Hill criteria are adopted, early preclinical imaging–pathology correlation studies5,96 provide an important component of the platform of evidence because whole-tumour histopathology is rarely possible in patients. Studies conducted with animals enable clinically relevant IBs to be related to fundamental biological processes that can only be measured with invasive techniques. For example, the relationship between DCE-MRI-derived Ktrans and tumour blood flow rate (a fundamental measure of response to antivascular agents) was investigated using a terminal radiotracer-based technique in rat tumour models97. Preclinical studies also enable the examination of the therapeutic dose–response relationship at different time points in a range of tumour models. Of note, however, pathophysiological assays are often considered as a 'reference standard', but are also biomarkers and thus imperfect, and also subject to sampling bias. Thus, some 'reference standard' tests might not enable the prediction of survival, or relate to the intended biological process. Moreover, precise pathophysiological correlates of some IBs might not exist, or can be almost impossible to obtain13.
Evidence of imaging–biology correlations and early response to therapies can provide sufficient biological validation to establish IBs as useful for PD assessments or monitoring response in early phase clinical trials, even in the absence of compelling outcome data53. The IBs 18F-FDG-PET SUVmax and Ktrans (Boxes 5,6) are illustrative of how IBs can cross translational gap 1 to guide decision-making for subsequent studies. For example, the biologically active dose for the antivascular agent ZD6126 calculated according to Ktrans measurements98 was shown to be greater than its maximum tolerated dose, effectively halting further clinical development of this agent. In other studies, Ktrans data informed of the biologically active dose for cediranib99 and the optimum scheduling for brivanib100 (Box 6).
Cost-effectiveness — domains 2 and 3
To be translated into the clinic, IBs must provide good 'value for money' and compare favourably with biospecimen-derived biomarkers resulting from technologies, such as 'liquid biopsies' of isolated circulating tumour cells or cell-free DNA. In the research setting, the value added by testing a key hypothesis (with an IB) should be greater than the cost of performing the study. In healthcare, the economic test is harsher, because even well-validated IBs will not cross translational gap 2 unless they offer an advantage in terms of cost per quality adjusted life year (QALY) gained13. Initially, translating IBs into the healthcare setting is costly and time-consuming101.
Studies in early stages of IB discovery and validation can access conventional biomedical research funding streams, whereas evidence-gathering studies to support IB qualification can be difficult to fund. For example, large-scale multicentre reproducibility studies can be unattractive to research funders if seen as 'incremental' and, unless such studies create exploitable intellectual property, they are not attractive to commercial sponsors. One approach to solve this problem is to assemble consortia of commercial and not-for-profit stakeholders, such as the QuIC-ConCePT consortium13,31, QIBA28 and QIN102, to undertake multicentre validation steps to meet the collective needs of the community.
Initial IB translation requires large-scale funding by government, charity or industry, or commercialization of the IB by an imaging company (for example, a major scanner manufacturer in the case of hyperpolarized 13C-MRI/MRS), or by a start-up business focused on a specific technique or IB. This strategy requires careful assessment of the likely risk–benefit of the development process, and strong intellectual property positions to ensure likely financial return on the imaging device, agent or biomarker38. Complex economic considerations are related to IBs developed as companion diagnostics associated with a specific therapeutic. Such IBs might be cost-effective for the healthcare payer (reducing futile expenditure and avoiding adverse effects in patients who cannot benefit from the drug) as well as beneficial for the supplier of the therapeutic (leading to a reduction in the size of trial cohorts, enabling marketing authorization, and providing better ratios of costs per QALY gained), but are associated with increased risks for the IB developer because, if clinical translation of the therapy fails (such as in the case of vintafolide48; Box 4), the market for the companion diagnostic disappears.
Qualification — domain 3
The term 'qualification' has different scientific and regulatory meanings. Generally, qualification is an evidentiary process of linking a biomarker with biological processes and clinical end points intended to establish that the biomarker is fit for a specific purpose96,103,104. For example, the use of IBs qualified as prognostic would enable the enrichment of a clinical trial population with patients at the highest risk of experiencing a clinical event, thus increasing the efficiency of the trial105. Such IBs have to be clinically validated to show their association with an outcome, but might not necessarily be associated with clinical utility for decision-making in healthcare. In distinction, the same IB could potentially be qualified for prediction of patient response (or lack of response) to a particular therapy based on evidence that the treatment effects in biomarker-positive patients differ from those in biomarker-negative patients (for example, superior outcome with experimental therapy in biomarker-positive patients versus biomarker-negative patients)106. The IB might be used for enrichment or stratification of patient populations in clinical trials and, if the trial is successful, such IB could be further developed into a companion diagnostic with established clinical utility for identifying patients likely to benefit from the new therapy.
Qualification requires demonstration of fitness for a particular purpose and, therefore, an IB might need to be evaluated in several scenarios to justify a broad claim of qualification. Hence, the IB might become qualified for one particular use (such as for a cancer type or a certain drug class), but not for others. Nevertheless, when a biomarker is qualified in multiple settings (for example, different tumour types, different therapies or different research questions), then the process of qualification for a new application can be expedited because most of the necessary validation requirements are likely to have been fulfilled107.
'Regulatory qualification' is a more-specific term that describes a framework for the evaluation and acceptance of a biomarker for specific use in regulatory decision-making108 (Supplementary information S1 (table)). Examples include new drug approvals or safety monitoring. This framework for regulatory qualification is not a formal requirement for a drug developer who merely wishes to 'qualify' an IB to support a proof-of-mechanism or a dose-selection decision.
With this Consensus Statement we aim to accelerate IB translation. For this purpose, we have produced 14 key recommendations (Box 7), accompanied by a detailed imaging biomarker roadmap for cancer studies (Fig. 2). Together, the roadmap and recommendations provide a framework for understanding how to examine qualitative and quantitative IBs at all stages of their validation and qualification. Herein, we identify the key steps required to achieve these goals, recognizing the important differences between IBs and biospecimen-derived biomarkers. No biomarker is expected to be perfect; instead, this roadmap provides a tool to assess the current evidence for technical and biological and/or clinical performance of any given IB at any stage of development. The limitations of each IB can be identified, quantified, documented and made publicly available. In the process of creating this roadmap, we have identified several potential obstacles in the clinical translation of putative IBs.
Recommendations 1–2 — grant submissions and study publications. Proposals for funding to support IB-related studies should state clearly how these will advance IB validation or qualification. Resulting journal publications should state explicitly how these aims have been achieved (recommendation 1). Study design, protocols, quality assurance processes and standard operating procedures should be reported exhaustively by making full use of supplementary materials for research publications. The software used for image processing should also be reported and be made available once intellectual property rights have been addressed (recommendation 2). These two recommendations will result in the greatest possible confidence in the reported IB-related data, will facilitate the conduct of statistically valid meta-analysis studies of imaging data, and will help investigators, reviewers, target audiences, funders and governments evaluate the risks associated with each IB for any given research or healthcare application.
Recommendations 3–7 — technical (assay) validation. A compelling rationale supports the accreditation of clinical imaging laboratories as being competent for measuring a given IB (recommendation 3), in line with the standards set by the biospecimen-derived biomarker community4. This approach has been adopted by the PET community when performing quantitative 18F-FDG-PET in clinical trials both in Europe and North America, focusing on performing regular equipment calibration and quality assurance. Moreover, data acquisition, analysis and reporting standards have been adopted32. In Europe, this initiative has been driven by the EANM and endorsed by the EORTC to ensure that institutions involved in multicentre clinical trials adopt best-practice procedures, and that quantitative reporting is harmonized across sites to improve reproducibility. Participants receive certification to distinguish them from nonparticipating institutions. Similarly, standardization of DCE-ultrasonography109 and LVEF measurements47 have been addressed by expert consensus. In the USA, the NCI evaluates performance of dynamic and static PET, volumetric CT and DCE-MRI in partnership with the ACRIN29. All NCI-designated cancer centres have been certified for measurement value equivalence, with ongoing annual inspection to provide regulation39. This certification involves the incorporation of IBs in clinical trial design and requires evaluation of the performance of imaging sites by the NCI Clinical Trials Network26.
Site accreditation is an important step towards improving technical performance in multicentre studies, but must reflect widely sought academic consensus, become adopted by international societies, and receive the backing of funders, industry and regulators for such accreditation to have value. Accreditation must simultaneously promote standardization and harmonization of IBs for multicentre use, while accommodating studies led by investigators who have scientific freedom to further develop and optimize IBs. To achieve this optimization, best-practice guidelines for each widely used IB (or related family of IBs) must be updated and reviewed regularly (recommendation 4), because biomarker drift is inevitable owing to technological advances in scanner performance (for example, clinical trials with ongoing data collection can be affected by hardware and software upgrades). Suitable statistical methods, such as multivariate linear regression analysis and other more complex statistical approaches, must be used to adjust for changes in IBs during ongoing studies (for example, defining pre-change and post-change data).
IB precision must be demonstrated early in IB development through single-centre repeatability studies, or few-site reproducibility studies (recommendation 5). This assessment is particularly important when testing the ability of an IB to measure the effect of therapeutic intervention. The choice of performance metric is important; for example, the coefficient of variation assumes that the standard deviation is approximately proportional to the mean. If studies of repeatability are performed under multiple conditions (such as imaging patients across different groups), then a plot of the standard deviation versus mean should be examined to determine whether this proportionality is valid. In most imaging studies of repeatability, only two replicates are acquired for each patient and thus, a Bland Altman plot will enable the evaluation of whether or not the mean of the measurements influence on variance, providing overall limits of agreement110.
Once the IB is shown to have sufficient technical and biological validity to cross translational gap 2, multicentre reproducibility must be evaluated (recommendation 6). Some studies have measured IB multicentre reproducibility (for example, DCE-CT evaluation of ovarian cancer111), but these are rare exceptions. The recruitment of patients with cancer to attend for scanning on devices at multiple centres is seldom possible and thus, studies usually require the use of data from different patients, scanned on different machines. Considerable centre-specific differences might exist regarding devices, contrast agents and tracers, and software. This variability must be accounted for when considering multicentre reproducibility. Mixed-effects modelling provides a statistically robust approach to maximising data inclusion, while acknowledging inevitable slight inconsistencies in the data112.
Data-analysis strategies must be developed for multicentre studies (recommendation 7). Analysis led by one central site can reduce data variation for studies with a moderate number of participating sites113,114. As IBs transition towards crossing gap 2, however, using one central site is inappropriate for IBs that will eventually be analysed at many cancer centres once the IB has been adopted in healthcare. To facilitate this transition, sites should compare their own technical performance against a central analysis, similarly to the assessment of objective responses in oncology trials.
Recommendations 8–11 — biological validation and clinical validation. Clinical validation occurs relatively late in the development process for most IBs compared with biospecimen-derived biomarkers71. Extensive preclinical studies can provide well-controlled data to examine the relationship of the IB to pathology, and the IB to the effects of interventions, and therefore are strongly encouraged (recommendation 8). The choice of experimental model is an important consideration. Tumour xenograft models in immunodeficient mice are well-studied and have reproducible growth characteristics115, and can be ideal for initial IB development, but tumour models that better portray the relevant biological characteristics found in human cancers are also needed116,117. Firstly, in situ tumours, or those implanted into their orthotopic site often better recapitulate the local microenvironment of human tumours117,118. Secondly, syngeneic models, with intact immune systems, are essential for some studies (for example, for the evaluation of IBs for immunotherapies). Thirdly, appropriately genetically-engineered mouse models can have lesions that accurately mimic human tumours119, an approach that can facilitate co-clinical trials in which preclinical studies are run in parallel with clinical trials in order to identify likely responders to targeted therapies118,120. Finally, models derived from patient tissue (PDX models)121 or circulating tumour cells (CDX models)122 offer potential insights into developing personalized therapeutic regimens123. The biological validation of IBs must incorporate the use of these models once proof-of-concept has been demonstrated in xenograft models. When possible, IBs should then be validated in clinical studies124, in order to confirm imaging–biology relationships in humans. Adaptive trial designs can be useful for early stage IB studies125. Sample-size re-estimation can be performed in those studies in which limited relevant data inform on sample size126. Similarly, group- sequential design127 can provide flexibility in the number of animals or patients entered into a study. Such approaches can ensure adequate power with small sample sizes — if effect sizes are suitably large — and, therefore, can make imaging studies more affordable128.
Improved methods for imaging–biology correlation are needed for IB biological validation (recommendation 9), because currently available methods often fail to account for extensive spatial heterogeneity in the IB129 and in the tissue pathology118, and for the difference in scale of the two measurements. Moderate-level concordance between imaging data (slice or volume) and pathology (single section) provide only limited biological validation. Instead, 3D comparisons are strongly encouraged, to better co-localise imaging and pathology data and reduce sampling bias. In some studies, genomic, transcriptomic and proteomic biomarkers, or biofluid-derived biomarkers, including circulating tumour cells, might be more-appropriate validation tools than tissue-derived measurements91,130. The use of appropriate statistical methods is required to define associations between imaging and pathology measurements. Spearman's rho or Kendall's tau rank correlation coefficients enable the comparison of IB values to reference standard pathology readouts within a cohort; however, if these two measurements (or transformations of these variables, such as logarithms) relate linearly to one another, regression models provide estimates of biases and the scale of the relationship.
IB studies often generate rich datasets that can be collected and banked for future re-use. These data include raw images, processed images and parameter maps, ancillary files (such as regions of interest, mask files, and stored header information detailing scan parameter settings), and essential patient metadata covering demographics, treatment history and clinical outcome (such as response, progression-free survival and/or overall survival). Useful data-archiving systems require financial support; collaboration between multiple academic, industry and funding partners; and ongoing curation and active management — as exemplified by the NCI Informatics Technology for Cancer Research platform131. Technological advances (such as cloud-based solutions) must be accompanied by suitable information governance arrangements, potentially crossing international legal and regulatory frameworks102, and by commitment from funders to resource these initiatives.
Creation of animal and human cancer image repositories, following the lead of organizations including the NCI and the ESR132, is strongly encouraged (recommendation 10) to enable rapid testing of new analyses by researchers from different institutions. In some cases, this approach can even lead to a reduction in the number of animal experiments (in line with the 3Rs of animal welfare in cancer research116) or in the number of new patients recruited. Standardized data collection, analysis and archiving are required to support multisite clinical trials102.
Complete and transparent study reporting is essential to avoid selective reporting bias and publication bias (recommendation 11). Results of all prespecified analyses (secondary analyses as well as primary analyses) should be reported, regardless of whether they are consistent with expectations, or whether they achieve statistical significance133,134. Highlighted exploratory analyses (either prespecified or post hoc) should be accompanied by a description of all other exploratory analyses performed. This strategy will more-accurately reflect the potential for false-positive and false-negative findings. Adherence to these principles of reporting will help to eliminate the distortion of research findings resulting from selective reporting and failure to publish negative studies. These biases can also be reduced by registering studies and detailing their hypotheses on publically available websites (such as ClinicalTrials.gov) before the study is initiated or before outcome data are unblinded89,135.
Recommendation 12 — qualification. Robust study design and predefined statistical analysis plans are vital to ensuring that the highest quality evidence is available to qualify IBs. Late-stage multicentre clinical trials should be powered adequately to demonstrate clinically useful effects. At present, few IBs are rigorously validated or qualified as prognostic for quality of life, progression-free survival or overall survival136. Appropriate statistical methods (for example, control of the false-discovery rate and cross-validation techniques137) are needed to avoid spurious findings and overfitting caused by measurement of large numbers of image parameters relative to numbers of patients — a common problem in several published biospecimen-based biomarker studies138. Efficient study designs should be pursued, for example, in studies of diagnostic accuracy135 or test impact, in which each patient is regarded as his or her own control; examples include adding functional imaging to standard anatomical radiology in order to improve diagnosis139,140. In later-stage clinical trials, several treatments can be tested concurrently in the same trial using enrichment or predictive IBs, as is the case of current molecular profiling-based studies141. This measure can increase recruitment efficiency, reduce costs, and accelerate achievement of primary end points142.
Qualification requires rigorous and detailed statistical reporting standards. Screening and diagnostic accuracy IB studies should report sensitivity and specificity, results of receiver operating characteristic (ROC) curve analysis, and negative and/or positive predictive values135, whereas prognostic and predictive IB studies should report estimated effect sizes, such as hazard ratios89. Estimates should be accompanied by measures of uncertainty (for example, 95% confidence intervals) as well as statistical significance. In studies in which predictive biomarkers are evaluated using randomized controlled trials, CONSORT guidelines should be followed143.
Recommendations 13–14 — cost-effectiveness. New models for funding and regulation should be developed for investigational devices, tracers and contrast agents, and software that have value as IBs in the research setting, but lack commercial viability as a diagnostic product in healthcare systems (recommendation 13). Imaging studies are perceived to be expensive and thus, integration of investigational IBs should be linked to existing radiological tests (addition of the IB to a clinically approved ultrasonography, CT or MRI examination, for example) whenever possible. Large-scale studies should include health-economic considerations, including measuring the cost-effectiveness of IBs versus competitor tests (other IBs or biospecimen-derived biomarkers)144 (recommendation 14).
Clinical imaging has transformed contemporary medicine145. IBs have enormous potential to facilitate further advances in cancer research and oncology practice by accurately informing clinical decision-making, but must undergo rigorous scrutiny through validation and qualification to achieve this end13. This process of evaluation enables investigators and consumers to make informed decisions about IB translation for each research and healthcare application10. The roadmap and recommendations that we present herein will, if adopted, mark a change in the development and use of IBs in cancer research and patient management.
Development of this roadmap received support from Cancer Research UK and the Engineering and Physical Sciences Research Council (grant references A/15267, A/16463, A/16464, A/16465, A/16466 and A/18097), the EORTC Cancer Research Fund, and the Innovative Medicines Initiative Joint Undertaking (grant agreement number 115151), resources of which are composed of financial contribution from the European Union's Seventh Framework Programme (FP7/2007-2013) and European Federation of Pharmaceutical Industries and Associations (EFPIA) companies' in kind contribution.
Definitions of key terminology
Important factors to consider when validating and qualifying imaging biomarkers (IBs)
Receptor occupancy: pharmacodynamic IB
DCE-US AUC: pharmacodynamic and prognostic IB
Quantitative ADC: putative pharmacodynamic IB
MRI 13C-pyruvate kp: putative pharmacodynamic IB
Radiomic signature of heterogeneity: putative prognostic IB
Generating biomarkers from an imaging technique
Examples of Bradford Hill criteria for biological validity of IBs
Maximum standardized uptake value is a parameter used routinely in clinical medicine to identify and quantify avidity of tracer uptake in PET studies.
Volume transfer co-efficient describes the transendothelial transport of low molecular weight contrast agent from blood vessels into the extravascular-extracellular space by diffusion. This imaging biomarker is commonly used in studies of antivascular agents, and measures a composite of blood flow, vessel permeability and vessel surface area.
Apparent diffusion co-efficient is a commonly used imaging biomarker in studies of various therapies, including cytotoxic chemotherapy, targeted agents and radiotherapy. The biomarker is sensitive to motion of water molecules.
- Double baseline studies
Biomarker precision can be assessed by measuring on multiple occasions and calculating the repeatability or reproducibility of the parameter. Typically, this precision is achieved by measuring the biomarker twice at baseline in the absence of any treatment effects.
Integrated area under the curve at 60 seconds is a nonspecific imaging biomarker of the tumour vasculature used in studies of antivascular agents. The biomarker measures a composite of blood flow, blood volume, leakage space, vessel permeability and vessel surface area.
About this article
Nature Reviews Clinical Oncology (2018)