Post-treatment FDG PET-CT in head and neck carcinoma: comparative analysis of 4 qualitative interpretative criteria in a large patient cohort

There is no consensus regarding optimal interpretative criteria (IC) for Fluorine-18 fluorodeoxyglucose (FDG) Positron Emission Tomography – Computed Tomography (PET-CT) response assessment following (chemo)radiotherapy (CRT) for head and neck squamous cell carcinoma (HNSCC). The aim was to compare accuracy of IC (NI-RADS, Porceddu, Hopkins, Deauville) for predicting loco-regional control and progression free survival (PFS). All patients with histologically confirmed HNSCC treated at a specialist cancer centre with curative-intent non-surgical treatment who underwent baseline and response assessment FDG PET-CT between August 2008 and May 2017 were included. Metabolic response was assessed using 4 different IC harmonised into 4-point scales (complete response, indeterminate, partial response, progressive disease). IC performance metrics (sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy) were compared. Kaplan-Meier and Cox proportional hazards regression analyses were performed for survival analysis. 562 patients were included (397 oropharynx, 53 hypopharynx, 48 larynx, 64 other/unknown primary). 420 patients (75%) received CRT and 142 (25%) had radiotherapy alone. Median follow-up was 26 months (range 3–148). 156 patients (28%) progressed during follow-up. All IC were accurate for prediction of primary tumour (mean NPV 85.0% (84.6–85.3), PPV 85.0% (82.5–92.3), accuracy 84.9% (84.2–86.0)) and nodal outcome (mean NPV 85.6% (84.1–86.6), PPV 94.7% (93.8–95.1), accuracy 86.8% (85.6–88.0)). Number of indeterminate scores for NI-RADS, Porceddu, Deauville and Hopkins were 91, 25, 20, 13 and 55, 70, 18 and 3 for primary tumour and nodes respectively. PPV was significantly reduced for indeterminate uptake across all IC (mean PPV primary tumour 36%, nodes 48%). Survival analyses showed significant differences in PFS between response categories classified by each of the four IC (p <0.001). 
All four IC have similar diagnostic performance characteristics, although Porceddu and Deauville scores offered the best trade-off of minimising indeterminate outcomes whilst maintaining a high NPV.

Fluorine-18 fluorodeoxyglucose (FDG) positron emission tomography-computed tomography (PET-CT) is central to characterising loco-regional and distant disease at initial staging and has an increasing role in post-treatment response assessment 4 . Randomised controlled trial data have shown that PET-CT performed post CRT is an accurate and cost-effective technique for assessing response and can spare 80% of patients from unnecessary neck dissection 5 . Post-treatment changes in the neck can make assessment difficult in some cases. Evidence suggests that human papilloma virus (HPV)-positive HNSCC behaves differently to HPV-negative disease, and the specific test characteristics of PET-CT for assessing treatment response in HPV-negative HNSCC remain unclear 5,6 .
Semi-quantitative methods of treatment response assessment using standardised uptake value (SUV) have not been shown to be accurate at predicting patient outcome which has led to the development of more reproducible qualitative interpretative criteria (IC) to assess post-treatment response [7][8][9][10][11] . Heterogeneity in criteria used for assessment also limits comparison between different response assessment studies.
More recently, qualitative IC such as the Porceddu, Hopkins and Deauville scoring systems (Table 1) have been developed and validated in HNSCC response assessment [12][13][14] . These rely on visual inspection of the relative difference in tumour metabolism compared to surrounding normal tissue and/or background uptake, which in the case of Hopkins is the internal jugular vein and in Deauville, is the mediastinal blood pool. Both Hopkins and Deauville criteria use 5-point scales, however scores 1 and 2 in both categories effectively represent a complete metabolic response. Porceddu criteria employ a 3-point scale which classifies scans as positive, negative or equivocal based on whether there is FDG activity greater than adjacent normal tissues and/or liver 12 .
Several studies have reported that qualitative assessment methods are useful for predicting regional control and can help minimise the number of equivocal scan results 13,15,16 . In 2016, the American College of Radiology convened a Neck Imaging Reporting and Data Systems (NI-RADS) Committee who have developed a template to help distinguish benign post-treatment changes and residual or recurrent tumour 17 . Currently there is no clear consensus regarding the optimal IC to use in this clinical scenario. Classifying 'equivocal' cases varies depending on which IC is used and differences remain in how these patients are subsequently managed, for example, undergoing invasive neck dissection or further follow-up imaging and clinical examination given the difficulty in differentiating a benign post-treatment response or residual/recurrent tumour 18 .
The primary objective of this study was to assess comparative accuracy and prognostic ability of the 4 different IC (NI-RADS, Porceddu, Hopkins and Deauville) in a large cohort of HNSCC patients treated with curative-intent (chemo)radiotherapy for predicting local and regional disease control and progression free survival (PFS).
Methods
Patient cohort. The study involved retrospective analysis of a prospective database performed under a waiver of informed consent and ethics approval by the Institutional Review Board. Prospective consent was obtained from all patients for use of their PET-CT imaging data in research and service development projects. Consecutive patients with histologically confirmed HNSCC treated at a tertiary referral centre between August 2008 and May 2017 with curative-intent non-surgical treatment (radiotherapy alone or chemoradiotherapy) who had undergone baseline and response assessment FDG PET-CT were included. Our institutional protocol is for response assessment PET-CT to be performed approximately 4 months after treatment. Demographics, baseline characteristics, staging, treatment and outcome details were retrieved from the institutional electronic patient record (PPM+, Leeds, United Kingdom). Exclusion criteria were: nasopharyngeal carcinoma; previous resection of primary or nodal disease; prior radiotherapy; and FDG PET-CT performed only at baseline or only for response assessment. Patients were treated with either three-dimensional (3D)-conformal radiotherapy or intensity-modulated radiotherapy (IMRT), which was gradually introduced into routine clinical practice from 2010. The 3D-conformal radiotherapy technique 19 and IMRT 20 have been previously described. Institutional protocols were followed with a radical treatment dose of 70 Gy in 35 fractions over 7 weeks or 65 Gy in 30 fractions over 6 weeks, with lower doses to prophylactic dose regions (54-63 Gy in 35 fractions over 7 weeks).
Induction chemotherapy with docetaxel, cisplatin and 5-fluorouracil (TPF) or cisplatin and 5-fluorouracil (PF) was delivered to a proportion of patients as previously described 21 . Concurrent chemotherapy routinely consisted of cisplatin 100 mg/m² at days 1 and 29.
Response assessment and follow-up. Tumour response was routinely assessed by clinical examination, naso-endoscopy where appropriate and FDG PET-CT approximately 4 months after completing treatment. Examination under anaesthetic and biopsies were performed at clinical discretion following response assessment. In general, patients who achieved a complete metabolic response did not undergo biopsy. Patients with less than a complete response were managed on an individual basis following discussion at a multidisciplinary team meeting. Subsequently, patients were followed up with physical examination and flexible endoscopy every 6-8 weeks in the first year after treatment, every 3 months for an additional 2 years and every 6 months until discharge at 5 years 22 .
PET-CT technique. FDG PET-CT examinations prior to June 2010 were performed on a 16-slice Discovery STE PET-CT scanner (GE Healthcare, Chicago, Illinois, USA) and from June 2010 to October 2015 on a 64-slice Gemini TF64 scanner (Philips Healthcare, Best, Netherlands). After October 2015 all scans were performed on a 64-slice Discovery 710 scanner (GE Healthcare, Chicago, Illinois, USA). Serum blood glucose was routinely checked and scanning was not performed if it was >10 mmol/L. Patients fasted for 6 hours prior to intravenous Fluorine-18 FDG injection (dose varied according to patient body weight). PET acquisition from skull vertex to upper thighs was performed 60 minutes after tracer injection. A silence protocol was employed in the uptake period following tracer injection to minimize physiological tracer activity within the head and neck region. The CT component was performed according to a standardized protocol (without the use of iodinated contrast medium) with the following settings: 120 kV; auto-modulated mAs; tube rotation time, 0.5 seconds per rotation; pitch, 6; section thickness, 2.5 mm (to match the PET section thickness).
Patients maintained normal shallow respiration during the CT acquisition. Images were reconstructed using a standard ordered subset expectation maximization (OSEM) algorithm with CT for attenuation correction. Both non-attenuation-corrected and attenuation-corrected datasets were reconstructed.
Image analysis. All response assessment PET-CT studies were evaluated by a trainee radiologist under supervision of a dual-accredited Radiologist & Nuclear Medicine Physician with 15 years' experience of reporting oncological PET-CT using specialised software (Advantage Windows Version 4.5, GE Healthcare, Chicago, Illinois, USA), and each of the four IC was applied. To accurately compare all four response assessment scales, each scale was re-classified into a 4-point scale as shown in Table 2 with complete response, partial response, indeterminate and progressive disease categories. Representative examples of these 4 categories are shown in Fig. 1.
Clinical follow-up. Follow-up was defined from the final fraction of radiotherapy treatment. Disease status post-treatment was determined from pathology and/or radiology correlation with review of electronic patient records for clinical outcome. In patients who did not receive a biopsy/surgical intervention, serial negative physical examinations over the follow-up period and any relevant imaging investigations were used as confirmation of disease-free status.
Statistical analysis. Survival and recurrence times were defined from the final fraction of radiotherapy treatment. Diagnostic performance metrics were calculated for each IC, applied to both primary tumour and nodes: sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and overall accuracy. Performance in sub-groups including HPV-positive oropharyngeal cancers (OPC), HPV-negative OPC and hypopharynx/larynx cancers was analysed.
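The diagnostic performance metrics named above follow from a 2x2 table of scan result against eventual outcome. The following is a minimal Python sketch, not the authors' analysis code; the `Confusion` class, the `diagnostic_metrics` function and the counts are illustrative assumptions and are not taken from the study data.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """2x2 table of scan result against outcome (hypothetical container)."""
    tp: int  # scan positive, disease recurred or persisted
    fp: int  # scan positive, no recurrence
    fn: int  # scan negative, disease recurred or persisted
    tn: int  # scan negative, no recurrence


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN rather than raising when a category is empty
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(c: Confusion) -> dict:
    """Standard definitions of the five metrics listed in the text."""
    total = c.tp + c.fp + c.fn + c.tn
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "ppv": _ratio(c.tp, c.tp + c.fp),
        "npv": _ratio(c.tn, c.tn + c.fn),
        "accuracy": _ratio(c.tp + c.tn, total),
    }


# Hypothetical counts for illustration only
example = diagnostic_metrics(Confusion(tp=40, fp=10, fn=20, tn=130))
```

How indeterminate scans are assigned to the positive or negative row changes every metric, so any such calculation should state that choice explicitly.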
Univariate association between recurrence (local and/or regional and/or distant) and each adjusted response assessment score (1-4) was estimated by the Chi-squared test. Kaplan-Meier analysis and Cox proportional hazards regression analyses were performed for each IC to assess cumulative progression free survival (PFS) and time to death (overall survival, OS) or progression. Log-rank testing was used to compare survival between the response categories within each IC. Receiver-operating characteristic (ROC) curve analysis was performed for
Table 2. Harmonisation process of the interpretative criteria into standardized 4-point scales.
Outcomes. Median follow-up period was 26 months (range 3-148 months). Median time from end of treatment to response assessment PET-CT was 17 weeks (range 6-31 weeks). 2-year survival outcomes were as follows: PFS 73%; OS 79%; local PFS 89%; regional PFS 85%; distant PFS 88%. 130 patients (23%) died in the study period with 432 patients (77%) alive at the time of analysis. 13 patients (2%) died within 6 months of treatment: one from a sudden cardiac event, two from tumour haemorrhage and 10 from disease progression. During follow-up, 156 patients (28%) developed progressive disease: 31 (20%) at the primary tumour site only (local failure), 42 (27%) at a regional nodal site only (regional failure), 16 (10%) at both the primary tumour and nodal site (loco-regional failure) without distant metastases and 35 (22%) with distant metastases only. 32 patients (21%) had local and/or regional failure with distant metastases. 11 cases (7%) of progressive disease were biopsy proven, 144 (92%) were based on radiology and 1 was a clinical diagnosis. 22 of 35 patients who developed distant metastases had these detected on response assessment PET-CT; 13 patients developed metastatic disease subsequently. Median time to loco-regional recurrence was 4 months (range 2-53).
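The Kaplan-Meier method behind survival figures such as the 2-year PFS can be illustrated with a short product-limit calculation. This is a minimal Python sketch on hypothetical toy data, not the software or data used in the study; the function name `kaplan_meier` and the convention that patients censored at an event time are still counted as at risk at that time are assumptions of the sketch.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  : follow-up duration per patient (e.g. months from final fraction)
    events : 1 if progression or death was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0  # events at time t
        leaving = 0   # events plus censorings at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed > 0:
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical toy cohort of eight patients (not study data)
toy_times = [3, 5, 5, 8, 12, 12, 20, 26]
toy_events = [1, 0, 1, 1, 0, 1, 0, 0]
km_curve = kaplan_meier(toy_times, toy_events)
```

Computing one such curve per harmonised response category gives the stratified curves that a log-rank test then compares.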
Kaplan-Meier, the log-rank test and Cox proportional hazards regression analyses showed significant differences in PFS and OS between response categories classified by each of the four IC (p < 0.0001). Pairwise log-rank results are provided in the supplementary information. The survival curves pre and post harmonisation are shown in Figs. 2 and 3.
Indeterminate cases. The number of indeterminate scores varied for each IC as shown in Table 4. With regard to primary tumour, NI-RADS classified 91 patients as indeterminate compared to 25 for Porceddu, 20 for Deauville and 13 for Hopkins. Overall, the NI-RADS IC scored more cases as indeterminate (i.e. equivocal) than the other 3 IC combined. Hopkins scored the fewest indeterminate cases.

Diagnostic performance of interpretation criteria. The diagnostic performance of each IC in predicting disease control with regard to primary tumour, nodal disease, HPV-positive OPC, HPV-negative OPC and hypopharynx/larynx sub-groups is displayed in Table 5. The performance of each IC in predicting complete response and progressive disease in the indeterminate groups is shown in Table 6.
The ROC analysis (Fig. 4) established that each of the IC were similar in their ability to predict disease outcome with areas under the curve (AUC) of 0.76 (NI-RADS), 0.76 (Porceddu), 0.75 (Hopkins) and 0.76 (Deauville) respectively.
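For an ordinal response score, the AUC equals the probability that a randomly chosen patient who progressed has a higher score than a randomly chosen patient who did not, with ties counted as one half. The sketch below illustrates that definition in Python on hypothetical data. It assumes the harmonised scale is coded 1 to 4 with higher values indicating a worse response, a numeric ordering that is an assumption of this example rather than something stated in the text, and `auc_from_scores` is an illustrative name.

```python
def auc_from_scores(scores, outcomes):
    """Area under the ROC curve by pairwise comparison (Mann-Whitney form).

    scores   : ordinal response score per patient (higher = more suspicious)
    outcomes : 1 if disease progressed, 0 otherwise
    """
    positives = [s for s, y in zip(scores, outcomes) if y == 1]
    negatives = [s for s, y in zip(scores, outcomes) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one patient in each outcome group")
    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5  # ties contribute half a concordant pair
    return concordant / (len(positives) * len(negatives))


# Hypothetical scores and outcomes for eight patients (not study data)
toy_scores = [1, 1, 2, 3, 4, 1, 3, 2]
toy_outcomes = [0, 0, 0, 1, 1, 0, 0, 1]
toy_auc = auc_from_scores(toy_scores, toy_outcomes)
```

Because a 4-point scale produces many tied pairs, the tie handling has a visible effect on the AUC, which is one reason the values for different IC can end up close together.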

Discussion
The use of qualitative assessment of FDG PET-CT post treatment in HNSCC was highly predictive of PFS and OS using four previously validated criteria (NI-RADS, Porceddu, Hopkins and Deauville) in our large patient cohort. All 4 adjusted IC demonstrated good discriminatory ability in predicting disease outcome with high specificity, PPV and NPV, which could help clinical decision making by stratifying patients into different management streams including continued observation, biopsy or salvage surgery. Compared to the existing literature, the PPV values of our study (83-95%) are slightly higher than other reported rates of 51-78% 14,15,23 . Diagnostic accuracy of response assessment PET-CT is affected by the time interval between treatment and follow-up imaging, and the later median time-point of imaging post radiotherapy (17 weeks) compared to other studies may account for the slightly higher PPV values in this study. Conversely, the NPV is lower (84-86%) compared to multiple other studies (86-97%) with smaller cohort sizes (largest 214 patients) 12,13,[23][24][25] . PET-CT was categorised as false-negative if recurrent cancer was diagnosed at any stage during follow-up, and the longest time to progression recorded was over 50 months from the end of treatment, whereas other studies limited this period to 6 months after the response assessment PET 14,23 and had a higher NPV. A comparable study assessing Deauville criteria for nodal response assessment post CRT in 105 HNSCC patients using the same methodology for false-negatives (any time during follow-up) had a similar NPV (86.4%) 13 . By restricting false negatives to those with recurrence developing within 6 months, the NPV of NI-RADS, as an example, increases from 85% to 94% in our cohort.
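The dependence of NPV on the false-negative definition can be made concrete with a small calculation. This is a minimal Python sketch under simplifying assumptions: the recurrence times are hypothetical, they are measured in months from the response assessment scan, and patients with short follow-up are not treated as censored. The function name `npv_with_window` is illustrative.

```python
def npv_with_window(recurrence_months, window_months=None):
    """NPV among scan-negative patients.

    recurrence_months : one entry per scan-negative patient; months from the
                        response assessment scan to recurrence, or None if
                        no recurrence was recorded
    window_months     : count a recurrence as a false negative only if it
                        occurs within this window; None counts recurrence
                        at any time during follow-up
    """
    false_neg = 0
    true_neg = 0
    for months in recurrence_months:
        recurred_in_window = months is not None and (
            window_months is None or months <= window_months
        )
        if recurred_in_window:
            false_neg += 1
        else:
            true_neg += 1
    if true_neg + false_neg == 0:
        raise ValueError("no scan-negative patients supplied")
    return true_neg / (true_neg + false_neg)


# Hypothetical cohort: 17 patients without recurrence, 3 with late or early relapse
toy_negatives = [None] * 17 + [3, 14, 50]
npv_any_time = npv_with_window(toy_negatives)
npv_six_months = npv_with_window(toy_negatives, window_months=6)
```

In this toy cohort the NPV is 0.85 when recurrence at any time counts as a false negative and 0.95 with a 6-month window, the same direction of change as reported for NI-RADS.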
There was considerable variation in the number of cases classified as indeterminate between different IC, with far more scores in this category when applying the NI-RADS IC. This likely reflects the subjective nature of the NI-RADS indeterminate group, which includes all cases with focal mild to moderate mucosal FDG uptake without giving a reference area of uptake such as the IJV (Hopkins) or mediastinum (Deauville), thereby making it more difficult to split these cases up compared to the other IC 17,26 . The overall mean recurrence rate of 53% (range 42-69%) in NI-RADS category 2 (low suspicion for recurrence) patients in this study is also much higher than the previously reported figure of 17.2%, highlighting that more work in large cohort studies is required to validate this 26 . One advantage identified for the Hopkins IC is the low number of indeterminate cases; however, the NPV was lower, particularly for HPV-positive (87.6%) and HPV-negative (77.4%) groups. Porceddu and Deauville provided the best trade-off, minimising indeterminate scores whilst maintaining a high NPV. Individual centres should apply one IC consistently across all patients to facilitate more standardised reporting and allow for future comparisons between institutions.
Interestingly, the NPV for HPV-positive OPC patients was higher than for the HPV-negative sub-group. Fakhry et al. previously reported that HPV-positive status was a good prognostic indicator with better CRT sensitivity and patient outcome 27 . This is relevant in indeterminate cases, where use of these IC may provide more information on guiding optimal management between neck dissection or surveillance. Previous research has demonstrated no association between HPV status and other semi-quantitative imaging markers in relation to predicting recurrence 28 . The higher NPV in HPV-positive patients may be potentially useful for clinicians when considering additional treatments such as neck dissection.
The prognostic value of PET is more uncertain when FDG uptake is equivocal/indeterminate, with a low PPV across all four IC. This observation is limited by the relatively low number of cases in this sub-group, with a median of 22 across all primary tumour and nodal cases and as few as one in the HPV status and hypopharynx/larynx subgroup analyses. The ability to distinguish more accurately between benign post-treatment inflammation and residual disease remains of paramount clinical importance as each scenario would require significantly different patient management. In longitudinal PET studies assessing lymphoma, equivocal scans have proved to represent a good rather than bad prognosis 29 . In the meantime, as advocated by IC such as NI-RADS, indeterminate cases may be best followed up non-invasively with imaging in the form of a contrast-enhanced CT or PET 17 . One option is to perform a second interval PET-CT response assessment. Porceddu et al. recommend a further repeat PET-CT 4-6 weeks later (16 weeks post treatment) if the first one shows indeterminate response, with no subsequent cases of nodal failure 12 . Similarly, a recent publication from our group highlighted that, on a second-look PET-CT performed a median of 13 weeks after the first response assessment PET-CT (median 30 weeks post treatment), the majority of incomplete response cases converted to a complete metabolic response 30 . Follow-up imaging at an earlier time point results in a higher number of false positive results 31 . This warrants future evaluation in a larger prospective cohort.
Inter-observer agreement of IC was not assessed in this study, mainly because previous work has shown these IC to be highly reproducible 14,16 . Limitations include the retrospective study design, the heterogeneous patient cohort with different sites of HNSCC and the slight difference in treatment, with the majority having CRT but a small group having radiotherapy only.
Emerging studies exploring the utility of radiomic features extracted from head and neck cancers highlight the potential for more accurate prediction of disease progression using novel imaging signatures which could be augmented by artificial intelligence techniques [32][33][34] . Although there is no current clinical implementation of a radiomic-based decision-support system in this clinical scenario, in the future this may emerge and could result in better patient stratification and personalization of treatment 34 . Some challenges remain ahead of this including a need for greater data transparency, multi-centre collaborations for cross-validation and to confirm reproducibility of radiomic analysis methods 34 .
Table 6. Diagnostic performance of interpretative criteria for prediction of complete response and progressive disease for indeterminate scores applied to all primary tumours, all nodal disease, HPV-positive OPC, HPV-negative OPC and hypopharynx/larynx sub-groups. () = number of indeterminate cases. Key: OPC = oropharyngeal cancer.