Introduction

Oral epithelial dysplasia (OED) is a chronic, progressive precursor epithelial disorder of the oral mucosa, characterised by abnormal maturation and stratification of the surface epithelium1. It is associated with a statistically increased risk of progression to oral squamous cell carcinoma (OSCC) which is among the topmost common cancers worldwide and has an increasing incidence and worsening prognosis2,3. Clinically, OED most commonly presents as a white patch/plaque (leukoplakia) with up to 50% of biopsied lesions showing dysplasia4 and malignant transformation rate of 9.5% [99% CI 5.9–14.00%] or 1.56% per year5. OED can also be seen in other oral potentially malignant disorders (OPMD), a group of lesions and conditions characterised by an increased risk of malignant transformation, including oral submucous fibrosis, actinic keratosis, erythroplakia and erythroleukoplakia6,7. The presence of OED in these disorders increases their risk of malignant transformation8.

At present, there are no biological or molecular markers proven to be prognostically significant (or in routine diagnostic use) for OED4. Histological grading remains the gold standard for predicting malignancy risk and is used to inform patient treatment and prognosis9. Over the years, OED grading systems have substantially evolved, and the current World Health Organisation (WHO) classification (2017) grades dysplasia based on the presence of sixteen different histological features10. The ‘severity’ of these features, both in terms of frequency and location in the epithelium, are used to classify lesions into ‘low’, ‘moderate’ and ‘high’ grades, representing an increasing risk for malignant transformation9. A recent meta-analysis showed moderate/severe OED to be associated with a greater risk of malignant transformation compared to mild OED with an odds ratio of 2.4 (99% CI 1.5–3.8)5. However, it remains unclear which lesions will progress, and which will recur, as the mechanisms for OED progression are poorly understood9. Furthermore, the histological features can individually be considered relatively non-specific, and some (or all) of the features may be seen in different grades of dysplasia, of which some lesions will transform, and others will not (irrespective of grade).

In addition to these issues, there are a number of other problems related to the current grading system11. Firstly, there is substantial subjectivity in histological interpretation between pathologists, which can result in wide inter- and intra-observer variability, with potential for an incorrect grade being assigned12. This variability can arise since individual features are ill-defined, and this is further complicated by division of the epithelium into ‘thirds’ which can be challenging. Secondly, grading does not reliably predict prognosis which means that lower grade lesions may progress to OSCC whereas higher grade lesions may remain static4,10. Thirdly, several of the established histologic features can also be seen in reactive lesions, such as the margins of ulcers or candida infections. It is accepted that a complex interaction exists between a combination of features including histological atypia, progressive molecular changes and chromosomal derangements to trigger cancer development, but the individual importance of these features in OED progression is not well established13,14.

More recently, an alternative binary grading system (low/high grade) has been proposed15. This system grades dysplasia based on the overall number of cytological and architectural changes observed, and several studies have shown its improved reproducibility, inter-observer agreement and clinical utility as compared to the WHO system15,16. Despite these improvements though, neither systems consider the importance of individual histological features, or specify which of the features (in isolation or combination) are of greatest relevance for transformation and recurrence. Some older studies have compared OPMDs that did not transform to lesions that did17, and others have linked certain histology features to a higher transformation risk18. However, conclusions from these studies should be treated with caution due to weaknesses in the proposed methodologies.

The aims of this study are twofold: first, to conduct a detailed histological assessment (and inter-observer agreement) of individual OED features to identify which were most prevalent and associated with a higher risk of malignant transformation and recurrence; second, to develop and propose feature-specific prognostic models for OED outcome prediction. To the best of our knowledge, this is the first study to explore histological feature-specific prognostic prediction of OED.

Materials/subjects and methods

Case selection, tissue preparation and conversion to digital images

A retrospective sample of sequential OED cases were retrieved between 2008 and 2013 from the Oral and Maxillofacial Pathology archive at the School of Clinical Dentistry (Sheffield, UK) using a local digital database (ethical approval: 18/WM/0335). To confirm cases which had progressed to OSCC at the same clinical site, a regional head and neck cancer (HNC) electronic records system was accessed which is a repository for HNC cases within South Yorkshire. Newly stained 4 µm Haematoxylin and Eosin (H&E) sections of the selected cases were obtained from formalin fixed paraffin embedded blocks and a digital slide scanner (Aperio CS2, Milton Keynes, UK) was used to obtain whole slide images (WSI) at x40 magnification.

Inclusion and exclusion criteria

The principal inclusion criteria were varying grades of OED retrieved from the Sheffield Oral and Maxillofacial Pathology archive with sufficient available tissue and availability of minimum five-year follow-up data. Where multiple biopsies had been taken over a period of follow-up, only the initial biopsy was selected for the study. The unit of Oral and Maxillofacial Pathology at Sheffield is a regional and national referral centre which receives referrals from a wide geographical area, however, following a confirmed tissue diagnosis any necessary treatment is provided by a local core Oral and Maxillofacial team and therefore cases treated outside this unit were by default excluded in this study. Additionally, cases were excluded if there was insufficient tissue for histological analysis, incomplete minimum follow up data or histological evidence of positive tissue margins on the subsequent excision (to avoid any bias in the recurrence data). The H&E slide and clinical records for all selected cases were reviewed by two authors (HM, SAK) to ensure the inclusion criteria was met.

Clinical data collection

Minimum five-year follow-up data was obtained from clinical notes and biopsy forms by HM. Data collection included patient demographics/characteristics (age, gender, intraoral site), histological OED grade and two main clinical outcomes of interest (time to transformation and recurrence). Transformation was defined as a dysplastic lesion which had progressed to OSCC at the same clinical site and within the follow-up period, and recurrence was defined as a dysplastic lesion which occurred again in the same clinical site following active treatment (i.e. surgical excision or laser treatment) within the follow-up period. All data was recorded by HM in a structured proforma using Microsoft Excel (2016) in an anonymised-linked format.

Histological evaluation and examiners

Three experienced oral and maxillofacial pathologists (NMI, OK, SAK) working in different international centres performed independent histological examination of the OED cohort. All pathologists were provided access to the WSIs via a cloud-based system. Each WSI was labelled with an anonymous-linked number, and all pathologists were blinded to the original diagnosis and clinical outcomes. The examiners were asked to independently assess the cases and identify which histological features amongst the WHO criteria were present and informed the diagnosis. They were also encouraged to specify any additional histological features which were considered important in influencing their diagnosis.

To determine which OED features were most prevalent, the examiners were asked to provide a binary score to record the presence (or absence) of individual features; a score of 1 was given if the feature was abundantly visible (and influenced diagnosis), and a score of 0 if the feature was absent or rare/focal. The topmost common histological OED features (as per consensus scoring) were further explored to determine feature-specific observer agreement and prognostic significance. To minimise examiner bias, no formal calibration exercises were attempted, although there was an informal discussion between the examiners to discuss their approach to this task. For consistency and to prevent double counting of similar appearing histological features, the pathologists agreed on general definitions for individual WHO features (as well as other commonly presenting features). For example, basal cell hyperplasia was considered if crowding/proliferation involved 1–2 layers of basal cells, whereas loss of epithelial stratification was considered if there was a disturbance in the organised ‘stratified’ layers of the epithelium and the layers were haphazardly organised or difficult to separate.

Finally, the original OED histological grades were independently reviewed by HM and where necessary, an updated grade was assigned. A standardised score sheet was designed in Microsoft Excel (2016) to record all examiner scoring and aid systematic analysis. All participating pathologists were clinical-academic pathologists with long-standing experience in the diagnosis of OED and OSCC.

Statistical evaluation

Statistical analyses were conducted using the Stata Statistical Software19 (Version 17, 2021). The prevalence of OED features was calculated overall and for each examiner. Observer agreement was summarised as the percentage of patients for whom all three examiners agreed, and by two chance-corrected measures (Cohen’s Kappa and Gwet’s AC), where a value of 1 denotes perfect agreement and 0 relates to no agreement beyond chance alone.

Univariate associations between pathological features and clinical outcomes (transformation and recurrence) were visualised by Kaplan–Meier curves and analysed using a Cox proportional hazards regression model with Efron’s correction for tied times. Thereafter, two prognostic models were developed in which the outcome of interest was event (transformation and recurrence) at any time. The prognostic performance of the two models were compared against each other as well as against patient/clinical characteristics (age, gender, intraoral site) and histological OED grade alone by generating the area under the receiver-operator characteristic curve (AUROC). All statistical tests were two-tailed and p < 0.05 were considered statistically significant.

Results

Characteristics of the study cohort

151 previously diagnosed cases of OED were retrieved during the study period, of which 42 were excluded due to either insufficient tissue availability or incomplete minimum five-year clinical follow up data. Amongst the patient cohort, 67 (61%) were male and 42 (39%) were female with a median age of 67 years (IQR 57–77). Breakdown based on intraoral site were as follows: tongue 44 (40%), floor of mouth 23 (21%), buccal mucosa 17 (16%), gingivae 7 (6%), soft palate 6 (6%), hard palate 6 (6%) and lower lip 6 (6%). The clinical records showed that 34 (31%) of OED lesions were clinically monitored, 70 (64%) were surgically excised and 5 (5%) were treated with laser.

Prevalence and agreement of OED features

The final study cohort (Table 1) comprised 109 OED cases which were blindly re-evaluated to confirm 34 (31%) mild, 48 (44%) moderate and 27 (25%) severe dysplasia cases. Binary grading of these cases showed 73 (67%) to be low grade and 36 (33%) as high-grade lesions. Table 2 summarises the prevalence and observer agreement for the twelve most prominent OED features that were observed as per consensus scoring. The most common features were basal cell hyperplasia (72%) and irregular surface keratin (60%). The latter feature refers to any irregularity of the keratin layer, including a corrugated, shaggy or desquamative appearance. This feature was included since all pathologists highlighted it as a prominent feature in certain cases, and at present it is not on the list of WHO criteria. The least common were verrucous surface morphology (26%), loss of epithelial cohesion (30%), lymphocytic band (34%) and dyskeratosis (34%). All other features ranged between 36% and 57%.

Table 1 Characteristics of the study cohort.
Table 2 Observer agreement for OED feature analysis.

Verrucous surface morphology had the highest agreement between pathologists (Kappa = 0.73, Gwet’s AC1 = 0.83). Gwet’s AC1 measurements were comparable for abrupt orthokeratosis (0.66), lymphocytic band (0.67) and loss of epithelial cohesion (0.69). Agreement for all other features was typically modest, with the worst agreement for hyperchromatism (Kappa and Gwet’s AC1 both = 0.32) and suprabasal mitoses (Kappa and Gwet’s AC1 both = 0.34) for which all three pathologists agreed for approximately half the patients.

OED feature-specific incidence of transformation and recurrence

Table 3 summarises feature-specific incidence for transformation and recurrence. Overall, 20 (18%) OED lesions transformed, and 27 (25%) lesions recurred following treatment. A higher incidence of transformation was seen when bulbous/drop shaped rete pegs (30%), loss of epithelial cohesion (35%), loss of stratification (34%) and nuclear pleomorphism (32%) were observed. The incidence of recurrence was also higher related to these same four features, as well as suprabasal mitoses (37%) and nuclear pleomorphism (41%).

Table 3 Incidence of transformation and recurrence by OED feature.

Feature-specific correlation to clinical outcomes

Table 4 summarises the hazard ratios and p values of individual OED features for their time to the two clinical outcomes of interest (malignant transformation and recurrence). Six features were associated with a greater rate of transformation: bulbous/drop shaped rete pegs (p = 0.005) hyperchromatism (p = 0.036), loss of epithelial cohesion (p = 0.003), loss of stratification (p = 0.001), suprabasal mitoses (p = 0.022) and nuclear pleomorphism (p = 0.005).

Table 4 Hazard ratios and p values of individual OED features for their time to malignant transformation and recurrence.

These same six features (bulbous/drop shaped rete pegs p = 0.036, hyperchromatism p = 0.015, loss of epithelial cohesion p = 0.001, loss of stratification p < 0.001, suprabasal mitoses p = 0.006, nuclear pleomorphism p = 0.002), in addition to dyskeratosis (p = 0.042), were also positively associated with recurrence.

Proposed prognostic models for OED

Two prognostic models were explored to assess the potential for reliably predicting clinical outcomes of OED. In all cases, the number of covariates was minimised to limit the impact of overfitting.

Prognostic model 1: Six-point scoring system

The first scoring system allocated one point for the presence of each of the six OED features which were associated with a greater incidence of transformation and recurrence (bulbous/drop shaped rete pegs, hyperchromatism, loss of epithelial cohesion, loss of stratification, suprabasal mitoses, nuclear pleomorphism). Since the hazard ratios for these features (Table 4) are reasonably similar, each feature is allocated equal weight.

Figure 1A and B (see Supplementary Material) present the Kaplan–Meier survival curves for time to transformation and time to recurrence in relation to the number of features present using the six-point scoring model. The predicted transformation rate at 2 years is estimated at 2% (95% CI 0–16%) for 0–1-point scoring, 0% for 2–3-point scoring and 31% (95% CI 19–48%) for 4–6-point scoring. At 5 years, these figures increase to 5% (95% CI 1–18%) for 0–1-point scoring and 38% (95% CI 25–55%) for 4–6-point scoring; there is no change in the rate for 2–3-point scoring (0%). For recurrence of OED, the respective predicted rates at two and five years were shown to be: 5% (95% CI 1–18%) and 7% (95% CI 2–21%) for 0–1-points; 3% (95% CI 0–22%) and 7% (95% CI 2–25%) for 2–3-points; 36% (95% CI 23–53%) and 49% (95% CI 34–65%) for 4–6 points. The lower recurrence and transformation rate seen for 2–3-point scoring compared to 0–1 points is unexpected but is likely to be related to the much lower number of cases in the 2–3 point category compared to the others. Validation on a more balanced larger cohort would be useful to determine the significance of these findings.

Fig. 1
figure 1

Kaplan–Meier curves for time to transformation and recurrence for feature count based on the six-point scoring system (A, B) and the two-point scoring system (C, D).

Few transformations and recurrences occurred more than five years post-baseline, and for simplicity the prognostic performance was assessed on the basis of whether the event happened rather than the time taken to occur. Figure 2 (see Supplementary Material) shows the receiver-operator characteristic curve (ROC) for these. The sensitivity and specificity appeared best balanced by using a cut off for either 4 or 5 points, with less events (for transformation and recurrence) when fewer features were present. The AUROCs for transformation and recurrence were 0.799 and 0.776, respectively.

Fig. 2
figure 2

ROC curves for transformation and recurrence for feature count based on the six-point scoring system (A, B) and the two-point scoring system (C, D).

Prognostic model 2: Reduced two-point scoring system

The second scoring system selected two features with the best inter-rater agreement, and which were also associated with transformation and recurrence (i.e. loss of epithelial cohesion and bulbous/drop shaped rete pegs). Figure 1C and D (see Supplementary Material) show Kaplan–Meier survival curves for time to transformation and recurrence based on the presence or absence of these two features. The combined presence of both features appeared to be associated with a higher risk of malignant transformation (39%, 95% CI 23–62%) at five years, in comparison to the presence of a single feature alone (loss of epithelial cohesion [16%, 95% CI 8–33%], bulbous/drop-shaped rete pegs [25%, 95% CI 7–69%]). However, the presence of bulbous/drop shaped rete pegs showed a higher risk of recurrence at five years (50%, 95% CI 23–85%) as compared to the presence of loss of epithelial cohesion (22%, 95% CI 11–39%) or when both of features were present in combination (43%, 95% CI 26–66%).

Effect of patient/clinical characteristics on prognostic models

The association between patient characteristics (age, gender, intraoral site), OED histological grade and clinical outcomes were also assessed. Overall, there was a modest association between patient characteristics and clinical outcomes. However, there was a trend for higher rates of transformation and recurrence amongst older patients compared to younger, and generally with higher graded lesions as well. Moderate OED lesions were associated with a marginally higher rate of malignancy and recurrence in comparison to severe OED lesions (31% vs 15%, 38% vs 26%, respectively, Table 5). The rates for intraoral clinical sites were, at best, modestly associated with dysplasia outcomes. None of the features had an AUROC as high as that achieved by the two scoring systems.

Table 5 Incidence for transformation and recurrence by patient characteristics and OED histological grade.

Table 6 illustrates the effect of adding the clinical characteristics (age, gender) and histological grade (WHO and binary) to each of the prognostic models, as represented by the AUROC. Adding age and gender into the models only marginally improved the predictive ability of the scoring (6-point model: 0.810 transformation, 0.804 recurrence; 2-point model: 0.810 transformation, 0.759 recurrence), reflecting the modest association of these characteristics with transformation and recurrence. Adding the histological grade improved the models further, particularly with the WHO grade compared to the binary grade (6-point model: 0.837 vs 0.812 for transformation, 0.812 vs 0.790 for recurrence; 2-point system: 0.843 vs 0.805 for transformation, 0.780 vs 0.755 for recurrence). The number of intraoral site categories and the relatively sparse number of patients for some sites meant it was not possible to jointly model this along with the proposed scoring approaches.

Table 6 AUROC for each model incorporating age, gender and grading.

Comparison of proposed models to existing grading systems

The prognostic ability of the two proposed models were compared against the existing grading systems20. Both the ‘six-point’ and ‘two-point’ proposed models yielded a higher AUROC than achieved by either WHO or binary grading systems, although not all these differences were statistically significant. The more detailed six-point model demonstrated a statistically significantly higher AUROC than achieved by the WHO grading system for both transformation and recurrence, but a more marginal improvement over binary grading. The two-point model showed a significant improvement over WHO grading for transformation alone (Table 7).

Table 7 Comparison of AUROC between two-point and six-point models with existing grading systems.

Finally, the prognostic performance of the new models was calculated separately for each of the three raters, reflecting how the models are likely to be used in clinical practice. Both models showed reduced prognostic ability when used by a single rater, indicating a greater risk for misclassification compared to models that were based on consensus agreement. Of the 12 single-rater AUC measures derived from the proposed models, 11 remained higher than those derived from corresponding WHO or binary grade (Table 8). Nevertheless, this analysis indicates that significant improvements on existing grading requires greater levels of agreement by assessors.

Table 8 AUROC for two- and six-point models calculated separately for each rater.

Discussion

This study reveals important and novel information about the prognostic significance of individual histological features of OED. We have demonstrated histological feature-specific correlation of OED to malignant transformation and recurrence, which has allowed us to propose two prognostic scoring models with a potential to simplify and aid OED diagnosis and grading in the future.

Overall, nine histological features were shown to be most prevalent amongst our OED cohort (Table 2). The top two most common features were basal cell hyperplasia (crowding) and irregular surface keratin; neither of which are currently part of the WHO criteria for OED diagnosis, although our study did not show them to be strongly linked to transformation or recurrence. The least prevalent features were verrucous surface morphology, lymphocytic band, loss of epithelial cohesion, dyskeratosis and nuclear pleomorphism. Interestingly, the latter three of these features were positively associated with clinical outcomes of interest; loss of epithelial cohesion (transformation p = 0.003, recurrence p = 0.001), nuclear pleomorphism (transformation p = 0.005, recurrence p = 0.002) and dyskeratosis (recurrence p = 0.042) indicating that the presence of the features and not the frequency within the cohort was more important. It is evident that certain architectural features may be consistently easier to detect (even at lower magnification) as compared to other features at cellular or nuclear level. The use of immunohistochemical markers, such as Phosphorylated Histone H3 (PHH3) and Ki67 can be considered as adjuncts for the assessment of mitosis and cell proliferation21, although more extensive evaluation of their usefulness as a prognostic indicator in OED is needed.

Our study showed observer agreement to be the highest for verrucous surface morphology, abrupt orthokeratosis, lymphocytic band and loss of epithelial cohesion, and worst for hyperchromatism and suprabasal mitoses, further highlighting the difficulty in objective analysis of certain features in clinical practice, particularly the more ambiguously defined cytological atypia. Several studies have investigated the variability in inter- and intra-observer agreement in the diagnosis and grading of OED, with substantially different outcomes ranging from poor to high observer agreement22,23,24,25. One of the challenges that arises in analysing inter-rater agreement is the variation that exists in pathologists’ understanding and definitions of features due to their inherently subjective nature further complicated by the numerous changes to classifications and reporting definitions over the years. Although digital WSIs were used to mitigate the issue of variations in staining of glass slides for each pathologist, the experience of digital reporting/analysis may have caused some variation. In this study, apart from informal discussions there were no formal calibration exercises arranged prior to histological examination, as we had intended for grading and feature scoring to be most reflective of the real world and routine clinical practice. To overcome any deficiencies in feature prevalence and agreement, two chance-corrected measures were used, including bias adjusted Kappa and Gwet’s AC1, as per statistical recommendation26.

We found six histological features (bulbous/drop shaped rete pegs, hyperchromatism, loss of epithelial cohesion, loss of stratification, suprabasal mitoses, nuclear pleomorphism) to be associated with a greater incidence of transformation and recurrence. Although it is well acknowledged that atypical verrucous hyperplasia and/or keratoses are a subset of OPMD, and that proliferative verrucous leukoplakia has a high reported rate of malignant transformation27,28, we did not find a statistical association between verrucous surface morphology and clinical outcomes in our study.

Although there was a modest association between patient characteristics and clinical outcomes, there is a statistical trend for higher rates of transformation and recurrence amongst older patients as well as higher graded lesions. This trend is well supported in the literature and is thought to be related to the aggregation of genetic alterations, immunosenescence and chronic exposure to environmental risk factors with advancing age29,30. Interestingly though, lesions graded as moderate dysplasia were associated with a marginally higher rate of malignancy and recurrence in comparison to severe dysplasia grades (31% vs 15%, 38% vs 26%, respectively, Table 5). These findings could be explained by differences in treatments and clinical follow-up, particularly in relation to moderately graded OED lesions which are both challenging to diagnose/grade and treat. The lack of robust treatment guidelines means there is huge disparity in the management of such lesions between surgeons. Although our patient cohort was diagnosed at a single centre, differences in treatment regimens between regional hospitals, and medical/social risk factors are likely to have contributed to potential differences in their management. This further highlights the need for improved diagnostic methods which are independent of grade for more objective OED prognostication as well as more standardised treatment pathways.

We developed and assessed the potential of using two relatively simple point-based scoring systems, based on the presence or absence of certain histological features. Using the six-point model, patient scoring ‘4–6 points’ were predicted to be at the highest risk of malignant transformation and recurrence at five years, estimated at 38% (95% CI 25–55%) and 49% (95% CI 34–65%), respectively. For the two-point model, predictions suggest that the presence of bulbous/drop shaped rete pegs alone have a greater predictive association with transformation (25%, 95% CI 7–69%) and recurrence (50%, 95% CI 23–85%) at five years, compared to the presence of loss of epithelial cohesion alone (transformation at five years: 16%, 95% CI 8–33% and recurrence at five years: 22%, 95% CI 11–39%).

Comparing the two systems, the six-point model had a greater discriminant performance with more separation of the survival and ROC curves (Figs. 1 and 2, see Supplementary Material). Although it is important to highlight that based on the modest agreement between pathologists seen in this study, it is inevitable that the performance of this system may be weakened if there was only a single assessor conducting the analysis. In contrast, the two-point model is a simplified approach that focusses only on the two features with the best inter-rater agreement (presence of loss of epithelial cohesion and/or bulbous/drop shaped rete pegs which are easier to identify). This model retained predictive ability contained in the groupings (especially for transformation) whilst being less susceptible to inter-rater disagreement.

The authors acknowledge a few limitations of this study. The first relates to the relatively small sample size which was obtained from a single centre. However, the department in question is a regional and national referral centre in the UK and therefore receives tissue samples from multiple hospitals covering a wide geographical area, thereby providing a sufficiently varied cohort for this pilot study. Furthermore, whilst the sample size may be considered small, it is larger than other studies which have explored OED analysis or proposed alternate OED grading classifications12,16,21. Nevertheless, application of these findings to substantially larger multicentre cohorts will allow more robust validation of the proposed potential prognostic models31.

To the best of our knowledge, this is the first study to propose feature specific prognostic scoring models for OED. The proposed models have the potential to provide pathologists with greater insight into the risk of individual OED lesions based on feature-specific analysis, which will in turn aid clinical decision making with regards to treatment and follow-up. Larger validation of the models is required on multicentric cohorts, with prospective analysis to explore the impact of other clinical determinants such as medical/social risk factors as well as effects of treatment and frequency of monitoring. There is clearly potential for strengthening the predictive ability of the models by incorporating such measures.

Greater clarity on the definitions (and examples) for individual architectural and cytological features will greatly benefit pathologists with OED diagnosis/grading and help to improve intra-observer agreement. There is clearly a need for the development of a universal minimum dataset for the reporting of OED lesions, as well as benefit in double/consensus reporting by two pathologists to ensure accurate diagnosis and early treatment.