Reliability of colour and hardness clinical examinations in detecting dentine caries severity: a systematic review and meta-analysis

Dental caries is the most common human infectious disease and is caused by microorganisms producing acids, resulting in changes in dental tissue hardness and colour. However, the accuracy and reliability of dentine colour and hardness as indicators of carious lesion severity have never been assessed in a systematic review. By applying strict criteria, only seven papers (five randomized controlled trials and two diagnostic studies) were considered for full text qualitative and quantitative assessment. Only three studies produced high quality evidence and only four articles were considered for meta-analysis, as these provided log10 colony forming units (CFU) data from caries biopsies following colour and hardness clinical examinations. When comparing the amount of CFU isolated from carious biopsies from different colour and hardness categories, hardness clinical examination was found to be a statistically more discriminating test than colour clinical examination. Therefore, hardness clinical examination is more specific and reliable than colour for detecting dentine carious lesion severity. Further large, carefully designed clinical studies are needed to consolidate the findings of this systematic review.

The most common methods used for dental caries detection in clinical practice are visual and tactile examinations 15. Under normal daylight, the colour of a lesion is categorized by visual comparison with a standard guide of four shades (yellow, light brown, dark brown, and black), which is prepared from photographs of primary dentinal carious lesions. The texture or hardness of lesions, on the other hand, is classified into three grades (hard, medium or leathery, and soft) as described by Hellyer et al. 16. Briefly, under standard dental lighting, hard lesions are as hard as the surrounding tooth tissue; leathery lesions are penetrated by a new Ash No. 6 probe under modest pressure but display resistance to its withdrawal; and soft lesions are easily penetrated by a new Ash No. 6 probe under modest pressure and display no resistance to withdrawal of the probe 10,17. Although visual and tactile examinations produce highly specific outcomes, their subjective nature results in low reproducibility and low sensitivity 18. Studies have indicated that cariogenic microorganisms produce acids that destroy tooth structure, resulting in changes in the colour, consistency, and moisture content of dental tissue. For instance, darker and softer carious lesions contained larger numbers of microorganisms 10,14,19,20. Thus, to improve the sensitivity, specificity, and reliability of clinical dentine caries detection methods, it is suggested that visual and tactile criteria for caries detection should be assessed in relation to carious microbial activity, as this would provide an accurate and consistent method to assess caries quantitatively and qualitatively.
Therefore, the aim of this study was to systematically review the literature on the colour and hardness of dentine caries and their association with microbial activity, particularly the amount of microbial colony forming units (CFU) isolated clinically from biopsies of these lesions following colour and hardness clinical examinations. This could indicate which of the two, colour or hardness, is the more specific and reliable reflection of the severity of carious lesions.

Methods
The Preferred Reporting Items for Systematic reviews and Meta-Analysis (PRISMA) guidelines were adopted for the current study 21.
Search strategy. Studies that assessed the association between colour and hardness with the quantification of CFUs from biopsies of these carious lesions as an indicator of dentinal carious lesion severity were accessed using a defined search strategy in the following electronic databases from 1950 to April 1st, 2018: PubMed, Medline via Ovid, and Web of Science. Handsearching was also performed by accessing the following journals: Community Dentistry and Oral Epidemiology, Caries Research, Journal of Dental Research, Journal of Pediatric Dentistry, Journal of Dentistry, Journal of Oral Health and Preventive Dentistry, and Journal of Dentistry for Children to April 1st, 2018. The following search terms were used: "color" or "colour" or "hardness" or "texture" or "consistency" and "dentin caries" or "dentine caries" and "microbiology" or "microflora" or "bacteria".
Eligibility criteria. Search strategy and literature search findings were reviewed by two authors (Hon and Mohamed) to determine whether the identified studies met the inclusion criteria. The inclusion criteria were as follows: (1) randomized controlled trial (RCT) or diagnostic test studies; (2) colour and hardness scored clinically; (3) CFU reported; (4) dentine caries; (5) open primary dentine caries in deciduous or permanent teeth; (6) articles published in English; and (7) human studies in vivo. Animal studies, in vitro studies, reviews, comments, editorials, and non-English studies were excluded. Additionally, research that included irreversible pulpitis, secondary caries, hidden caries, or caries that could only be seen radiographically was excluded. Disagreements between the two authors were resolved by an independent reviewer (Lynch). For detailed information, please refer to Fig. 1.
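The search terms listed in the search strategy fall into three concept groups. The following minimal Python sketch shows one way they could be assembled into a single Boolean query. The grouping (OR within a group, AND between groups) is an assumption, because the text lists the terms without explicit precedence, and `build_query` is a hypothetical helper rather than part of any database interface.

```python
# Concept groups copied from the search terms listed in the Methods.
CONCEPT_GROUPS = [
    ["color", "colour", "hardness", "texture", "consistency"],
    ["dentin caries", "dentine caries"],
    ["microbiology", "microflora", "bacteria"],
]


def build_query(groups):
    """Join terms with OR inside a group and AND between groups.

    The OR/AND grouping is an assumed reading of the published term list.
    """
    clauses = []
    for terms in groups:
        quoted = " OR ".join(f'"{term}"' for term in terms)
        clauses.append(f"({quoted})")
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(build_query(CONCEPT_GROUPS))
```

Each database has its own field tags and syntax, so a string built this way would still need adapting before use in PubMed, Ovid, or Web of Science.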
Data extraction and quality assessment. Two authors (Hon and Mohamed) extracted data using a pre-set data extraction sheet. The data extraction sheet included the following information: author, title, population, intervention, comparison, sample size, confidence level, results and outcomes. For the data extraction details, please refer to Table 1. The quality of evidence in the included studies was assessed by the Critical Appraisal Skills Programme (CASP) tools 22,23. The included studies were randomized controlled trials and diagnostic test studies. Thus, the respective CASP tools were used to assess each study type. Each article was independently assessed by two authors (Hon and Mohamed), while disagreements were resolved by an independent reviewer (Lynch). Inter-examiner variability was measured by calculating the percentage of agreement (%) between the two examiners for each CASP checklist criterion. Percentage agreement was used because of the low number of included studies (diagnostic studies n = 2 and RCTs n = 5) 24-27. For detailed information, please refer to Fig. 2.
Statistical analysis. Clinical recordings using colour and hardness of carious dentine have been proposed as indicators of lesion severity. Microbiological biopsies of carious samples have been used to enumerate the numbers of microorganisms expressed as CFUs. The primary outcome measures analyzed were total microbial load in CFU in each category of colour (yellow, light brown, dark brown, black) and of hardness (soft, medium hard or leathery, hard). Differences (d_colour) in means of total microbial load between categories of the colour scale were taken as a measure of the "discriminant power" of this test. Three differences between adjacent categories were computed as: yellow vs. rest, yellow/light brown vs. dark brown/black, black vs. rest.
Differences (d_hardness) in means of total microbial load between categories of the hardness scale were also taken as a measure of the "discriminant power" of this test: hard vs. rest, soft vs. rest. If the difference obtained from the colour test is significantly higher than that from the hardness test, the colour test is more discriminating than the hardness test. Conversely, if the difference obtained from the colour test is significantly lower than that from the hardness test, the colour test is less discriminating than the hardness test. Therefore, mean differences between all colour and hardness category differences were computed and the weighted mean difference (WMD) was the global effect measure in a random-effects model.
www.nature.com/scientificreports
In addition, the effect size of the differences between colour (or hardness) categories regarding mean CFU was also calculated. The effect size index for a conventional one-way ANOVA is $f = \sqrt{\eta^2 / (1 - \eta^2)}$, whereby $\eta^2$ is the ratio of the between-groups variance to the total variance. The larger the effect size f is, the larger the difference between the mean total microbial load of the categories, meaning there is a higher power to discriminate. As a general convention, small f = 0.10, medium f = 0.25, and large f = 0.40. An overall effect size was estimated by weighting the individual values by the sample size of each study. The level of significance used in the analysis was 5% (α = 0.05). The software used to perform this meta-analysis was R 3.0.2 and its 'metafor' package. The software G*Power 3.1.3 was used to estimate effect sizes.
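To make the effect size definition concrete, the sketch below computes f from per-category summary statistics (mean, sd, n) and then forms a sample-size-weighted overall value. It is an illustration in Python, not the authors' G*Power procedure. Estimating the variance ratio from sums of squares is an assumption, as are the function names and the "negligible" label for values below 0.10.

```python
from math import sqrt


def cohens_f(means, sds, ns):
    """Effect size f for a one-way ANOVA from per-category summary statistics.

    eta_sq is estimated as SS_between / (SS_between + SS_within); this
    sample-based estimate is an assumption of the sketch.
    """
    total_n = sum(ns)
    grand_mean = sum(n * m for n, m in zip(ns, means)) / total_n
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
    ss_within = sum((n - 1) * sd ** 2 for n, sd in zip(ns, sds))
    eta_sq = ss_between / (ss_between + ss_within)
    return sqrt(eta_sq / (1 - eta_sq))


def label_effect(f):
    """Label f using the conventional 0.10 / 0.25 / 0.40 thresholds."""
    if f >= 0.40:
        return "large"
    if f >= 0.25:
        return "medium"
    if f >= 0.10:
        return "small"
    return "negligible"  # label below 0.10 is our own, not from the text


def weighted_overall_effect(effects, sample_sizes):
    """Overall effect size: mean of per-study f values weighted by sample size."""
    return sum(f * n for f, n in zip(effects, sample_sizes)) / sum(sample_sizes)
```

Because f is a standardized ratio, studies that report microbial load in different units can still contribute to the weighted overall value.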

Results
Study characteristics and quality assessment. Fig. 1 shows the detailed steps used for the literature search. Of the 64 potentially relevant articles, 26 articles were eligible for full text screening. However, 19 articles were further excluded because they did not meet the inclusion criteria. Please refer to Table 2 for detailed information on the excluded papers and the rationale for their exclusion. Therefore, only seven articles 10,14,19,28-31 were included for full text quantitative and qualitative assessment in this systematic review. These seven articles were published between 1950 and 2018. Among these articles, five were randomized controlled trials 19,28-31 and two were diagnostic test studies 10,14.
A-Randomized controlled trials. Five randomized controlled trial articles were included 19,28-31. One article disclosed the gender of the participants: 51 females and 60 males 28. Three articles detailed the age of participants: 4-15 years old 28, 12-23 years old 29, and 5-8 years old 31; therefore, the age range of the participants across these three studies was 4-23 years old. Three articles specified the type of teeth that received intervention and comparator treatments: 94 mandibular second primary molars and 60 mandibular first permanent molars 28 and primary molars 30,31. Only one article provided the number of subjects recruited 28, whilst the other studies only reported the number of teeth investigated 29-31. Only Lula et al. carried out a sample size calculation, based on a pilot study in which 16 teeth provided 80% power at a 5% significance level. It was noticeable that the article by Bjorndal et al. did not give details as to the gender, age, or type of teeth used (incisor, canine, premolars, molars, primary dentition, and permanent dentition).
Regarding the intervention and comparison groups, the colour and hardness of lesions were related to the total CFU each contained. Two articles scored colour and hardness of lesions whilst evaluating different levels of carious dentine prior to final restoration in a stepwise approach 19,28, which uses incomplete removal of dentine caries to try to prevent pulpal exposure. Three articles related total CFU in carious dentine with colour and hardness at different time points, at intervals of 6-7 months 29, 4-6 months 30, and 3-6 months 31. Refer to Table 1 for further details on study characteristics. The CASP quality assessment for these included randomized controlled trials is presented in Fig. 2. According to the CASP analysis, the study of Lula et al. produced the highest quality of evidence and therefore its findings carried more weight. The inter-examiner agreement for quality assessment was 72.7% for the five RCTs, signifying high reproducibility between the two examiners.

B-Diagnostic test studies.
Two articles were diagnostic test studies 10,14. Both articles examined the colour and hardness of carious lesions and compared these to their respective total CFU counts 10,14. Both articles reported the number of participants and their gender: 45 females and 72 males 14, and 25 females and 34 males 10. Therefore, the total number of participants was 70 females and 106 males. Both articles examined the same age range: 29-80 years old 10,14. Both articles 10,14 reported the number of teeth included in each study, but neither detailed the type of teeth that had been examined (incisor, canine, premolars, molars, primary dentition, permanent dentition). Neither article reported any sample size calculation 10,14. For further details on study characteristics, please refer to Table 1. According to the CASP quality assessment, both articles produced equally high-quality evidence. For the detailed CASP analysis of these articles, please refer to Fig. 2. The inter-examiner agreement for quality assessment was 83.3% for the diagnostic test studies, signifying high reproducibility between the two examiners.
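The inter-examiner figures reported for the RCTs and the diagnostic test studies are simple percentages of agreement. A minimal Python sketch of that calculation follows. The function name and the rating labels ("yes", "partly", "no") are hypothetical stand-ins for the checklist outcomes, and the ratings in the usage example are invented for illustration, not taken from the review.

```python
def percent_agreement(ratings_a, ratings_b):
    """Percentage of checklist items on which two examiners gave the same rating."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both examiners must rate the same non-empty item list")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)


if __name__ == "__main__":
    # Hypothetical ratings for four checklist criteria.
    examiner_1 = ["yes", "no", "partly", "yes"]
    examiner_2 = ["yes", "no", "yes", "yes"]
    print(percent_agreement(examiner_1, examiner_2))
```

Percentage agreement does not correct for chance agreement, which is one reason it is usually reserved for small samples like this one.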

Comparison of CFUs Between Colour and Hardness Examination Methods
A-Randomized controlled trials. One article compared CFUs in one-visit indirect pulp treatment (IPT) with two-visit IPT and direct complete excavation (DCE) 28. Four articles were paired study designs that compared samples from different time points 19,29-31. These four articles measured the CFUs taken for colour or hardness categories. The total CFU counts were only measured and compared within the categories of caries colour or categories of caries hardness independently 19,28-30. Only one article related both colour and hardness to total CFU. This study also compared the CFU counts within each caries colour category (i.e. yellow, light brown, dark brown) and within caries hardness categories (i.e. soft, medium hard, hard) 31.

Figure 2. Qualitative analysis with CASP tools for randomized controlled trials. Summary review of the qualitative assessment of the included studies by using CASP tools for randomized controlled trials (RCTs) consisting of 11 quality criteria (a) and for diagnostic studies consisting of 12 quality criteria (b). Green-coded circle indicates that the study satisfactorily met the respective quality criterion, yellow-coded circle indicates that the study partially met the respective quality criterion, and the red-coded circle indicates that the study did not meet the respective quality criterion.

Table 2. Excluded papers and the reasons for their exclusion.
Study                   Reason for exclusion
Kidd et al. 34          Colour not recorded
Weerheijm et al. 35     Hidden caries, not gross occlusal caries
Kidd et al. 36          Includes secondary caries
Ayna et al. 20          No results correlating hardness and CFU
Loesche et al. 37       Incipient caries
Manji et al. 38         No CFU
Bönecker et al. 39      No CFU
Iwami et al. 40         No CFU
Fusayama et al. 41      No CFU
Iwami et al. 42         No CFU
Torii et al. 43         No CFU
Iwami et al. 44         No CFU
Nyvad et al. 45         No CFU
Milnes et al. 46        No CFU
Nyvad et al. 47         No CFU
Fejerskov et al. 48     Review article
Kidd et al. 49          Review article
Takahashi et al. 50     Review article
Kidd et al. 51          Secondary caries

Findings of statistical significance are summarized in the study characteristics table (Table 1). Two articles 28,31 found no statistically significant difference in total CFU, S. mutans, or Lactobacillus spp. recovered from different colour categories, but harder lesions contained fewer total CFU than softer lesions 28,31. One article did not carry out any statistical analysis of the associations between microflora and colour and/or hardness categories because the study did not include a sufficient number of samples 30.

B-Diagnostic test studies.
Both articles assessed the relationships between total CFU and colour and hardness 10,14. Across all colour categories, total CFU was highest in soft lesions, followed by leathery lesions, which in turn contained more total CFU than hard lesions 10,14, and black soft lesions contained more lactobacilli than black leathery lesions 10. For further details, please refer to Table 1.
Statistical analysis for pooled CFU outcomes. Some of the included papers had to be excluded from the meta-analysis because they did not report a mean load for each category 28, because microbial levels were only measured by turbidity methods 28,30, or because data were only shown as percentages of specific species instead of the total CFU 14. Table 3 shows the initial summary data in the included studies as means (sd). When lesions were described as leathery, these were classified as medium for this analysis.
In the comparisons below, B, DB, LB, and Y denote the black, dark brown, light brown, and yellow colour categories, and S, M, and H denote the soft, medium, and hard categories. When comparing B-DB/LB/Y vs. S-M/H, the only study that allowed this comparison was that of Beighton et al. There were more CFU (2.32 log10 CFU units on average) in black samples than in any of the other colour groups. In addition, there were on average 3.72 more log10 CFU units in soft samples than in leathery samples, and these in turn had significantly more CFU than hard lesions. The differences in CFU between the hardness categories were considerably larger than any differences in CFU between the colour categories. Therefore, the hardness test was more discriminant, but no meta-analysis was conducted because only one study was involved in this comparison, as shown in Fig. 3.
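Differences expressed in log10 CFU units can be hard to read at a glance. The small Python sketch below converts such a difference into an approximate multiplicative ratio; the function name is hypothetical. Because the underlying means are means of log-transformed counts, and some studies used log10(CFU + 1), the result is best read as an approximate ratio of geometric means rather than an exact fold change.

```python
def log10_difference_to_ratio(delta_log10):
    """Convert a difference in mean log10 CFU into an approximate ratio."""
    return 10 ** delta_log10


if __name__ == "__main__":
    # Differences reported for black vs. other colours and soft vs. leathery.
    for delta in (2.32, 3.72):
        print(delta, round(log10_difference_to_ratio(delta)))
```

On this reading, the colour difference of 2.32 corresponds to roughly a 200-fold ratio, and the hardness difference of 3.72 to roughly a 5,000-fold ratio.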
When comparing B-DB/LB/Y vs. S/M-H, Beighton et al. and Bjorndal et al. provided data on CFU differences between these categories, so a meta-analysis was performed, and a forest plot is shown in Fig. 3. The WMD value for the difference was −0.067, with a 95% confidence interval for the population effect of −1.88 to 1.74. Because the interval includes zero, no significant difference was reached (p = 0.942). Therefore, neither test was more discriminant than the other.
When comparing B/DB-LB/Y vs. S-M/H, the WMD value for the difference was −3.67 (p < 0.001), indicating that the hardness test was more discriminant than colour. The forest plot of this analysis is shown in Fig. 4.
When comparing B/DB-LB/Y vs. S/M-H, the results are similar to the previous comparison. The WMD value for the difference was −3.37 (p < 0.001), suggesting that the hardness test was again more discriminant than colour. The forest plot of this analysis is presented in Fig. 4.
When comparing B/DB/LB-Y vs. S-M/H, the WMD value for the difference was −4.74 (p < 0.001), supporting that the hardness test was again more discriminant than colour. The forest plot of this analysis is shown in Fig. 5.
When comparing B/DB/LB-Y vs. S/M-H, the WMD value for the difference was −4.45 (p < 0.001), also indicating that the hardness test was more discriminant than colour. The forest plot of this analysis is shown in Fig. 5. The effect sizes (f) of the multiple comparisons are summarized in Table 4.
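The pooled WMD values, confidence intervals, and p-values above come from a random-effects model. As an illustration of how such pooling works, the Python sketch below uses the DerSimonian-Laird estimator of between-study variance with a normal approximation. This choice is an assumption of the sketch: the text does not state which estimator was used in 'metafor', so the numbers it produces need not match the published ones. The function names are hypothetical, and treating the two groups in `mean_difference` as independent is a further simplification.

```python
from math import erfc, sqrt


def mean_difference(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Difference in means (a minus b) and its sampling variance.

    Assumes the two groups are independent samples.
    """
    diff = mean_a - mean_b
    variance = sd_a ** 2 / n_a + sd_b ** 2 / n_b
    return diff, variance


def pool_random_effects(effects, variances):
    """Pool study-level effects with a DerSimonian-Laird random-effects model."""
    k = len(effects)
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add the between-study variance tau2.
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    se = sqrt(1.0 / sum(re_weights))
    z = pooled / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided p from the normal distribution
    return {
        "wmd": pooled,
        "ci95": (pooled - 1.96 * se, pooled + 1.96 * se),
        "p": p_value,
        "tau2": tau2,
    }
```

A confidence interval that spans zero, as in the B-DB/LB/Y vs. S/M-H comparison, corresponds to a p-value above 0.05 in this framework.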

Discussion
Dental caries involves microorganisms excreting acid, resulting in changes in dental tissue hardness and colour. However, the accuracy and reliability of dentine colour and hardness as indicators of carious lesion severity have never been assessed in a systematic review. Thus, the aim of this study was to systematically analyze published research investigating whether colour or hardness of dentine caries was the more accurate, reliable, and valid method of detecting carious lesion severity when related to the amount of detectable CFU in biopsies of these lesion categories.
Seven papers met the inclusion criteria for this systematic review after searching multiple electronic databases and hand searching multiple journals: five articles were RCTs and two were diagnostic test studies. The five included RCTs were not primarily conducted to investigate the relationship between colour and hardness and the numbers of microorganisms. Rather, they were trials investigating certain clinical interventions (e.g. one-visit IPT, two-visit IPT, DCE, step-wise excavation, and atraumatic restorative treatment). For the purpose of this systematic review, only the RCT data related to the CFU counts obtained from carious biopsies following the colour and hardness measurements were reported.
The quality of each study was critically appraised using CASP tools. RCTs and diagnostic test studies have separate CASP tools to systematically examine and appraise the evidence in each article. Using the CASP tools, it was noticeable that many articles lacked details of patients' age, gender, type of teeth, socioeconomic status, ethnicity, diet, or sample size calculations 19,30,31. This calls into question the quality of evidence in the included papers. In fact, randomization protocols 10,14,19,29-31, sample size calculations 10,14,19,28-30, and confidence limits 10,14,19,28-31 were not reported in these studies.
From the meta-analysis, only 4 articles were considered, as these provided CFU data for different categories of colour and hardness. Differences in log10 CFU between categories of hardness were significantly greater than between categories of colour. However, CFU results were also not consistently reported. Results from the meta-analysis mostly inherited the properties and results of Beighton et al. As for the weighted effect size calculations, the results also showed a large advantage for hardness examination compared to colour examination. However, unlike the pooled meta-analysis calculations, the study by Bönecker et al. was added to the weighted total effect size calculations. This was possible because effect size can be calculated regardless of the differences in the reported units of CFUs, as effect size calculations measure the ratio of variability between groups to variability within groups 32. Overall, the meta-analyses and effect size calculations in this study indicate that the texture (or hardness) tactile clinical examination has more discriminatory power than the colour visual clinical examination.
Therefore, hardness or texture categories are more reliable in reflecting the amount of isolated cariogenic microorganisms and thereby more reliable in detecting the severity of carious lesions. Thus, based on the findings of this study, the hardness tactile clinical examination is a more specific and sensitive clinical examination method, and hardness categories would be more reliable indicators for treatment planning: soft carious lesions harbour more cariogenic microorganisms than hard carious lesions and may therefore require a more invasive dental clinical intervention. These findings are consistent with previously reported data in which hard carious lesions were found to harbour no cariogenic bacteria (neither streptococci nor lactobacilli) and therefore required no dental clinical intervention. On the other hand, a large proportion of soft carious lesions contained these species (63.6% and 48.4%, respectively) and required both caries debridement and restoration 10.
The findings of this systematic review can be viewed as additional evidence in support of the International Caries Detection and Assessment System (ICDAS), which is currently considered the recommended dental caries examination scoring index in dental practice 33. In this study, hardness clinical examination was found to be more specific and reliable than colour examination. Thus, changes in dental tissue hardness should be given more weight in dental caries classification indices. This matches the current setup of the ICDAS, which emphasizes the importance of hardness clinical examination by allocating more scores to changes in hardness 33.
The findings of this systematic review must be interpreted with caution as there are a number of limitations. Firstly, it was not possible to pool the data and findings of all 7 included studies because the authors recorded tooth colour, hardness, and CFU in inconsistent and heterogeneous ways. For instance, the articles reported various methodologies for detecting and analyzing microbial CFU. The authors also reported CFU data in dissimilar ways: some studies presented "total CFU counts" 19,28, while other studies used "turbidity tests" 28,30. Other articles normalized the CFU data by using log10(CFU/mg) 31 or log10(CFU + 1) 10,14,29. Secondly, all data tables presented by the diagnostic test studies 10,14 showed CFU percentages of different bacterial species instead of the actual counts. This made it difficult to group and compare data and findings across the studies, prevented a meta-analysis of all included studies, and hampered our ability to gain an overall systematic understanding of the data. Finally, most of the included studies were RCTs, a design that is more vulnerable to sampling bias without accurate sample size calculations. It was noticeable in this systematic review that a sample size calculation was only reported in one study 31.
In summary, this study presents systematically reviewed evidence in support of hardness clinical examination being a more reliable and specific dental caries detection method. Therefore, it is recommended to consider the use of tactile examination over visual inspection for caries detection during routine and treatment dental visits. However, due to the limitations in this study, further research is needed to consolidate the findings in this systematic review.
In conclusion, the microbial differences between hardness types showed a weighted total effect size with a large advantage for hardness compared to colour, and colour alone also yielded inconsistent microbial results. Hardness is more reliable than colour for detecting dentine caries severity.

Table 4. Summary of the effect size (f) of the multiple comparisons of bacterial load means in colour and hardness examinations. Effect size (f) for a one-way analysis of variance to estimate differences in total bacterial load between categories of the colour and hardness tests. Total effect size is the mean across the authors' studies, weighted by sample size. It is possible to estimate the effect size for Bönecker because the effect size is a standardized value that does not depend on the original units.