INTRODUCTION

Since orthodontics was first introduced into China in the 1950s, it has passed through an early stage (from the 1950s to the late 1970s) and a later stage of rapid development (from the 1980s to the present).1 With recent economic and social developments, increasing numbers of Chinese people are seeking orthodontic treatment for improved facial appearance and oral health. More than 300 000 Chinese patients now receive orthodontic treatment each year,2 and government spending on orthodontic education is increasing to meet this demand. At present, 37 universities conduct postgraduate orthodontic programs.3 More than 2 000 orthodontic specialists and thousands of general practitioners deliver orthodontic treatment around China. Hence, it is opportune to evaluate the effectiveness of orthodontic treatment provided by different orthodontic services.

Assessment of orthodontic treatment outcome is an important issue that has been discussed since the 1970s.4,5 An objective evaluation system is necessary to determine whether a treatment outcome is successful or unsuccessful. Several orthodontic treatment outcome indicators have been used over the past 10–20 years, including the Peer Assessment Rating (PAR)6,7 established in 1992 in Europe, the Objective Grading System8,9 drawn up in 1998 by the American Board of Orthodontics and the Index of Complexity, Outcome and Need (ICON)10,11,12 proposed by a collaboration of centers in Germany, Greece, Hungary, Italy, The Netherlands, Norway, Spain, the United Kingdom and the United States. As a general rule, objective measures are developed from prior subjective evaluations by a group of authorities whose judgment is regarded as the ‘gold standard’.13 Thus, the average perception of experienced orthodontists is the cornerstone of systems for the evaluation of orthodontic treatment outcome.

Study casts, cephalometric images and facial photographs are the three essential elements of the evaluation of orthodontic treatment outcome. However, most research has focused mainly on study casts or dental esthetics.5,6,7,8,9,10,11,12,13,14 Owing to the difficulty of measuring and integrating data from cephalometric images and facial photographs, these information sources have often been ignored. In the present study, to achieve a comprehensive evaluation, study casts, lateral cephalograms and facial photographs were all included in the assessments.

The aims of this study were to assess the agreement among experienced Chinese orthodontists when evaluating treatment outcome in Chinese patients, to quantify the contribution of each of the three information sources to the combined evaluation and to determine the correlations among the three information sources in the combined assessments.

MATERIALS AND METHODS

Six orthodontic treatment centers in different parts of China collaborated in this study: the Peking University School of Stomatology, the West China School of Stomatology at Sichuan University, the School of Stomatology at the Fourth Military Medical University, the Beijing Stomatological Hospital and School of Stomatology at Capital Medical University, the Stomatological Hospital and College of Nanjing Medical University and the Hospital of Stomatology at Wuhan University. Each center compiled a list of no fewer than 300 patients with full records whose orthodontic treatment had been completed between 2005 and 2008. The initial lists from the six centers, which totaled 2 383 patients, were forwarded to the Peking University School of Stomatology for further processing, including randomization. A stratified random sample of 108 patients was drawn from the larger sample, balanced to contain 18 cases from each collaborating center and equal numbers of Angle Class I, Class II and Class III malocclusions. This sample was further randomized with adaptive allocation to produce nine groups containing 12 patients each. Each group contained four Angle Class I, four Angle Class II and four Angle Class III treated patients. Seventy-two patients were aged less than 18 years; the remaining 36 were adults aged 18 years or older. There were 30 males and 78 females. Signed informed consent from the patients and institutional permission from all six participating orthodontic treatment centers were obtained for the study.
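The sampling design described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patient pool is synthetic, the field names `center` and `angle_class` and both function names are hypothetical, and plain shuffling stands in for the adaptive allocation used in the study, whose details are not given here.

```python
import random
from collections import defaultdict

CENTERS = [f"center_{i}" for i in range(1, 7)]   # six collaborating centers
ANGLE_CLASSES = ["I", "II", "III"]

def stratified_sample(patients, per_stratum=6, seed=0):
    """Draw `per_stratum` patients from every (center, Angle class) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for p in patients:
        strata[(p["center"], p["angle_class"])].append(p)
    sample = []
    for key in sorted(strata):
        sample.extend(rng.sample(strata[key], per_stratum))
    return sample

def make_groups(sample, n_groups=9, per_class=4, seed=0):
    """Split the sample into groups holding `per_class` cases of each Angle class.

    Simplification: cases are shuffled within each class and dealt out in
    blocks; this does not reproduce the study's adaptive allocation.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for p in sample:
        by_class[p["angle_class"]].append(p)
    groups = [[] for _ in range(n_groups)]
    for cls in sorted(by_class):
        cases = by_class[cls][:]
        rng.shuffle(cases)
        for i, case in enumerate(cases):
            groups[i // per_class].append(case)
    return groups

# Synthetic pool: 20 hypothetical patients per (center, class) stratum.
pool = [
    {"id": f"{c}-{a}-{k}", "center": c, "angle_class": a}
    for c in CENTERS for a in ANGLE_CLASSES for k in range(20)
]
sample = stratified_sample(pool)   # 6 centers x 3 classes x 6 = 108 patients
groups = make_groups(sample)       # 9 groups of 12 (4 per Angle class)
```

Drawing six patients per stratum is what makes the 18-per-center and 36-per-class constraints hold at the same time.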

For each patient, the post-treatment information sources evaluated in this study included (i) standardized plaster study casts (SC); (ii) a lateral cephalometric X-ray image (LX); and (iii) facial photographs (PH) that included profile, frontal and frontal smiling views (Figure 1). These information sources were evaluated both singly and in combination by a panel of 69 judges assembled for the purpose at Peking University School of Stomatology for a dedicated period of 3 days. The judges were experienced orthodontic specialists recommended by the six participating treatment centers so as to represent different districts of mainland China. The inclusion criteria for judges were that each had:

  i) an MS or PhD degree in orthodontics or experience as a research supervisor of orthodontic postgraduates;

  ii) at least 10 years of clinical experience in orthodontics;

  iii) the academic rank of Associate Professor or above.

Figure 1. Representative case records of one patient. Each patient’s records contained study casts, a lateral cephalometric X-ray image and a triplet of facial photographs.

Thirty-eight judges were males and 31 were females.

The three information sources were evaluated individually and in the following combinations: study casts and lateral cephalogram (SC+LX); study casts and facial photographs (SC+PH); study casts, cephalogram and facial photographs combined (SC+LX+PH). Judges were invited to examine the upper and lower study casts in occlusion and separately, manipulating them by hand. They made the assessment for ‘facial attractiveness’ from the end-of-treatment lateral cephalometric X-ray images and facial photographs. They were not instructed which component should be prioritized, but made the judgments based on their own clinical experience.

For each group of records, two separate assessments were made. In the first assessment (ranking), each judge ordered the 12 records in each group from 1 (most favorable) to 12 (least favorable) with respect to treatment outcome. In the second assessment (grading), the judge divided the group of 12 cases into three categories: satisfactory, acceptable and unsatisfactory. This procedure served as a protection against chance concentration of more acceptable or less acceptable cases into any individual group of 12 cases. The satisfactory cases were assigned a value of one point, the acceptable cases were assigned two points and the unsatisfactory cases were assigned three points to obtain an average value for each patient. The lower the ordinal rating score, the better the treatment outcome.
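The point scheme for grading can be illustrated with a short sketch. The function name and the four example grades are hypothetical; only the 1/2/3 point mapping and the averaging across judges come from the description above.

```python
# Point values for the three grading categories, as described in the text.
GRADE_POINTS = {"satisfactory": 1, "acceptable": 2, "unsatisfactory": 3}

def mean_grading_score(grades):
    """Average the points that a set of judges assigned to one patient."""
    points = [GRADE_POINTS[g] for g in grades]
    return sum(points) / len(points)

# Hypothetical grades from four judges for a single patient.
example_grades = ["satisfactory", "acceptable", "acceptable", "unsatisfactory"]
example_score = mean_grading_score(example_grades)  # (1 + 2 + 2 + 3) / 4 = 2.0
```

A lower average indicates a more favorable outcome, so a patient graded satisfactory by every judge would score exactly 1.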

Two to four months before commencing the main study, over a dedicated 2-day period and using the proposed criteria, each judge evaluated four groups of cases treated locally.15,16

Statistical analysis

All statistical analyses were performed using Statistical Product and Service Solutions (SPSS) software (V20.0; IBM SPSS Statistics, IBM, Armonk, NY, USA). Spearman rank-order correlation coefficients (rs) were computed to quantify agreement in ranking between all judge pairs. Kendall coefficients of concordance (W) were calculated to determine the overall agreement in rankings and gradings among all judges. Stepwise linear regression and Pearson product–moment correlation coefficient (r) analyses were conducted to assess the contribution of single information source assessments to the combined assessment and to determine correlations among the three combined assessments. Graphs were generated using MATLAB (R2011b; MathWorks, Natick, MA, USA), Visio (Microsoft Visio Premium 2010; Microsoft, Redmond, WA, USA) and SPSS software. One judge’s data for the SC group and two judges’ data for the SC+LX group were not obtained. The level of statistical significance was set at α=0.05.
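For readers unfamiliar with the two agreement statistics, the following sketch computes them from their textbook formulas for rankings without ties. It is not the SPSS procedure used in the study; the function names and the toy rankings are hypothetical, and the tie-free formulas would need a tie correction before being applied to the three-category gradings.

```python
def spearman_rs(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same cases."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall_w(rankings):
    """Kendall coefficient of concordance for m tie-free rankings of n cases."""
    m, n = len(rankings), len(rankings[0])
    totals = [sum(r[i] for r in rankings) for i in range(n)]  # rank sum per case
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Toy example: three hypothetical judges ranking four cases.
judge_1 = [1, 2, 3, 4]
judge_2 = [1, 2, 3, 4]
judge_3 = [4, 3, 2, 1]

rs_same = spearman_rs(judge_1, judge_2)       # identical rankings
rs_opposite = spearman_rs(judge_1, judge_3)   # fully reversed rankings
w_all = kendall_w([judge_1, judge_2, judge_3])
w_agree = kendall_w([judge_1, judge_2])
```

Spearman rs describes one pair of judges at a time, whereas Kendall W summarizes the whole panel in a single value between 0 and 1, which is why the study reports both.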

RESULTS

Our first task was to determine agreement between and among the subjective evaluations using the Spearman correlation coefficient for pairs of ranking scores and the Kendall coefficient of concordance for group ranking and grading scores.

Pairing the 69 judges produced 2 346 (C(69, 2)) pairs for each comparison with a complete panel; fewer pairs were available for SC and SC+LX, for which the data of one and two judges, respectively, were not obtained. Overall, for all six comparisons, there were 13 873 Spearman rs values, ranging from as high as +0.852 to as low as –0.077. The maximum, minimum and mean values, and the upper, median and lower quartiles of the rs values for each ranking are displayed in Figure 2. In total, 103 pairs of rs failed to reach statistical significance at the 0.05 level, including 0 pairs for SC and SC+LX+PH, 75 pairs for LX, 13 pairs for PH, 7 pairs for SC+LX and 8 pairs for SC+PH (statistical significance at the 0.05 level required an rs value greater than 0.190). From Figure 2, it can be seen that the judges’ average agreement for ranking was relatively high for SC, relatively low for LX and PH and moderate for SC+LX, SC+PH and SC+LX+PH.
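The pair counts can be checked with a few lines of arithmetic. The sketch assumes that the missing data noted under Statistical analysis reduce the panel to 68 judges for SC and 67 for SC+LX; the variable names are illustrative only.

```python
from math import comb

# Judges with usable data in each comparison (assumption: one judge missing
# for SC and two for SC+LX, as stated in the statistical analysis section).
judges_per_comparison = {
    "SC": 68, "LX": 69, "PH": 69,
    "SC+LX": 67, "SC+PH": 69, "SC+LX+PH": 69,
}

# Number of distinct judge pairs in each comparison: C(k, 2) = k(k - 1) / 2.
pairs = {name: comb(k, 2) for name, k in judges_per_comparison.items()}
total_pairs = sum(pairs.values())

# Non-significant pairs reported for each comparison.
non_significant = {"SC": 0, "LX": 75, "PH": 13, "SC+LX": 7, "SC+PH": 8, "SC+LX+PH": 0}
total_non_significant = sum(non_significant.values())
```

Under this assumption the counts add up to the 13 873 rs values and 103 non-significant pairs reported in the text.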

Figure 2. Box and whisker plots showing the judges’ agreements for the ranking in each test group (n=108). LX, lateral cephalometric X-ray image; PH, facial photograph; SC, study cast.

The percentages of highly correlated (rs>0.7), moderately correlated (0.4≤rs<0.7), slightly correlated (0≤rs<0.4) and negatively correlated (rs<0) pairs are listed for each group in Table 1. Of the 13 873 Spearman rs values, 10.92% were above 0.70, 79.3% were between 0.40 and 0.70, 9.77% were between 0 and 0.40 and 0.04% were below 0, showing that agreement between individual pairs of judges was highly variable. Agreement also varied among the six test groups.

Table 1 Distribution of Spearman rs values for 108 cases (%)

All Kendall W values were statistically significant (P<0.05), indicating that the overall agreement between the judges was good in both ranking and grading (Table 2).

Table 2 Kendall W values of ranking and grading (P<0.05, n=108)

Our second task was to assess the contribution of the evaluation of single information source assessments to the combined assessment. First, it was important to determine the correlation between each single information source assessment and the combined effect of all three sources, as illustrated in Figure 3. Here, the average grading scores for each single information source assessment are seen to be correlated with the average grading scores for SC+LX+PH for the same case as assessed by the same judge in each plot.

Figure 3. Scatter plots comparing associations between SC, LX, PH and SC+LX+PH. For each scatter plot, n=108 and P<0.05. LX, lateral cephalometric X-ray image; PH, facial photograph; SC, study cast.

The relationships between the average grading scores for SC+LX+PH and those for each single information source assessment were further explored using Pearson r analyses (Table 3), which showed that the grading scores for SC were very strongly correlated with those for SC+LX+PH (r=0.93). However, the grading scores for lateral cephalometric films (r=0.53) and facial photographs (r=0.57) predicted only a small portion of the variability in those for SC+LX+PH. Correlations were low between SC and LX (r=0.34) and between SC and PH (r=0.30), and moderate between LX and PH (r=0.52).

Table 3 Pearson correlation (r) and R2 value (n=108)

The results of the linear stepwise regression analysis for the grading scores are presented in Table 4. SC entered into the analysis first, accounting for an R2 value of 0.86. PH entered next, adding 0.09, and LX entered last, adding a small but statistically significant increment of 0.01, to obtain a total R2 of 0.96. From these results, it is reasonable to assume that 96% of the variability in the average grading scores for SC+LX+PH can be accounted for by the scores for SC, LX and PH. The fitted regression equation was:

[SC+LX+PH] = –0.29 + 0.76×[SC] + 0.24×[PH] + 0.13×[LX] + error
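As a worked illustration, the reported equation and the stepwise R2 increments can be written out in Python. The coefficients and increments are those quoted above; the function name and the example input scores are hypothetical, and the error term is omitted, so the result is a point prediction only.

```python
def predict_combined(sc, ph, lx):
    """Predicted average SC+LX+PH grading score from the reported equation."""
    return -0.29 + 0.76 * sc + 0.24 * ph + 0.13 * lx

# Hypothetical case with an average grading score of 2 (acceptable) on every source.
example_prediction = predict_combined(sc=2.0, ph=2.0, lx=2.0)

# Cumulative R2 as each predictor enters the stepwise model (values from the text).
increments = [("SC", 0.86), ("PH", 0.09), ("LX", 0.01)]
cumulative_r2 = []
running = 0.0
for name, delta in increments:
    running += delta
    cumulative_r2.append((name, round(running, 2)))
total_r2 = cumulative_r2[-1][1]
```

The SC coefficient is roughly three times that of PH and six times that of LX, which matches the order in which the predictors entered the model.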

Table 4 Stepwise regression analysis (dependent variable, SC+LX+PH; independent variables, SC, LX, PH; n=108)

Our third task was to determine correlations among the three information sources in the combined evaluations. As shown in Table 5 and illustrated in Figure 4, in each respective scatter plot, the average grading scores demonstrated extremely high correlations for SC+LX vs. SC+LX+PH (r=0.96), SC+PH vs. SC+LX+PH (r=0.97) and SC+LX vs. SC+PH (r=0.95).

Table 5 Pearson correlation (r) and R2 values (n=108)

Figure 4. Scatter plots comparing associations among SC+LX, SC+PH and SC+LX+PH. For each scatter plot, n=108 and P<0.05. LX, lateral cephalometric X-ray image; PH, facial photograph; SC, study cast.

DISCUSSION

The Visual Analog Scale17 and Q-Sort assessment18 are the two methods most commonly used for subjective evaluation. The Visual Analog Scale was used in the development of PAR6,7 and ICON,10,11 with the subjective perspective described on several scales. Q-Sort assessment is used predominantly in psychology,19 though several studies of occlusal indices and smile esthetics have used Q-Sort to achieve a logical subjective evaluation.20,21 In a Q-Sort, 60 to 120 items are placed in order of representativeness, the most characteristic being given a high score and the least characteristic a low score.

When subjectively evaluating end-of-treatment cases, because the differences between patients tend to be small, judges are inclined to mark only approximately, using a moderate scale on which it is difficult to distinguish details. Moreover, it is impracticable to rank a large number of cases at one time. In this study, 12 cases were assessed in each information source, using study casts, cephalometric films and/or facial photographs. For ranking, the judges were asked to examine each patient’s case records carefully and to compare them one-by-one within each test group to produce a sequence of 12 cases. For grading, the judges were asked to divide the 12 ordered cases into three ordinal categories—satisfactory, acceptable and unsatisfactory—to allow comparisons among the groups. By using both ranking and grading assessments, detailed and reliable results were obtained.

Validity22 and reliability23 are two basic necessities for any evaluation system or index. In this study, we were concerned with the agreement among 69 experienced orthodontists of their subjective perceptions of orthodontic treatment outcome. Subjective perceptions can be averaged to obtain a ‘gold standard’ only when the reliability is good.

In previous studies on agreement among orthodontists, many measures of association and concordance have been used, including the intraclass correlation coefficient (ICC) statistics and kappa statistics, the Pearson and Spearman correlation tests, the Kendall correlation analysis and the root mean square error. Except for the root mean square error, the coefficient or score for each test is a value between 0 and +1.0 or −1.0 and +1.0, where values closer to +1.0 represent higher reliability. The root mean square error is an estimate of the standard deviation representing the measurement error of a single measurement.23,24

The ICC statistic employs repeated measures analysis of variance, use of which is appropriate only when the underlying model is assumed to be a one-way random effects model.25 ICCs are regularly used in the analysis of measurement data such as the objective scores of PAR,6 American Board of Orthodontics-Objective Grading System26 and ICON.13 Pinskaya et al.27 reported excellent measuring repeatability, with ICCs of 0.97 and 0.98 for intra-examiner reliability, and an ICC of 0.98 for inter-examiner reliability.

Kappa statistics28 are widely used to assess the reliability of categorical measurement; for example, the Dental Health Component of the Index of Orthodontic Treatment Need29 and judges’ subjective perception in the PAR and ICON.6,7,13 When developing the PAR index in 1992, Richmond et al.6 reported kappa scores of 0.39–0.87 for intra-judge reliability and 0.43–0.58 for inter-judge reliability. In a study validating the complexity and treatment outcome components of ICON, Nicholas et al.13 reported that the inter-rater kappa agreements were moderate (0.50) for complexity, slight (0.18) for outcome and poor (0.04) for degree of improvement.

The Pearson and Spearman correlation tests are used to assess reliability between pairs of judges ranking data that are distributed normally or non-normally, respectively.30,31,32,33,34 Xu et al.30 used the Pearson correlation test to assess agreement and disagreement among pairs of Chinese and US orthodontists. They found that the correlations ranged from +0.004 to +0.96 with a median of +0.54.

The Kendall correlation analysis evaluates agreement in a group of judges making ordered categorical measurements.35 When testing the agreement among six judges in an early study developing the occlusal index,36 the high Kendall coefficient obtained of 0.881 indicated that the judges were applying essentially the same criteria in their assessments.

The present study addressed inter-judge reliability related to subjective evaluation; that is, the extent to which ranking and grading evaluations were repeatable among the 69 judges. Two aspects of judge homogeneity were considered: agreement between judge pairs and agreement among the whole panel. Spearman analysis indicated that, for ranking, agreement between judge pairs was statistically significant for all but 103 of the 13 873 pairs, which was similar to our previous study.16 All Kendall coefficients of concordance were statistically significant, demonstrating high consistency among the panel of Chinese judges for both the ranking assessment and the grading assessment.

However, Spearman correlations varied among the six test groups. The mean Spearman correlation was the highest for SC compared with the other five comparisons, and most of the correlations for SC were above 0.4. Orthodontists aim to achieve as near perfect occlusion as possible, as defined in Andrews’ six keys to occlusion.37 Owing to their similar background knowledge and clinical experience, the 69 judges provided an almost standardized subjective assessment of the study casts, with high inter-judge reliability. The mean Spearman rs value was the lowest for LX, where 21.4% of correlations were below 0.4 and 75 pairs of Spearman rs failed to reach statistical significance at the 0.05 level. This considerable disagreement could be attributed to the diversity of cephalometric analytical methods and contrary opinions about orthodontic camouflage treatment. The mean Spearman rs values for SC+LX+PH, SC+LX and SC+PH were comparable (Figure 2 and Table 1). However, all pairs of Spearman rs values for SC+LX+PH were statistically significant at the 0.05 level, whereas seven pairs of those for SC+LX and eight pairs of those for SC+PH were statistically non-significant. The Spearman correlation results for PH in this study were similar to those of serial studies of facial attractiveness.30,31

Our study is the first attempt to use combinations of study casts, cephalometric films and facial photographs for the subjective evaluation of orthodontic treatment outcome. In the past, these different materials have usually been assessed separately. It was important to determine the relationship between the different sources and the contribution of single information source assessments to the combined evaluation. The Pearson correlation test and stepwise regression were conducted for this purpose.

Table 3 showed that, when assessed separately, the three types of end-of-treatment records were correlated with each other to some extent. The Pearson correlation between SC and LX was low, as was that between SC and PH (rSCvs.LX=0.34, rSCvs.PH=0.30, P<0.001). The correlation between LX and PH was moderate (rLXvs.PH=0.52, P<0.001), being similar to the finding of Oh et al.38 This higher, moderate association between the cephalometric films and facial photographs is probably related to their shared information about the facial profile.

Stepwise regression analysis illustrated how the three single information sources complemented one another when combined. Table 4 showed that the grading scores for SC predicted 86.3% of the variability in the grading scores for SC+LX+PH, with the scores for PH increasing this to 95.5% and those for LX to 96.5% (P<0.001). In the regression model, though SC entered into the regression equation first, PH and LX both added statistically significantly to the R2 value. Table 3 also showed that the grading scores for LX separately predicted 28% of the variability for SC+LX+PH (P<0.001), while the grading scores for PH separately predicted 32% of the variability for SC+LX+PH (P<0.001).

Regarding correlations between the three combined sources, Table 5 showed that the grading scores for SC+LX, SC+PH and SC+LX+PH were highly significantly correlated with each other (r(SC+LX)vs.(SC+PH)=0.95, r(SC+LX)vs.(SC+LX+PH)=0.96, r(SC+PH)vs.(SC+LX+PH)=0.97), which explains the similar mean agreement between the combined evaluations. R2 values indicated that the grading scores for SC+LX and SC+PH separately predicted 92% and 94%, respectively, of the variability of the grading scores for SC+LX+PH. In view of this, it is reasonable to conclude that the grading scores for SC+LX and SC+PH predicted the grading scores for SC+LX+PH extremely well.
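The percentages of variability quoted in this and the preceding paragraphs follow from squaring the corresponding Pearson r values, since for a single predictor R2 equals r squared. A minimal check, using only the r values reported in the text (the dictionary names are illustrative):

```python
# Pearson r between each assessment and SC+LX+PH, as reported in the text.
pearson_r_vs_combined = {
    "SC": 0.93,
    "LX": 0.53,
    "PH": 0.57,
    "SC+LX": 0.96,
    "SC+PH": 0.97,
}

# For a single predictor, the share of variability explained is r squared.
r_squared = {name: round(r * r, 2) for name, r in pearson_r_vs_combined.items()}
```

The squared values agree with the 86%, 28%, 32%, 92% and 94% figures given in the text, to two decimal places.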

Based on the data discussed above, it is highly desirable to include both lateral cephalometric films and facial photographs for a comprehensive evaluation of orthodontic treatment outcome, though study casts play the most significant role in the combined assessment.

CONCLUSIONS

A panel of 69 Chinese orthodontists demonstrated mainly moderate to high consistency for each group of end-of-treatment study casts, cephalometric films and facial photographs from 108 Chinese patients. In the combined group evaluations, study casts were the most significant predictive component, while lateral cephalometric films and facial photographs also contributed to a comprehensive assessment. Since the grading scores for SC+LX and SC+PH were highly significantly correlated with those for SC+LX+PH, either study casts plus lateral cephalometric films or study casts plus facial photographs is an excellent substitute for all three combined assessment methods.