Introduction

Class II skeletal malocclusion is a relatively common condition, characterized by a skeletal disharmony that usually results in a reduced chin projection and a convex facial profile1. These characteristics, in addition to other typical features of Class II malocclusions, like protruded upper incisors, a decreased mentolabial angle, a retruded lower lip, and a short chin–throat distance, could impair the subject’s facial attractiveness1,2,3.

Several studies have reported that an unbalanced facial appearance could negatively affect both self-esteem and social perception, with an impact on bullying and career opportunities4,5,6. In fact, patients with an attractive smile are perceived as more intelligent and easily employed compared to subjects with an unattractive dentition7,8, have higher chances of getting a better job position, and are perceived as having a better socioeconomic status6.

Consequently, facial attractiveness should be one of the treatment goals for Class II skeletal malocclusion, along with correct dental occlusion and a functional equilibrium. However, according to the current literature such aesthetic balance is not always achieved. While some authors reported facial improvement in young adults after treatment with the Herbst appliance9 and in children after functional treatment with a Twin-Block device10, others observed no perceived difference in facial attractiveness after treatment11.

Such results are not surprising, given that the literature does not even agree on the skeletal outcome of Class II functional appliance treatment. While some authors have reported a significant increase in mandibular length after treatment with functional appliances compared to controls12, a systematic review by Koretsi et al. indicated that the observed outcomes were mostly dentoalveolar rather than skeletal13. However, it should be noted that the latter review included many studies involving patients in a pre-pubertal skeletal growth stage, which is known to affect treatment efficacy14. A Cochrane systematic review concluded that, when comparing the outcomes of one-phase treatment (the use of a fixed appliance in a post-pubertal stage) with those of two-phase treatment (the use of functional appliances during the pubertal stage, followed by treatment with fixed appliances), the only advantage that could favour early treatment with a functional appliance is a reduction in the incidence of upper incisor trauma15.

Based on these considerations, two questions arise: is early treatment with a functional appliance able to improve the aesthetic profile of the patient? Is this improvement perceived by the patient and does it improve their social interactions? Different studies have tried to answer the first question, with contradictory results16,17,18. Similarly, many authors have tried to evaluate how the aesthetic profile is perceived by professionals and laypeople6,19,20,21,22,23,24, reaching different conclusions.

The aim of the present study was to involve a large number of observers with different expertise—laypeople, dental students, general dentists and orthodontists—to evaluate the aesthetic perception of pre- and post-treatment facial profiles of patients with skeletal Class II treated with a removable functional appliance, compared to untreated controls.

Materials and methods

The present protocol was designed following the recommendations of the Declaration of Helsinki from 1975 and subsequent revisions, and was approved by the Internal Review Board of the University of L’Aquila (protocol no 15352, approval no 02/21). All the participants gave their written informed consent to participate.

Selection of the sample of skeletal Class II subjects

The clinical records of patients treated at the Dental Clinic, University of L’Aquila, Italy, from January 2005 to March 2020, were screened in chronological order for the following inclusion criteria: skeletal Class II malocclusion with ANB > 4° (the angle between maxillary A point, the skeletal Nasion point, and the mandibular B point); Class II division 1 with a full-cusp molar Class II dental relationship; orthodontic treatment with a Sander bite-jumping appliance; treatment onset at a CS3 cervical vertebral maturation (CVM)25 stage; high-quality pre- and post-treatment radiographic documentation, with clearly distinguishable soft tissues; and a treatment outcome considered successful because a molar Class I relationship was achieved. The screening ceased when the first 20 cases were selected. The demographic data (age and gender) and the pre-treatment (T0) and post-treatment (T1) lateral cephalograms were extracted from the records, anonymized and stored for further analysis. An untreated control group of the same size (n = 20), matched for ANB value, age, gender distribution, and CVM stage, was selected from the records of the American Association of Orthodontists Foundation (AAOF) craniofacial growth legacy collection (http://www.aaoflegacycollection.org/). For the control group, the T0 cephalogram was taken at an age comparable to that of the T0 of the study group, while the T1 cephalogram was selected after a time interval corresponding to the mean treatment time of the study group (24 months).

All the cephalograms were oriented with the Frankfurt plane parallel to the ground and then transformed into black and white silhouettes depicting the soft tissue facial profile, using image editing software (GNU Image Manipulation Program version 2.10, Free Software Foundation Inc., Boston, MA, USA). Each image was then cropped superiorly 1 cm above the subnasale point, anteriorly 2 cm away from the tip of the nose, inferiorly 4 cm below the throat point, and posteriorly through the tragus point (Fig. 1).

Figure 1

Examples of silhouettes retrieved from patients’ profiles and included in the questionnaire: T0 (A) and T1 (B) silhouettes of a patient treated with Sander’s bite jumping appliance, and T0 (C) and T1 (D) silhouettes of an untreated control patient.

Preparation of the online questionnaire

A four-page online questionnaire was then prepared using an online service (http://www.surveymonkey.com/). The first page of the questionnaire was used to inform the responder about the study and obtain their informed consent; no information was given about the nature of the images and the scope of the study, to ensure blinding of the observers. The second page was used to collect information about gender, age, profession, and years of experience in the case of dental professionals. The third page presented three “calibration” images, intended to illustrate three attractiveness categories, from the least attractive to the most pleasant. The three calibration images were obtained from the same radiograph of a subject with a mild skeletal Class II (ANB = 4.6°), a marked labio-mental angle, but a pleasant lip projection; this image was used as the “medium” reference (Fig. 2B). Then, the same image was modified by retruding the chin markedly to obtain a “very unpleasant” profile (Fig. 2C), and by moving the chin forward to simulate a skeletal Class I and modifying the lips to conform to the cephalometric norms23 to obtain a “very pleasant” profile (Fig. 2A). The fourth page showed the T0 and T1 pictures of the study group and the control group in a random order (a random sequence of numbers was obtained using an online tool, www.randomizer.org) and without any identifier, plus one “ideal” profile (Fig. 3) showing the perfect proportions suggested by cephalometric norms and redrawn from a previous publication23. The observers were asked to rate the attractiveness of the image shown by using a visual analogue scale (VAS) bar from 0 (absolutely not attractive) to 10 (absolutely attractive). The meaning of the VAS score was explained only on the first page of the questionnaire.

Figure 2

Calibration silhouettes shown to the observers for evaluation. (A) Calibration profile no. 1 depicting a “very pleasant” profile; (B) calibration profile no. 2 depicting a “medium” profile; (C) calibration profile no. 3 depicting a “very unpleasant” profile.

Figure 3

Silhouette of the “ideal” profile redrawn from the publication of Czarnecki et al., 1993.

The questionnaire was disseminated online through social media to Italian people of any age and expertise, for 2 months. Taking the VAS score assigned to the profiles by the observers as the primary endpoint and setting the null hypothesis H0: δ = 0, we calculated a priori that 745 observers would be adequate, assuming 90% power, a 1% type I error rate, and an effect size threshold of 0.2 (Cohen's d). After adjusting this number for an expected 10% rate of incomplete questionnaires, an overall sample size of 820 observers was defined.
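The arithmetic behind these figures can be illustrated with a short sketch. The text does not state which formula or software was used, so the normal-approximation formula for a two-sided, two-sample comparison of means is an assumption here, and the function names are hypothetical. Under that assumption the result is 744 per group, close to the reported 745; the small gap may reflect a different correction or software.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size: float, alpha: float, power: float) -> int:
    """Sample size per group for a two-sided, two-sample comparison of means.

    Uses the normal approximation n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2.
    This formula is an assumption; the source does not name its method.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


def inflate_for_incomplete(n: int, incomplete_rate: float) -> int:
    """Increase n by the expected fraction of incomplete questionnaires."""
    return math.ceil(n * (1 + incomplete_rate))


# Parameters taken from the text: d = 0.2, alpha = 0.01, power = 0.90.
approx_n = n_per_group(effect_size=0.2, alpha=0.01, power=0.90)

# Reported base sample size (745) inflated by the expected 10% incompleteness.
target_n = inflate_for_incomplete(745, 0.10)

print(approx_n, target_n)
```

The second step reproduces the reported target: 745 increased by 10% and rounded up gives 820 observers.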

Observers’ coherence

To evaluate each observer’s ability to perform a coherent aesthetic assessment, and to gain an impression of the overall quality of the scores, two binary variables were created: the external coherence, that is the ability to discriminate the ordinal categories of the three calibration profiles, and the internal coherence, that is the consistency in assigning the “very pleasant” calibration profile and the “ideal” profile to the same ordinal category of scores (i.e. internal coherence is present if the score assigned to the “ideal” profile is greater than the score given to the “medium” calibration profile by the same observer). An observer was considered incoherent if their external coherence and internal coherence had different values (i.e. externally coherent but internally incoherent, or vice-versa).
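The two coherence variables can be expressed as simple rules on the four reference scores given by one observer. The following is a minimal sketch, not the authors' implementation: the class and function names are hypothetical, and the use of strict inequalities for "discriminating" the calibration categories is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ObserverScores:
    """VAS scores (0-10) given by one observer to the reference profiles."""
    very_pleasant: float    # calibration profile no. 1
    medium: float           # calibration profile no. 2
    very_unpleasant: float  # calibration profile no. 3
    ideal: float            # "ideal" profile shown on the fourth page


def external_coherence(s: ObserverScores) -> bool:
    # Assumption: the three calibration profiles must be strictly ordered.
    return s.very_pleasant > s.medium > s.very_unpleasant


def internal_coherence(s: ObserverScores) -> bool:
    # As stated in the text: "ideal" scored higher than "medium".
    return s.ideal > s.medium


def is_incoherent(s: ObserverScores) -> bool:
    # Incoherent when the two binary variables take different values.
    return external_coherence(s) != internal_coherence(s)


example = ObserverScores(very_pleasant=8, medium=5, very_unpleasant=2, ideal=4)
print(is_incoherent(example))  # externally coherent, internally incoherent
```

Note that, under the definition as written, an observer who fails both checks is not classified as incoherent, because the two variables then share the same value.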

Statistical analysis

Descriptive statistics were calculated for all the variables. A Shapiro–Wilk normality test was used to assess the type of data distribution for all the variables. The ANB value and the age of the subjects at T0 in the study group and in the control group were compared using a t-test for independent samples. The gender distribution between the two groups was compared using a Pearson χ² test. After a test for the equality of variances, a t-test for independent samples was performed to compare the T1–T0 differences in scores between the two groups. To evaluate the effects of treatment, observer’s age, gender, expertise, and coherence on the VAS scores, a multivariate random intercept model was fitted.
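For readers unfamiliar with an independent-samples t-test that does not assume equal variances, a minimal sketch of the Welch statistic and its Welch–Satterthwaite degrees of freedom is given below. Whether the authors used exactly this variant is an assumption, the function name is hypothetical, and the score lists are made-up illustration data, not study results.

```python
import math
from statistics import mean, variance


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Return (t statistic, degrees of freedom) for Welch's unequal-variance t-test."""
    na, nb = len(a), len(b)
    # Squared standard error contributed by each group (sample variance / n).
    se2_a = variance(a) / na
    se2_b = variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


# Hypothetical T1-T0 score differences for two groups (illustration only).
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(group_a, group_b)
print(round(t_stat, 3), round(dof, 3))
```

The p-value would then be read from a t distribution with the computed degrees of freedom, which requires a statistics package beyond the Python standard library.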

To further analyse observers’ coherence, a Shapiro–Wilk test was used to compare the skewness of the scores attributed to the “ideal” profile by the coherent and the incoherent observers. The distribution of incoherent observers among genders and types of expertise was also studied through a Pearson χ² test. Then, to gain an impression of the quality of the incoherent scores (that is, whether they express the observer’s own perception or result from a random assignment of scores), a one-sample test of proportion was used to test the null hypothesis that the proportion of the two types of coherence among the incoherent observers was equal to 0.5.
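The logic of a one-sample test of proportion against 0.5 can be sketched as follows. This is an illustrative z-test under the normal approximation; the function name is hypothetical and the counts in the example are invented, since the number of incoherent observers of each type is not given in this section.

```python
import math
from statistics import NormalDist


def proportion_z_test(successes: int, n: int, p0: float = 0.5) -> tuple[float, float, float]:
    """One-sample z-test of a proportion against p0 (two-sided).

    Returns (observed proportion, z statistic, p-value).
    """
    p_hat = successes / n
    # Standard error computed under the null hypothesis p = p0.
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_hat, z, p_value


# Hypothetical counts: 60 of 100 incoherent observers show one type of coherence.
p_hat, z, p_value = proportion_z_test(successes=60, n=100)
print(p_hat, round(z, 2), round(p_value, 4))
```

If scores were assigned at random, the two types of coherence would be expected in roughly equal proportions, so a proportion far from 0.5 argues against random scoring.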

Results

The demographic characteristics of the skeletal Class II patients included in the treated group and in the control group are reported in Table 1. The two groups were comparable for gender distribution (Pearson χ² = 0.032, p = 0.858), age at T0 (independent samples t-test, mean difference −0.14, p = 0.757, assuming unequal variances) and age at T1 (independent samples t-test, mean difference 0.01, p = 0.990, assuming unequal variances). The Treatment group showed a mean ANB value of 5.9° ± 1.4 and the Control group a mean ANB value of 5.6° ± 1.7; both were normally distributed, with no statistically significant difference between the two groups (independent samples t-test assuming equal variances, mean difference −0.3°, p = 0.559).

Table 1 Demographic characteristics of the skeletal class II treated and untreated groups.

A total of 910 questionnaires were collected, with a completion rate of 61%. Thirty-eight percent of the responders (n = 343) were male and 62% (n = 567) were female. The largest age group (36%, n = 328) was aged between 25 and 34 years; 18% (n = 167) were aged between 35 and 44 years, 17% (n = 159) between 18 and 24 years, 13% (n = 118) between 55 and 64 years, 12% (n = 109) between 45 and 54 years, and 3% (n = 29) above 65 years. Regarding the responders’ expertise, half of the observers (n = 452) were laypeople, 24% (n = 214) were dentists, 18% (n = 168) were orthodontists, and 8% (n = 76) were undergraduate students in dentistry.

The descriptive statistics for the VAS scores attributed by the observers to the three calibration profiles, to the ideal profile, and to the Treatment group and the Control group profiles are reported in Table 2.

Table 2 Descriptive statistics for the overall VAS scores, divided by the observer's expertise.

The independent samples t-test, adjusted for the lack of homoscedasticity, revealed a statistically significant difference between the Treatment group and the Control group in the T1–T0 change in the scores assigned by all the observers (Table 3). The random intercept model revealed a significant interaction of the presence/absence of treatment, the gender of the observers, and the observers’ coherence on the VAS scores attributed to the profiles (Table 4).

Table 3 Statistical comparison for the T1–T0 difference in VAS scores from all the observers between the Treatment group and the Control group.
Table 4 Random intercept model.

Regarding the coherence analysis, a total of 27.4% of the observers were incoherent. The distribution of incoherent observers differed significantly between genders, but not among the different types of expertise (Supplementary File 1). Interestingly, the scores attributed to the “ideal” profile by coherent and incoherent observers showed a contrasting trend (Supplementary File 2). The one-sample test of proportion (Supplementary File 3) revealed a distribution significantly different from 0.5 (mean 0.92, z = 13.4, p < 0.001).

Discussion

To the best of our knowledge, this is the first study of this kind to collect a large sample of observers and have them evaluate the attractiveness of a meaningful sample of treated and untreated profiles. The Treatment and the Control group were comparable in terms of age, gender distribution, and skeletal Class II severity evaluated through the ANB angle, thus reducing the number of possible confounders. Treatment with a Sander bite-jumping appliance was chosen because, according to a systematic review, it is the removable appliance that seems to produce the greatest increase in Condylion–Pogonion mandibular length12.

The use of black and white silhouettes has been suggested by several authors, because they overcome the effect of confounding factors like hair, eyes, skin color, makeup etc.19,26, and are especially suitable to evaluate the sagittal position of the jaws11. In addition, in the present study the silhouettes were cropped at the level of the subnasale point, to allow the observers to focus on the lower third of the face, without being distracted by the shape and dimension of the nose27 or the forehead.

The use of VAS scores is a simple method that is easily understood and applied by the observers and widely approved by the scientific community for tasks like the one pursued in the present study11,26.

In general, the mean scores attributed by the observers were in the middle of the VAS range. This could have been expected, since it is known that skeletal Class II profiles are considered less attractive than Class I profiles2,6. On the other hand, it was surprising that the “very pleasant” profile among the three calibration images and the “ideal” profile also scored only slightly above the average (Table 2). Indeed, only 54 observers attributed a score of 10 to the “very pleasant” calibration profile, and only 13 observers out of 612 assigned a score of 10 to the “ideal” profile. Furthermore, there was a difference in scoring the “ideal” profile between the coherent and incoherent observers (Supplementary File 2). Indeed, the observers’ coherence had a significant effect on the VAS scores collected (Table 4), and according to the statistical analysis it is likely that the incoherence, as defined in the present work, was an expression of the inner scheme of categorization of each observer, rather than an expression of an inaccuracy in scoring. However, it must be noted that the meaning of the variable coherence is open to different psychometric interpretations, which are beyond the scope of the present manuscript.

Overall, the Treatment group showed a greater T1–T0 increase in VAS scores than the Control group, based on all observers (Table 3). It is worth remarking that, although this effect seems small, it corresponds to 9% of the average score of the “ideal” profile observed in the present study. This result suggests that functional treatment is effective at improving a patient’s facial attractiveness, compared to the absence of treatment. This is consistent with previous studies conducted on smaller samples9,10,19,24,28 and in contrast with another study that reported no differences between the pre-treatment and the post-treatment profiles11. There were no differences in facial attractiveness perception between dentists, orthodontists, undergraduates in dentistry, and laypeople. This finding is in contrast with previous studies3,9,21,22,29,30. On the other hand, other authors reported results similar to those of the present study6,11,19,24,31,32,33. Similarly, the observer’s age had no significant effect on the ratings assigned to the profiles. Interestingly, a strong effect of gender on the aesthetic perception of the observers was detected, with male observers tending to assign higher scores than females. This finding is in contrast with that of O’Neill et al.11, who found no differences between male and female raters.

Even though it is possible to conclude with adequate confidence that Class II treatment with functional appliances improves the aesthetic perception of the profile by both laypeople and professionals, although with gender differences, an evaluation of the impact on patients’ quality of life would be advisable, so that this aspect can be taken into account when recommending two-phase treatment.

Limitations

The rate of complete responses was lower than expected, but this was understandable since the observers were asked to evaluate a very large number of profiles (n = 84). Missing data were assumed to be missing completely at random (MCAR), based on the distribution shapes of random sub-samples bootstrapped from the overall sample.

Disseminating a questionnaire online provided several advantages34, but on the other hand the responses were not collected in a standardized manner and it was impossible to control factors such as the time spent on each image, or whether the observers completed the questionnaires with the help of their peers. In addition, it was not possible to change the random order of the silhouettes for every observer due to technical limitations. From this point of view, the coherence test assumed a crucial role. Dissemination of the questionnaire was limited to people living in Italy, because cultural differences could act as confounders26,35,36; although this strengthens the validity of our results, it reduces their applicability to other populations with a different cultural background.

The use of historical controls has been criticized by some authors because growth is influenced by secular trends37. However, it is the only ethically acceptable method and other authors have concluded that their use offers results comparable to those of a prospectively recruited sample38.

Conclusions

Skeletal Class II treatment with a Sander bite-jumping appliance is effective at improving the patient’s profile attractiveness, compared to untreated Class II controls. This improvement is perceived equally by dental professionals with different levels of expertise and by laypeople. On the other hand, a difference between genders was observed, with female observers assigning lower scores. Many observers showed incoherence when evaluating the three calibration profiles and the ideal profile; this finding was gender-related and had a significant impact on the results, but was considered an expression of grading ability rather than an inaccuracy in scoring.