Introduction

Age estimation, an important aspect of forensics and orthodontics, is often used when chronological age cannot be determined1. Indeed, estimating dental age in children is useful in several situations such as orthodontic treatment planning, forensic dentistry, and other clinical scenarios2,3. In living individuals, age estimation is a crucial and increasingly common forensic practice because of the growing number of individuals who lack identification documents and whose real age must be clarified for criminal, civil, asylum, or old-age pension proceedings4,5,6,7,8,9. Age estimation is increasingly requested by judicial authorities to determine if the adult penal law should be applied according to legally relevant age ranges10. Age estimation has also been used in professional sports, where age falsification could provide athletes with significant competitive advantages11.

Age can be estimated from different measures and radiological examinations10, most commonly of the teeth and the hand-wrist. Teeth are one of the strongest structures in the human body and, together with the skeletal system, pass through a series of developmental changes that represent valid indices for age determination12,13,14,15,16. Skeletal maturity is assessed by radiography of specific structures such as the medial clavicular epiphysis cartilage17,18,19, pubic symphysis20, and the left hand-wrist area10. However, methods based on skeletal maturity are more variable and susceptible to error than methods based on tooth maturation21,22,23. Dental methods identify the stages of tooth mineralization in radiographs and code them according to predetermined scores24 or continuous measures13,25. The most common method for age estimation was published in 1973 by Demirjian, Goldstein, and Tanner24 and was subsequently modified by other authors. Demirjian’s method is based on eight developmental stages, ranging from crown and root formation to apex closure, of the seven left permanent mandibular teeth. A score is assigned to each tooth according to its stage, and the sum of the scores represents the subject’s dental maturity score (DMS). Since this seminal paper, the DMS has been used in regression equations to estimate the age of a subject.

Over the years, several different methods have been developed to increase the accuracy of age estimation. Technological developments in radiology have allowed more specific measurements to be made, increasing the accuracy of dental/skeletal maturation indicators26,27,28,29. There has also been a focus on refining age estimation methods to better predict chronological age13,25,30.

A method can be considered "valid" only after it has been validated. Validation refers to the process of applying the age estimation method to a sample other than the one used to calibrate the method31. The sample can be external or a test set obtained by splitting the study sample into training and test sets. To evaluate the method’s validity, the distribution of errors between chronological and estimated age is then evaluated on this external sample or test set.
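As a minimal sketch of this evaluation step, the following Python snippet computes the error distribution on a test set. The sign convention (estimated minus chronological age, so that positive values indicate overestimation) matches how errors are reported in the Results; the function names and the age values are hypothetical and serve only as an illustration.

```python
from statistics import mean, stdev


def validation_errors(chronological, estimated):
    """Per-subject error in years: estimated age minus chronological age."""
    if len(chronological) != len(estimated):
        raise ValueError("age lists must have the same length")
    return [e - c for c, e in zip(chronological, estimated)]


def summarize(errors):
    """Mean error (bias), mean absolute error, and SD of the errors."""
    return {
        "mean_error": mean(errors),
        "mean_absolute_error": mean(abs(x) for x in errors),
        "sd_error": stdev(errors),
    }


# Hypothetical test-set ages in years (illustrative values, not study data)
chronological = [8.0, 10.5, 12.0, 14.5]
estimated = [8.5, 10.0, 12.5, 15.0]

summary = summarize(validation_errors(chronological, estimated))
print(summary)
```

A mean error near zero with a large standard deviation would indicate a method that is unbiased on average but imprecise for individual subjects, which is why both quantities are worth reporting.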

Inter-observer reliability is defined as the agreement between two or more observers, while intra-observer reliability is defined as the agreement of the same evaluator at two or more different time points. Cohen’s K statistic is commonly used for reliability assessments of categorical scales, while the intraclass correlation coefficient (ICC) or the concordance correlation coefficient (CCC) statistics are appropriate for continuous scales32.
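For categorical staging, Cohen’s kappa compares the observed agreement between two observers with the agreement expected by chance. The sketch below implements the standard unweighted statistic; the stage labels and ratings are hypothetical, and weighted variants, which some studies may use for ordinal stages, are not shown.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Proportion of items on which the two raters agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical developmental stages assigned by two observers to six teeth
observer_1 = ["D", "E", "E", "F", "G", "H"]
observer_2 = ["D", "E", "F", "F", "G", "H"]

kappa = cohens_kappa(observer_1, observer_2)
print(round(kappa, 3))
```

The same function applies to intra-observer reliability if the two lists hold one evaluator's ratings at two time points.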

Reference studies on forensic age estimation should report sex and ethnicity, two well-known factors associated with individual dental/skeletal maturity33,34,35, in addition to chronological age, bone age, the difference between bone age and chronological age, and intra-observer and inter-observer reproducibility36. While several literature reviews and meta-analyses have compared different age estimation methods3,37,38,39,40,41, to our knowledge no meta-analysis has yet also compared validation and reproducibility. We aimed to assess the validity of age estimation methods based on bone or dental maturity indices and the reproducibility of these maturity indices, through meta-analysis of validation and reproducibility studies. Therefore, the review questions were “What is the level of validity of age estimation methods based on bone and dental maturity indices? What is the level of reproducibility of bone and dental maturity indices?”

Results

Study selection

The literature search returned 51 articles from PubMed and 382 from Google Scholar (total 433). After removing duplicates (28 articles), the titles and abstracts were independently screened by two authors (VM and CM), leaving 75 eligible articles. After reading the full text, 59 articles were excluded: 31 did not validate the age estimation method; 10 focused only on assessing a threshold for 14- or 18-year-old subjects; and 18 had incomplete or unusable data.

Sixteen studies were therefore included in the qualitative synthesis, and seven studies, which complied with the inclusion criteria, were also included after further examination of previous meta-analyses or systematic reviews to provide a total of 23 articles (Fig. 1).

Figure 1

PRISMA flow diagram of the search results from the databases.

Characteristics of included articles

The characteristics of the 23 selected studies are detailed in Table 1. All studies adopted a cross-sectional design and were conducted in both university and hospital settings in 15 countries (Bosnia-Herzegovina42,43, Brazil44, China45,46,47, Colombia48, Egypt49, India50,51,52, Iran53, Italy54, South Korea55, Macedonia56, Malaysia57,58, Saudi Arabia59, Spain60,61, Sri Lanka62, and Turkey63,64). Sample sizes ranged from 70 (ref. 52) to 2641 (ref. 61) subjects who underwent orthopantomography (21 articles42,43,44,45,46,47,48,49,50,51,53,55,56,57,58,59,60,61,62,63), or wrist and hand X-rays (2 articles52,54). The age range was from 1 (ref. 52) to 24 years57, and most studies (17 out of 2342,43,44,46,48,49,50,51,53,54,55,56,58,59,60,63,64) enrolled subjects aged 16 years or younger.

Table 1 The studies included in the meta-analysis. All studies reported the type of examination as “orthopantomography” except two52,54.

Nine different age estimation methods were used, with a clear predominance of the Demirjian approach or its modification (16 out of 2342,45,46,47,50,53,54,55,56,58,59,60,61,62,63,64) and Willems (13 out of 2343,44,46,47,49,50,53,55,56,57,58,60,62). Other methods were used less often or only once (Cameriere, 5 out of 2343,48,49,53,58; Haavikko, 4 out of 2343,50,58,64; Smith, 1 out of 2353; Nolla, 7 out of 2350,51,58,60,61,63,64; Chaillet, 1 out of 2355; Blenkin and Evans, 1 out of 2362; Greulich and Pyle, 2 out of 2352,54).

Sixteen studies provided complete data for both mean errors and examiner agreements, while eight studies reported mean errors in age estimation without complete or usable data regarding the intra- or inter-observer agreement. The precision of the estimation methods was highly variable, with a mean error ranging from a maximum precision of − 0.02 years using the Cameriere method applied to males43 to a minimum of − 2.96 years using the Haavikko method applied to females50. The inter-examiner agreement ranged between 0.73 and 1 for Cohen’s k/Fleiss’ k and between 0.84 and 1 for ICC; similarly, the intra-examiner agreement ranged between 0.82 and 0.99 for Cohen’s k and between 0.80 and 1 for ICC.

Study quality assessment (qualitative synthesis)

The risk of bias assessment for the selected studies is presented in Table 2 and illustrated in Fig. 2. All studies accurately described the patient selection procedure except for El Bakary et al.49 and Javadinejad et al.53, in which the procedure was not clearly explained, and Franco et al.44, in which the criteria were not reported, so these studies were classified as “unclear”. With respect to the index test, we considered any study that clearly described the method of radiograph analysis, or the experience or number of observers making the measurements, as “low” risk. Three studies49,53,59 did not provide enough information, while another study was not completely specific63. Four studies44,57,59,63 did not report how the chronological age was assessed (the reference standard in Fig. 2), and this was interpreted as a risk of bias since a person could be mistaken or lie about their age. All studies provided good information on flow and timing.

Table 2 Quality assessment performed using the QUADAS-2 instrument.
Figure 2

Quality assessment obtained using the QUADAS-2 instrument for the 23 selected studies.

Despite the possibility of bias, no study had applicability concerns. All articles met the minimum criterion of regularity in the procedures, as defined by the PICOS/PECOS strategy66, and therefore were included in the analysis.

Meta-analysis of age estimation validity

Since we found only two studies based on bone maturation indices, we did not produce a meta-estimate of the mean error. Concerning the age estimation validity based on dental maturation indices, significant heterogeneity was found for both males and females (males: I² = 99.6% [95% CI 99.6%; 99.7%]; τ² = 0.54 [95% CI 0.38; 0.86]; females: I² = 99.6% [95% CI 99.5%; 99.6%]; τ² = 0.56 [95% CI 0.38; 0.88]) due to the large sample size and the precision of the included studies. As a result, a mixed-effects model was applied to calculate the pooled mean error of age estimation by sex. The pooled male mean error of the age prediction was 0.08 years (95% CI − 0.12; 0.29), and the pooled female mean error was 0.09 years (95% CI − 0.12; 0.30). Figure 3 shows the stratification by age estimation methods, which are also summarized in Supplementary Methods 1.
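To illustrate how the heterogeneity statistics and a pooled mean error relate to the study-level inputs, the following Python sketch pools mean errors with a random-effects model. It assumes the DerSimonian-Laird estimator of the between-study variance, which is one common choice; the source does not state which estimator was used, the sketch omits the moderator (method) term of a mixed-effects model, and the input values are hypothetical.

```python
import math


def random_effects_pool(effects, variances):
    """Random-effects pooling with the DerSimonian-Laird tau^2 estimator.

    effects: study-level mean errors (years)
    variances: squared standard errors of those mean errors
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")
    w = [1.0 / v for v in variances]  # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q: weighted squared deviations from the fixed-effect estimate
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)  # between-study variance
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return {"pooled": pooled, "se": se, "q": q, "tau2": tau2, "i2": i2}


# Hypothetical mean errors (years) and their variances for three studies
result = random_effects_pool([0.6, 0.1, -0.2], [0.01, 0.01, 0.01])
print(result)
```

With precise studies (small variances), even moderate differences between study-level mean errors produce a large Q relative to its degrees of freedom, and therefore an I² close to 100%.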

Figure 3

Forest plots showing the pooled mean errors of the age predictions for males (A) and females (B) by method of age estimation.

Studies that implemented Nolla’s method had a mean error closest to zero with a slight overestimation: mean male age prediction error of 0.02 years (95% CI − 0.37; 0.41) and mean female age prediction error of 0.03 years (95% CI − 0.34; 0.41). Haavikko’s method was the least accurate, with a mean error of − 1.12 years (95% CI − 2.29; 0.06) and − 1.33 years (95% CI − 2.54; − 0.13) for males and females, respectively. Cameriere’s method also underestimated the chronological age and was the only method with a higher absolute mean error for males than females (males: − 0.22 [95% CI − 0.44; 0.00]; females: − 0.17 [95% CI − 0.34; − 0.01]). Generally, Demirjian’s and Willems’s methods tended to overestimate chronological age in both males (Demirjian: 0.59 [95% CI 0.28; 0.91]; Willems: 0.07 [95% CI − 0.17; 0.31]) and females (Demirjian: 0.64 [95% CI 0.38; 0.90]; Willems: 0.09 [95% CI − 0.13; 0.31]).

We included three studies in the “others” category53,55,62 for age estimation based on dental maturity (Blenkin & Evans, Chaillet and Smith). These methods underestimated chronological age for both sexes (males: mean = − 0.26; 95% CI [− 0.65; 0.12], females: mean = − 0.29; 95% CI [− 0.61; 0.02]).

For both males and females, the prediction interval (PI) overlapped zero for all methods, indicating that the difference between estimated and chronological ages was not statistically significant. For both sexes, Cameriere’s method showed the narrowest PI, while Haavikko’s method and the “others” category had the widest intervals.
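The source does not give the formula used for the prediction intervals. A commonly used form for a random-effects meta-analysis of k studies, stated here as an assumption rather than as the authors' exact computation, is:

```latex
\[
\hat{\mu} \;\pm\; t_{k-2,\,0.975}\,
\sqrt{\hat{\tau}^{2} + \widehat{\mathrm{SE}}(\hat{\mu})^{2}}
\]
```

Here \hat{\mu} is the pooled mean error, \hat{\tau}^{2} is the estimated between-study variance, \widehat{\mathrm{SE}}(\hat{\mu}) is the standard error of the pooled estimate, and t_{k-2, 0.975} is the 97.5th percentile of the t distribution with k − 2 degrees of freedom. Because the interval includes \hat{\tau}^{2}, it is wider than the confidence interval for the pooled mean whenever heterogeneity is present, and it widens further when a method has been validated in only a few studies.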

Meta-analysis of intra- and inter-examiner agreement

It was not possible to obtain a pooled Cohen’s k (or Fleiss’ k) due to a lack of information on the standard error or variance in the examined studies. Therefore, we pooled only studies that reported ICCs, using the global reliability values without stratification by sex. The meta-analytic pooled estimates of inter-examiner and intra-examiner agreement are summarized in Fig. 4.

Figure 4

Forest plots showing the pooled inter-examiner (A) and intra-examiner (B) agreement.

No heterogeneity was observed in inter-examiner (heterogeneity: Q = 5.78, p = 0.888) and intra-examiner (heterogeneity: Q = 9.11, p = 0.611) agreement, so a fixed-effects model was used. For inter-examiner agreement, the ICCs ranged from 0.89 to 0.99, and the meta-analytic pooled ICC was 0.98 (95% CI 0.97; 1.00), which was close to perfect reliability. Concerning intra-examiner agreement, the ICCs ranged from 0.90 to 1.00, and the meta-analytic pooled ICC was 0.99 (95% CI 0.98; 1.00), which was also close to perfect reliability.
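The fixed-effects pooling described above can be sketched as an inverse-variance weighted average. The snippet assumes that each study supplies an ICC with a standard error and that pooling is done on the raw ICC scale; the source does not state whether a transformation was applied, and the input values are hypothetical.

```python
from statistics import NormalDist


def fixed_effect_pool(estimates, std_errors, level=0.95):
    """Inverse-variance fixed-effect pooled estimate, CI, and Cochran's Q."""
    if len(estimates) != len(std_errors) or not estimates:
        raise ValueError("estimates and std_errors must match and be non-empty")
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total
    se_pooled = (1.0 / total) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    ci = (pooled - z * se_pooled, pooled + z * se_pooled)
    # Q is referred to a chi-square distribution with k - 1 degrees of freedom
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    return pooled, ci, q


# Hypothetical ICCs and standard errors from three studies
pooled_icc, ci_icc, q_stat = fixed_effect_pool([0.97, 0.99, 0.98],
                                               [0.01, 0.01, 0.01])
print(pooled_icc, ci_icc, q_stat)
```

When Q does not exceed its degrees of freedom, as in this illustration, there is no evidence of excess between-study variability, which is the situation in which a fixed-effects model is usually considered adequate.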

Discussion

Age estimation represents one of the most important aspects of dental/skeletal analysis and forensic anthropology, playing a key role in human identification, both in living subjects and to establish identity in human remains1,2,29. This meta-analysis provides a comprehensive overview of the current literature on the validity of age estimation methods and reproducibility of maturity indices, in particular those based on dental maturation. Although bone age has been widely used, we found only 2 validation studies on methods based on bone maturity indices that met our inclusion criteria. This scarcity may reflect evidence that bone maturity indices are more affected by environmental factors than dental ones23, so that each index may be properly validated only in the population in which it was developed. The 21 studies on dental maturity indices identified were conducted in different countries with the aim of validating specific methods of age estimation in specific populations. Although the age estimation methods were applied to different populations, the meta-analysis results, stratified by sex and method, showed similar accuracy. In fact, for both males and females, the prediction intervals obtained for each method spanned zero, indicating that, despite the different prediction intervals and different target populations, all methods can be considered accurate. Significant heterogeneity between studies was observed for both sexes as a consequence of the large sample size of the studies and hence of the high level of precision of error estimates. Using a meta-regression model, we investigated whether this heterogeneity might be further explained by differences in characteristics of the studies or study populations such as type of method, publication year, ethnicity, mean age of the study sample, and impact factor of the journal; the I² index still remained very high (99.2%) for both sexes (data not shown). To account for the heterogeneity between studies, we estimated random-effects models and prediction intervals, the latter indicating the range in which the validity of further studies is expected to fall based on current evidence67.

The studies that validated Nolla’s method had a mean age estimation error closest to zero for both males (0.02 years) and females (0.04 years), while Cameriere’s method had the narrowest prediction interval (male PI [− 1.07; 0.63]; female PI [− 0.82; 0.47]). Among the selected studies, Demirjian’s method and its revised version by Willems were the most frequently used methods for age estimation due to their ease of use, high reproducibility, and accuracy. Both methods tended to overestimate chronological age in males and females, but Willems’ method had a narrower prediction interval, between − 0.95 and 1.09 in males and − 0.81 and 0.99 in females, than Demirjian’s (male prediction interval [− 0.83; 2.01]; female prediction interval [− 0.54; 1.81]).

The Haavikko method had the highest variability, with a prediction interval ranging from − 6.88 to 4.65 for males and from − 7.24 to 4.57 for females. This might be due to the variability in dental maturation among subjects of different ethnic origin1, since Haavikko’s method is calibrated on Finnish children, whose dental maturation seems to occur earlier68. Recently, Butti et al.69 and Mohammed et al.50 reached the same conclusion that Haavikko’s method is unsuitable for both Italian and Indian children.

With respect to method reliability, our results showed pooled estimates of reproducibility values close to perfect reliability (about unity), indicating that the methods are highly repeatable by expert examiners. This high reproducibility might be due to positive publication bias, as studies reporting good reliability are more frequently available in the literature than studies reporting poor or no reliability70.

The strengths of our research are the adequate number of studies included, the precision of pooled mean errors, and the comprehensive evaluation of all methods and indices based on dental maturity for which, respectively, the validity and reproducibility measures were available in the literature. To the best of our knowledge, this is the first meta-analysis that simultaneously evaluated the validity of dental age estimation methods and the reproducibility of dental maturity indices, thereby allowing more informed and safer choices in all medical and legal fields requiring these methods. Finally, the quality of the selected studies was very high: only 10% of studies had an unclear or high risk of bias, and none raised concerns about applicability.

However, our evaluation also has some limitations and it shows a partial picture of validity and reproducibility of age estimation methods, due to the strict exclusion criteria applied in order to provide unbiased meta-estimates. We excluded articles without information on both validity and reproducibility outcomes, articles not written in English or Italian, and those where it was impossible to obtain pooled reproducibility estimates of Cohen’s kappa or the ICC due to a lack of information on the variability measure. In addition, some studies used inappropriate methods to estimate reproducibility, as discussed in Ferrante et al.71. Lastly, after reading the full texts, we excluded several studies (31 out of 75) with the word “validation” in the title or abstract but that used an inadequate approach to validate the method or no validation at all.

In conclusion, since only two studies based on bone maturity indices reported the validation and reproducibility analysis, it was not possible to perform a meta-analysis for them. All studies reporting methods based on dental maturity indices, which underwent a validation process, were considered in this review and for each method the difference between estimated and chronological age was not significantly different from zero years, highlighting a high validity. Nevertheless, there was a high degree of variability in the precision of the prediction intervals (research focus 1; Supplementary material “Methods”). Furthermore, a high intra- and inter-observer reproducibility of dental maturity indices was observed (research focus 2; Supplementary material “Methods”). The Nolla and Cameriere methods might be recommended as preferred approaches, although the Cameriere method was validated on a smaller sample size than Nolla’s and it requires further testing on additional populations to better assess the mean error estimates by sex. In the development of new methods of age estimation, it will be important to apply rigorous validation and publish a minimum dataset that ensures comparability of validity and reliability between different studies.