Abstract
The aim of this study was to improve the forecasting of craniofacial imbalance risk during growth among patients affected by Class III malocclusion. To this purpose we used computational methodologies such as Transductive Learning (TL), Boosting (B), and Feature Engineering (FE) instead of traditional statistical analysis based on classification trees and logistic models. These techniques were applied to cephalometric data from 728 cross-sectional untreated Class III subjects (6–14 years of age) and from 91 untreated Class III subjects followed longitudinally during growth. A cephalometric analysis comprising 11 variables was also performed. The subjects followed longitudinally were divided into two subgroups, favourable and unfavourable growth, by comparison with normal craniofacial growth. With respect to traditional statistical predictive analytics, TL increased the accuracy in identifying subjects at risk of unfavourable growth. The TL algorithm was useful in diffusing information from longitudinal to cross-sectional subjects: the accuracy in identifying subjects at high risk of growth worsening increased from 63% to 78%. Finally, FE produced a further increase in identification accuracy, up to 83%. A ranking of the variables most important in identifying subjects at risk of growth worsening was thereby obtained.
Introduction
Data Mining and Machine Learning are mathematical procedures focused on developing algorithms and logical statements that can learn from data and make subsequent predictions^{1,2,3}. These algorithms have been used successfully to extract meaningful information highlighting important, potentially unintuitive system behaviours in various domains such as econometrics, aerospace, robotics, finance, weather forecasting, and biomedicine^{3,4}. In medicine, research on computer-aided diagnosis from artificial intelligence began to be applied with the hope that difficult clinical diagnostic and prognostic problems might yield to mathematical formalism, and that computers might have an active role as consultants^{5}. Over the years it became clear that, whenever the process is not solely stochastic, machine-learning algorithms could provide some insight into the likelihood of which outcome will occur (the odds, not the absolute certainty) by identifying new, high-order concepts and relations within components of biomedical pathways and generating automated hypotheses within syndromes^{6}. Important tasks in bioinformatics are the prediction of a response variable (i.e., the disease status of a patient) based on a large number of predictors (i.e., signs and symptoms), and the reliable identification of the relevance of each candidate variable to patient outcome. Although Machine Learning tools have been developed to decipher the underlying patterns in thousands or millions of data points, in the medical field these methods have also provided useful prognostic indications based on the characteristics of a few dozen patients^{5,6,7,8,9}.
Malocclusions are isoforms of disharmony: they incur costs in terms of weakness of mechanotransduction, cumulative occlusal trauma, and outcome uncertainty about the ultimate facial appearance^{9}. Class III malocclusion is characterised by the protrusion of the lower dental arch and an incorrect relation between the teeth of the two dental arches when they approach each other as the jaws close. The contemporary approach to treatment of Class III malocclusion aims to avoid orthognathic surgery and its potential complications. The need for growth modification promoting mandibular restraint, maxillary protraction, and dentoalveolar compensation has been particularly stressed^{9}. Therefore, at early treatment planning it would be highly desirable to know the complex morphological, growth, and functional characteristics rendering any individual patient either favourable or unfavourable for successful orthodontic treatment.
Class III disharmony may show a significant tendency to worsen with growth if the patient is left untreated. Moreover, a discrepancy between maxillary and mandibular growth can result in relapse of the malocclusion after the pubertal growth peak, and some patients may need corrective surgery at the end of craniofacial growth. Accordingly, the ability to predict the outcome of this disharmony would provide a significant advantage in orthodontic diagnosis and prognosis.
In clinical orthodontics, traditional statistical models suffer in handling categorical data, correlated variables, missing values, and the scarce amount of patients' longitudinal data from which to derive the quality of future craniofacial growth and treatment outcomes. On the contrary, large numbers of subjects with unlabelled growth outcomes (i.e. cross-sectional subjects) are readily available, though there are few ways to use these subjects for growth prediction^{9}. Only in longitudinal surveys are orthodontic researchers able to detect changes in the growth characteristics of the target population, at both group and individual levels, and thereby establish sequences of growth events. Moreover, cross-sectional studies typically do not provide definitive information about cause-and-effect relationships. Longitudinal data on untreated subjects with Class III malocclusion are rare, given the functional and aesthetic challenges: as the skeletal imbalance usually tends to worsen over time, individuals with this condition usually undergo treatment during childhood and adolescence. The intriguing characteristic of a Transductive Learning (TL) procedure (also known as semi-supervised learning) resides in its capability of generating additional information from labelled (longitudinal) data through the spread of information to unlabelled (cross-sectional) data, i.e. subjects whose outcome is unknown^{10,11,12,13,14,15}. TL algorithms therefore take advantage of unlabelled data, used in conjunction with a small amount of labelled data, to produce a considerable improvement in prediction accuracy. In each step of this methodology, a subset of unlabelled cross-sectional subjects is labelled according to an iterative expectation-maximisation algorithm^{15}.
Recently, Auconi et al.^{16} applied two commonly used statistical methods, Classification Trees (CT) and Discriminant Analysis (DA), to forecast the individual risk of unfavourable dentoskeletal growth in a particular set of untreated subjects affected by Class III malocclusion. Among 91 subjects, Auconi and coworkers found 28 "good growers" and 63 "bad growers"^{16}. Two cephalometric angles (SNA and PPMP; see below) proved to be strongly related to the risk of unfavourable growth. However, the ability of these methods to predict the risk among new, unseen Class III subjects was relatively modest (accuracy: 64%). The purpose of this study is to submit the cephalometric values of this same dataset to a sequence of procedures, namely Transductive Learning, Boosting, and Feature Engineering, with the aim of improving the interpretability of the growth model and the prediction of the risk of an unfavourable dentoskeletal prognosis in these subjects.
Material and Methods
Two samples of longitudinal and cross-sectional untreated Class III subjects were analysed. The longitudinal sample consisted of 91 untreated Class III subjects (48 females and 43 males) who were enrolled in the study by Auconi et al.^{16}. Two lateral cephalograms were available for each subject. The mean age at T_{1} was 10.4 ± 2.0 years, while the mean age at T_{2} was 15.4 ± 1.9 years. The cross-sectional sample comprised 728 untreated Class III subjects (341 males and 387 females) with an age range from 7 to 13 years. This sample was enrolled previously in descriptive estimates of craniofacial growth in Class III malocclusion^{9,16,17,18,19}. To be included in this study, the subjects of both samples had to satisfy the following criteria: Caucasian ancestry, no previous orthodontic treatment, no congenitally missing teeth, and no craniofacial syndromes. Diagnosis of Class III malocclusion was based on an accentuated mesial step relationship of the primary second molars, anterior crossbite, permanent first molar relationship of at least one-half cusp Class III, Wits appraisal less than −2 mm, and ANB angle less than 0 degrees. This study was exempted from review by the Medical School Institutional Review Board of the University of Michigan, Ann Arbor (HUM00143467).
Cephalometric analysis
The quantities under consideration (as in similar analyses^{16}) are derived from the anatomy of the patient, as shown in Fig. 1. They are based on the following 11 variables: S–N, S–Ar, NSAr, SNA, SNB, Co–Go, Go–Gn, N–Me, ArGoMe, Palatal Plane to S–N (PPSN), and Palatal Plane to Mandibular Plane (PPMP), defined below. In particular, the following quantities were recorded, with two measurements at ages T_{1} and T_{2} (measured in years):

ID: anonymised ID code unique to each patient.

Growth: a binary variable with values "Good" or "Bad", determined on the basis of CoGn–CoA.

SN: length of the anterior cranial base (Sella–Nasion) (mm).

SAr: distance from point Sella to point Articulare (mm).

NSAr: Saddle angle (Nasion–Sella–Articulare) (degrees).

SNA: anteroposterior position of the maxilla to the anterior cranial base (degrees).

SNB: anteroposterior position of the mandible to the anterior cranial base (degrees).

ANB: angle between Down’s points A and B (degrees).

PPSN: palatal plane to anterior cranial base angle (degrees).

PPMP: palatal plane to mandibular plane angle (degrees).

NMe: total anterior face height (mm).

CoA: maxillary length from Condylion to point A (mm).

CoGn: total mandibular length from Condylion to Gnathion (mm).

CoGo: length of mandibular ramus from Condylion to Gonion (mm).

GoPg: length of mandibular body from Gonion to Pogonion (mm).

ArGoMe: Gonial angle (Articulare–Gonion–Menton) (degrees).

IMPA: lower incisor to mandibular plane angle (degrees).
We adopted the same method described in ref.^{16} to follow the Sagittal Skeletal Imbalance (SSI) during the growth process, defined as the difference between Co–Gn and Co–A, calculated at both T_{1} and T_{2} for all 91 longitudinal Class III subjects^{20}. To evaluate the progression of imbalance, the individual SSI of each Class III subject was compared with the SSI standard values of a normal population (matched for age and gender) derived from the cephalometric atlas by Bhatia and Leighton^{21}. "Good growers" were defined as those subjects who approached normal values during the growth process (T_{2}–T_{1} change), while "bad growers" were defined as those subjects who showed an increase of the difference with respect to normal values during the T_{2}–T_{1} change. Out of the 91 subjects, 28 (30.8%) were found to be good growers and 63 (69.2%) bad growers. A rigorous statistical analysis of the distributions and correlations of the variables used is beyond the scope of this paper, but it can be noted from the histograms in Fig. 4 that the ages are distributed with a prevalence of younger subjects, while the remaining variables are bell-shaped. For a more detailed analysis of the population, the reader is referred to the previous work of Auconi et al.^{16}.
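The labelling rule above can be sketched in a few lines. All the numbers below (measurements and atlas norms) are invented for illustration only; a subject is a "good grower" if its absolute deviation from the age/gender-matched normal SSI does not increase from T_{1} to T_{2}:

```python
import pandas as pd

# Hypothetical cephalometric values; SSI = CoGn - CoA at T1 and T2.
# SSI_norm_* stand in for the Bhatia-Leighton atlas norms (illustrative).
df = pd.DataFrame({
    "CoGn_T1": [105.0, 110.0], "CoA_T1": [82.0, 80.0],
    "CoGn_T2": [118.0, 126.0], "CoA_T2": [92.0, 88.0],
    "SSI_norm_T1": [21.0, 21.0], "SSI_norm_T2": [24.0, 24.0],
})
df["SSI_T1"] = df["CoGn_T1"] - df["CoA_T1"]
df["SSI_T2"] = df["CoGn_T2"] - df["CoA_T2"]
dev_T1 = (df["SSI_T1"] - df["SSI_norm_T1"]).abs()   # deviation from norm at T1
dev_T2 = (df["SSI_T2"] - df["SSI_norm_T2"]).abs()   # deviation from norm at T2
# "Good" if the subject approaches (or keeps) the normal value over T2-T1.
df["Growth"] = (dev_T2 <= dev_T1).map({True: "Good", False: "Bad"})
print(df[["SSI_T1", "SSI_T2", "Growth"]])
```

Here the first subject keeps the same deviation from the norm (labelled "Good"), while the second drifts further away (labelled "Bad").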
Statistical Analysis
The longitudinal data at T_{1} were analysed with the following steps: 1) Classification Trees; 2) Regularized Logistic Models (LASSO); 3) Transductive Learning approach (Semi-supervised Learning); 4) Boosting; 5) Feature Engineering. A logical scheme of this sequence of operations can be found in Fig. 5.
Classification Trees
A Classification Tree is a combination of mathematical and computational techniques that can aid in the description, categorisation, and generalisation of a given set of data, helping to identify the strategy that most likely will reach a goal^{3,6,11}. In the present study, Classification Trees were applied with the aim of identifying the most important predictors (among the 11 cephalometric variables at T_{1}) for good or bad craniofacial growth (the categorical dependent variable). Trees were produced using the Python library scikit-learn^{3,22}.
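A minimal sketch of this step using scikit-learn's `DecisionTreeClassifier`. Since the study's data are not reproduced here, synthetic data stand in for the 11 cephalometric variables at T_{1}, and the tree depth is an illustrative choice, not the study's setting:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 91 "subjects" with 11 features; the binary target
# plays the role of the Good/Bad growth label.
X, y = make_classification(n_samples=91, n_features=11, n_informative=4,
                           random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
# Estimate generalisation accuracy with 10-fold cross-validation,
# as done for the models in the Results section.
scores = cross_val_score(tree, X, y, cv=10, scoring="accuracy")
print(f"10-fold CV accuracy: {scores.mean():.2f}")
```

After fitting, `tree.feature_importances_` gives the per-variable importance used to rank predictors.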
Regularized Logistic Models (LASSO)
Logistic regression can be used to fit a predictive model of the relationship between an observed set of binary decisions Y (dependent) and numerically observed variables X (independent). Logistic regression models the dependence between X and Y as a posterior probability function: the observed variables form a linear combination with suitable coefficients, and the weighted sum is passed through a non-linear function (the logistic) to obtain a value bounded between 0 and 1. To improve the generalisation error and interpretability, especially when samples are limited, the model's coefficients can be penalised to keep them as small as possible ("regularisation"). This can be accomplished by imposing an upper bound on the norm of the coefficient vector: the most common choices are the L^{2} norm, which constrains the sum of their squares, and the L^{1} norm, which limits the sum of their absolute values. In this framework, the LASSO (Least Absolute Shrinkage and Selection Operator) uses an L^{1} penalty. It can be proven that while the L^{2} norm merely shrinks the regression coefficients, the L^{1} norm forces some of them to be exactly zero. This can be seen easily with a geometrical intuition on the Cartesian plane: a sum-of-squares penalty constrains the coefficients to stay in a disk-shaped area, while a sum of absolute values forces them to stay in a diamond-shaped area. In a higher-dimensional space, the diamond becomes a rhomboid. These shapes have many corners, so when the coordinate-descent algorithm that minimises the loss function hits one of them, the coefficients corresponding to that coordinate become zero^{23}. Thus, the advantages of LASSO are both easier interpretation of the fitted model (sparsity) and computational convenience (convexity). The regularised logistic classification model was fitted using the Python library scikit-learn.
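In scikit-learn this corresponds to `LogisticRegression` with an L^{1} penalty; the sketch below uses synthetic data and an arbitrary penalty strength `C` (in practice chosen by cross-validation). Standardising the inputs first matters because the penalty is scale-dependent:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 11 cephalometric variables.
X, y = make_classification(n_samples=91, n_features=11, n_informative=4,
                           random_state=0)
X = StandardScaler().fit_transform(X)   # zero mean, unit variance

# L1-penalised ("LASSO-style") logistic regression; small C = strong penalty.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
lasso.fit(X, y)

# The L1 penalty drives some coefficients to exactly zero (sparsity).
n_zero = int(np.sum(lasso.coef_ == 0))
print(f"{n_zero} of {lasso.coef_.size} coefficients shrunk to exactly zero")
```

The surviving non-zero coefficients identify the variables the sparse model considers relevant.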
Transductive Learning (TL) Approach (SemiSupervised Learning)
In many real-world prediction scenarios it is often easy to collect large amounts of unlabelled (cross-sectional) data, while it is much harder to assign labels, which requires careful work by a domain expert. In orthodontics it is typically easy to collect cross-sectional data, whereas following a patient longitudinally and assigning a label at the end of the observation period is much more difficult. In the last twenty years, several useful algorithms have been developed to learn from both cross-sectional data and longitudinal labelled data, under the names of Semi-Supervised Learning or Transductive Learning^{11,12,14,15}. To exploit the values from the cross-sectional sample, an algorithm based on graph theory^{24,25,26,27} was used with the support of the implementation offered by the Python library scikit-learn. Each cephalometric observation was considered a vertex of a graph on which an appropriate similarity measure (expressed by a radial basis kernel function) was defined among observations; this measure was then used to determine the weights of the edges connecting each observation with the others. An iterative process by which information about true labels is spread between neighbours was then carried out until convergence or some stopping criterion. In this study the stopping criterion was chosen so as to spread the label information until the class balance in the cross-sectional sample was the same as in the longitudinal sample, as this was the only reasonable assumption to work with. For further mathematical and algorithmic details, the reader is referred to Chapelle et al.^{12} and Zhu^{15}. The parameters of the algorithm, such as the radial basis function bandwidth, were chosen with the same rationale as the stopping criterion cited above. This approach allows exploiting the cross-sectional dataset by augmenting the sample size without making use of oversampling techniques or completely "synthetic patients"^{28}.
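The graph-based label-diffusion step can be sketched with scikit-learn's `LabelSpreading`, which implements an RBF-kernel graph of the kind described above. The data, the fraction of unlabelled "cross-sectional" subjects, and the kernel bandwidth `gamma` below are illustrative assumptions, not the study's settings:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelSpreading

# Synthetic cohort: 11 features per subject; binary growth label.
X, y = make_classification(n_samples=300, n_features=11, n_informative=4,
                           random_state=0)

# Mark most subjects as "cross-sectional" (outcome unknown, encoded as -1),
# keeping a small labelled "longitudinal" subset.
y_semi = y.copy()
rng = np.random.default_rng(0)
unlabelled = rng.choice(len(y), size=250, replace=False)
y_semi[unlabelled] = -1

# Labels diffuse over the RBF-kernel similarity graph until convergence.
model = LabelSpreading(kernel="rbf", gamma=0.25)
model.fit(X, y_semi)
print("labels assigned to all subjects:", sorted(set(model.transduction_)))
```

After fitting, `model.transduction_` holds a label for every subject, including the originally unlabelled ones; the augmented labelled set can then be fed to a downstream classifier.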
A comparison was then carried out between simple linear classifiers, such as logistic regression (LASSO), and the Decision Tree Boosting algorithm^{13,25,26,27,28}. While simple linear models or a single Decision Tree offer great interpretability, these statistical approaches may fail to learn complex non-linear relationships between input and output data^{28,29}.
Boosting
Every predictive model has its strengths and weaknesses. If one creates many models in many different ways and then combines them, the resulting "ensemble" of models uses several learning algorithms to obtain better predictive performance. Ensembles tend to yield better results when there is significant diversity among the models^{22,25,26,27}. The key concept of Boosting is to train the predictive algorithms sequentially rather than in parallel. Every datum is weighted, and each new classifier is trained by giving more weight to previously misclassified data. In this way the combination becomes more efficient and the errors tend to decrease: Boosting allows very weak classifiers to be combined into a very strong one^{22}. The reweighting procedure is guided by gradient descent on the logistic loss function. The number of trees in the model, which can be seen as the number of iterations of the procedure, is a parameter to be chosen by cross-validation, as are all the other parameters of tree construction. This has to be done because, in contrast with other ensembles of trees such as random forests^{30}, adding too many trees can lead to overfitting. Boosting can be interpreted as an optimisation algorithm belonging to a family of Machine Learning procedures that convert weak predictive analytics into strong ones. The most common implementation of Boosting is AdaBoost^{13,22}. The Boosting method utilised in this study belongs to the family of Gradient Boosting, of which a very efficient implementation is offered by the XGBoost library^{26}.
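The study used the XGBoost library; as a stand-in, the same idea can be sketched with scikit-learn's `GradientBoostingClassifier`, which likewise fits shallow trees sequentially by gradient descent on the logistic loss. Data and hyperparameters below are illustrative; in practice `n_estimators` and the tree parameters would be chosen by cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 11 cephalometric variables at T1.
X, y = make_classification(n_samples=91, n_features=11, n_informative=4,
                           random_state=0)

# Shallow trees (weak learners) added one at a time; each new tree focuses
# on the residual errors of the current ensemble.
gb = GradientBoostingClassifier(n_estimators=100, max_depth=2,
                                learning_rate=0.1, random_state=0)
scores = cross_val_score(gb, X, y, cv=10, scoring="accuracy")
print(f"10-fold CV accuracy: {scores.mean():.2f}")
```

As with the single tree, `gb.fit(X, y)` followed by `gb.feature_importances_` yields the variable-importance ranking discussed in the Results.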
Feature Engineering
In Data Mining applications a common way to improve accuracy is to combine existing variables to create new ones. Features that look irrelevant in isolation may be relevant in combination. A simple yet powerful procedure for introducing non-linearity when using linear classifiers is to add polynomial features, such as squares or other given functions of the existing variables, or products of feature pairs, often called interaction terms. Linear combinations can also be used, with careful attention when considering models that suffer from collinearity. This process is often called Feature Engineering (FE)^{10,22}. For the Data Mining practitioner, guidance in applying FE is dictated by prior knowledge from a domain expert. The FE process produces a new list of variables to be used. We added the age at T_{1} to the data set, aggregated the two differences GoGn–SN and SNB–SNA as proxies of horizontal maxillomandibular imbalance, and used the sum of the ArGoMe + PPMP + PPSN angles as a proxy of vertical imbalance. To reduce clutter in the ranking of variable importance, some highly correlated variables were removed; this operation was guided by prior domain knowledge about craniofacial growth.
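The engineered features described above amount to simple column arithmetic. The sketch below shows the construction on invented cephalometric values (the column names follow the variable list in Materials and Methods; the numbers are for illustration only):

```python
import pandas as pd

# One hypothetical subject; values are illustrative, not from the study.
df = pd.DataFrame({
    "GoGn": [72.0], "SN": [66.0], "SNB": [80.0], "SNA": [79.0],
    "ArGoMe": [128.0], "PPMP": [26.0], "PPSN": [8.0],
})

# Proxies of horizontal maxillomandibular imbalance.
df["GoGn_minus_SN"] = df["GoGn"] - df["SN"]
df["SNB_minus_SNA"] = df["SNB"] - df["SNA"]
# Proxy of vertical imbalance: sum of the three skeletal angles.
df["vertical_imbalance"] = df["ArGoMe"] + df["PPMP"] + df["PPSN"]

print(df[["GoGn_minus_SN", "SNB_minus_SNA", "vertical_imbalance"]])
```

The new columns are then appended to (or substituted for) the original variables before refitting the classifiers.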
Results
By using a single Classification Tree on the cephalometric data of the 91 longitudinal Class III subjects, we obtained an accuracy score of 63%, as in the previous work of Auconi et al.^{16}. Next, using the combination of TL preprocessing, a Gradient Boosting classifier, and a logistic LASSO model, we observed accuracy scores improved from 63% to 78% (see Table 1). To ensure a fair comparison, the input data to the logistic model were standardized (zero mean, unit variance), while the input data for Boosting were the original ones. This is because LASSO can suffer from numerical problems when the dynamic range of the variables is very different, and in this case it spans two orders of magnitude; moreover, without standardization the resulting coefficients would depend on the units of measure of the variables, making interpretation harder. On the other hand, the tree-based Boosting procedure is independent of the features' scaling and does not suffer from these numerical problems. The f_{1} score is reported along with accuracy, precision, and recall in Table 1. The f_{1} score is the harmonic mean of precision (positive predictive value, the fraction of relevant instances among all retrieved instances) and recall (the fraction of relevant instances that have been retrieved over the total amount of relevant instances). A mean ROC curve across the folds of the 10-fold cross-validation is reported in Fig. 6 for the best model found (GB with TL and FE). This curve, built by varying the probability threshold used to decide between the "Good" and "Bad" classes, allows the practitioner to further tune the predictions to be more or less conservative with false negatives as needed. These results improved on the ability to predict patients at risk of unfavourable growth reported in ref.^{16}, and offered as a side product a classification by TL of the whole unlabelled cross-sectional dataset.
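The metrics reported in Table 1 and the points of the ROC curve can be computed with scikit-learn as sketched below. The true labels, predictions, and probabilities are hypothetical (1 standing for the "Bad grower" class at risk); only the metric definitions reflect the text:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_curve)

# Hypothetical labels and model outputs for ten subjects (1 = "Bad grower").
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1, 1, 0])
y_pred = np.array([1, 1, 0, 0, 1, 1, 0, 1, 1, 0])
y_prob = np.array([.9, .8, .4, .2, .6, .7, .1, .95, .85, .3])

prec = precision_score(y_true, y_pred)   # TP / (TP + FP)
rec = recall_score(y_true, y_pred)       # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)            # harmonic mean of precision/recall

# ROC points: sweep the probability threshold for the "Bad" decision.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)

print(f"accuracy={accuracy_score(y_true, y_pred):.2f} "
      f"precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

Averaging such per-fold curves over the 10 cross-validation folds gives a mean ROC curve like the one in Fig. 6.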
Moreover, the procedures allowed the estimation of the generalization error using a 10-fold cross-validation. In addition, the results confirmed that a simple linear model provided a modest accuracy score of 62%. To assess the statistical significance of the improvement, a Welch test was performed on the K = 10 fold outcomes to check the mean-equality hypothesis; the test provided a p-value < 0.001. The variable-importance ranking (Fig. 2) obtained through the Boosting procedure can be used by the practicing orthodontist to explore how the model uses the variables to make its predictions. The practitioner can then reason about these rankings and verify whether they match his/her experience and the literature. In our experience as practitioners, not only the PP–MP and SNA angles but also the NSAr and SN starting values were important features in differentiating the quality of craniofacial growth among Class III subjects. The Feature Engineering procedure, obtained by aggregating some cephalometric measures to make them more expressive about the risk of maxillomandibular imbalance, produced a further increase in the accuracy of the model (up to 83%; Table 2). The same procedures as in Table 2 were carried out for both the Gradient Boosting classifier and the logistic L^{1} (LASSO) model. After FE, a new ranking of variables was obtained (Fig. 3). The horizontal skeletal imbalance GoGn–SN (GoGn minus SN) was another important finding in characterising growth: Class III subjects showing early maxillomandibular horizontal imbalance tend not to recover the gap during the growth process.
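The Welch test on the K = 10 fold outcomes corresponds to an unequal-variance two-sample t-test on per-fold accuracies. The fold accuracies below are invented for illustration (roughly matching the 63% vs. 78% means reported above), not the study's actual fold scores:

```python
import numpy as np
from scipy.stats import ttest_ind

# Illustrative per-fold accuracies over K = 10 cross-validation folds.
acc_baseline = np.array([.60, .62, .65, .58, .66, .61, .64, .63, .59, .62])
acc_tl_gb    = np.array([.78, .80, .76, .79, .77, .81, .75, .78, .79, .77])

# Welch's t-test: equal_var=False drops the equal-variance assumption.
t_stat, p_value = ttest_ind(acc_tl_gb, acc_baseline, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```

A p-value below 0.001 rejects the hypothesis that the two models have equal mean accuracy across folds.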
The low predictive importance of the aggregation of the three vertical-imbalance angles ArGoMe, PPSN, and PPMP was noteworthy, while the relevance of the single PP–MP angle was confirmed. Likely, among the skeletal angles of vertical imbalance, the signal of unfavourable growth is nested in the latter angle. The orthodontic practitioner can find the predictive algorithm derived from this study, free to use for their patients, at the following webpage: http://malocclusion3.pythonanywhere.com/.
Discussion
Biomedical researchers and practitioners have always dreamed of prognostic predictive models that bind the observable reality with its mechanism of origin, especially if made up of a small number of easily identifiable parameters. In order to progress, a medical discipline must have data and reasoning commensurate with the complexity of the phenomena of interest. Data Mining and Machine Learning are disciplines of data science that aim to uncover unknown relationships and patterns from datasets whenever this is possible^{4,5,28,29,31}. From the beginning, these approaches to medicine have been sources of controversy: the main doubts concerned to what extent medical knowledge could truly be compressed and automatically coded into rules, and whether these aggregates could lead to an unavoidable amount of oversimplification^{4,5}. Indeed, early techniques seemed to have modest practical value in handling the complexity of biomedical features appropriately, and in recognizing the clinical variation of a disease, such as change in severity over time. It was difficult to capture relevant diagnostic information in strict rule-based systems with an acceptable degree of reliability, or to cope with rules interacting with many other rules in complex and unpredictable ways^{32}. These concerns have been overcome thanks to the construction of reliable, innovative models such as Decision Trees, Bayesian networks, and Boosting procedures. It must be remembered, however, that inference is always done in a probabilistic setting: even increasingly reliable algorithms and sophisticated validation procedures can still fail^{6,9,10}. When considering scenarios in clinical orthodontics, it is easy to claim that hypotheses about the causal laws affecting the onset and progression of disharmonies are probabilistic, and hence fallible^{33,34}.
In current clinical practice, the orthodontist's judgments are constrained by limited and subjective experience, incomplete knowledge, and an oversimplified taxonomy of malocclusions that underestimates relevant individual qualitative aspects of disharmonies. Causes can be reversed, causes and effects are interchangeable, treatment effects can be the result of multiple causes, and an identified cause is not necessarily the only one^{18}. Consequently, orthodontic researchers and practitioners have had to develop a kind of "inexact reasoning" based on a set of hypotheses concerning the probabilistic relationships between combinations of early orthodontic signs, malocclusion progression, treatment decisions, and outcomes^{29,31,34}. Practicing orthodontists must consider questions like: "Can I find correlations between specific craniofacial characteristics and patterns of future growth?", "What are the most influential skeletal sites useful for my treatment purposes?", and "What co-occurrence of dentoskeletal characteristics makes my treatment difficult?". With the progression of treatment and a better understanding of the characteristics of the patient, these questions paradoxically become more complex and difficult to answer. It is within this context that the cognitive tasks of Machine Learning could play a role in orthodontic diagnosis and prognosis. At the beginning, various algorithms are applied to generate a prediction model that finds relationships correlating with events and non-events^{3}. The success of the predictive model derives from the accuracy of this process; there are often several parameters that need to be set. The first stage of the prognostic process involves making an initial judgment about whether a Class III patient is likely to have a bad/good growth progression, based on previous experience and knowledge of the literature.
Unlike the medical practitioner, the orthodontist does not have the option of exploiting a post-test probability (such as, for example, an exercise stress test for coronary disease). Predictive models, by their very nature, are based on the past and assume that the patterns of the past repeat themselves. In the orthodontic discipline this assumption is obviously problematic. In the present work we reanalysed the same longitudinal data set of Auconi et al.^{16}, combining it, through a TL procedure, with cross-sectional data on 728 untreated Class III subjects (see Table 3 for the complete correlation matrix of the variables considered). Subsequently, we submitted the data to Boosting and FE procedures. This sequence increased the predictive capability for adverse growth from 62% to 83%, through a further deepening of the data structure and a careful matching between longitudinal and cross-sectional cephalometric characteristics (see Fig. 7 for the histograms and scatter plots by age group). Despite the increasing popularity of several predictive models from Machine Learning applied to medicine, some statisticians have argued against the superiority of these sophisticated methods in comparison to earlier, simpler methods such as Discriminant Analysis (the "illusion of progress")^{35}. Moreover, as has already happened in other biomedical fields, in the orthodontic discipline too there is concern that the allure of novel computational investigative methods could lead to misuse of information and inconsistent clinical analyses, further distracting the clinician from properly looking at the morphological and functional individuality of the patient^{35,36,37,38,39}.
This work shows that constructing an appropriate predictive model of Class III progression, together with a more in-depth understanding of the critical issues related to dentoskeletal disharmonies during the growth process, can support the daily work of the clinical orthodontist, reducing the bias of subjective judgment.
Data Availability
Orthodontic practitioners can find the predictive algorithm derived from this study, free to use for their patients, at the following webpage: http://malocclusion3.pythonanywhere.com/.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
 1.
Sullivan, R. Introduction to Data Mining for Life Science (Springer, 2012).
 2.
Russell, S. J. & Norvig, P. Artificial Intelligence: A Modern Approach (Prentice Hall, 2009).
 3.
Finlay, S. Predictive Analytics, Data Mining, Big Data (Palgrave, 2014).
 4.
Kononenko, I. Machine Learning for medical diagnosis: history, state of the art and perspective. Artificial Intelligence in Medicine 23 (2001).
 5.
Sajda, P. Machine learning for detection and diagnosis of disease. Annual Review of Biomedical Engineering 8 (2006).
 6.
Bellazzi, R. & Zupan, B. Predictive Data Mining in clinical medicine: current issues and guidelines. International Journal of Medical Informatics 77 (2008).
 7.
Barabási, A. L., Gulbahce, N. & Loscalzo, J. Network medicine: a networkbased approach to human disease. Nature Reviews Genetics 12, 56–68 (2011).
 8.
Forestier, G. & Wemmert, C. Semi-supervised learning using multiple clusterings with limited labeled data. Information Sciences 361, 48–65 (2016).
 9.
Baccetti, T., Franchi, L. & McNamara, J. A. Jr. Growth in the untreated Class III subjects. Seminars in Orthodontics 13, 130–142 (2007).
 10.
Kauser, A. Analysis of data mining tools for disease prediction. Journal of Pharmaceutical Science and Research 9, 1886–1888 (2017).
 11.
Domingos, P. A few useful things to know about machine learning. Communications of the ACM 55, 78–87 (2012).
 12.
Chapelle, O., Scholkopf, B. & Zien, A. Semi-Supervised Learning (MIT Press, Cambridge, MA, 2006).
 13.
Salzberg, S. On comparing classifiers: pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery 1, 317–327 (1997).
 14.
Freund, Y. & Schapire, R. A decisiontheoretic generalization of online learning and an application to Boosting. Journal of Computer and System Sciences 55, 3–43 (1997).
 15.
Zhu, X. Semisupervised learning literature survey. typeTech. Rep. TR 1530 (Computer Sciences, Univeristy of Wisconsin, Madison, 2007).
 16.
Auconi, P. et al. Understanding interactions among cephalometrics variables during growth in untreated Class III subjects. The European Journal of Orthodontics 39, cjw084 (2017).
 17.
Auconi, P., Scazzocchio, M., Defraia, E., McNamara, J. A. Jr. & Franchi, L. Forecasting craniofacial growth in individuals with Class III malocclusion by computational modeling. European Journal of Orthodontics 89, 207–213 (2014).
 18.
Scala, A. et al. Using networks to understand medical data: the case of Class III malocclusions. PloS one 7, e44521 (2012).
 19.
Reyes, B., Baccetti, T. & McNamara, J. A. Jr. An estimate of craniofacial growth in Class III malocclusion. The Angle Orthodontist 76, 577–584 (2006).
 20.
McNamara, J. A. Jr. A method of cephalometric evaluation. American Journal of Orthodontics 86, 449–470 (1984).
 21.
Bathia, S. N. & Leighton, B. C. A Manual of Facial Growth. (Oxford University Press, Oxford (UK), 1993).
 22.
Hastie, Trevor & Tibshirani. The Elements of Statistical Learning (SpringerVerlag, 2009).
 23.
Hastie T., Tibshirani R. & Wainwright M. Statistical Learning with Sparsity (Springer–Verlag, 2015).
 24.
Caldarelli, G. ScaleFree Networks (Oxford University Press, 2007).
 25.
Fatima, M. & Pasha, M. Survey of machine Learning algorithms for disease diagnostic. Journal of Intelligent Learning Systems and Applications 9, 1–16 (2017).
 26.
Chen, T. & Guestrin, C. A scalable Tree Bosting System. In 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794 (ACM, 2016).
 27.
Schapire, R. E. The strength of weak learnability. Machine Learning 5, 197–227 (1990).
 28.
Chawla, N. V. Data Mining for imbalanced dataset: an overview. In Maimon, O. & Rokach, L. (eds) Data Mining and Knowledge Discovery Handbook (Springer, 2005).
 29.
Henrion, M., Breese, J. S. & Horvitz, E. J. Decision analysis and expert systems. Artificial Intelligence Magazine 4, 64–91 (1991).
 30.
Breiman, L. Random Forest. Machine Learning 45, 5–32 (2001).
 31.
Pis, P. & Mesiar, R. Fuzzy model of inexact reasoning in medicine. Computer methods and Programs in Biomedicine 30, 1–8 (1989).
 32.
Buntine, W. A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering 8, 195–210 (1996).
 33.
Scutari, M., Auconi, P., Caldarelli, G. & Franchi, L. Bayesian Networks Analysis of Malocclusion Data. Scientific Reports 7, 15236 (2017).
 34.
Vig, P. Reflection on the rationality of orthodontics. In: Science and Clinical Judgment in Orthodontics, Vig, K. & Ribbens, K. A. (eds) 9, 15–61 (1986).
 35.
Hand, D. Classifier technology and the illusion of progress. Statistical Science 21, 1–14 (2006).
 36.
Noroozi, H. Orthodontic treatment planning software. American Journal of Orthodontic and dentofacial Orthopedics 129, 834–837 (2006).
 37.
Friedman, J. Stochastic gradient boosting. Computational Statistics and Data Analysis 38, 278–367 (2002).
 38.
Powers, D. Evaluation: From Precision Recall and FMeasure to ROC Informedness, Markedness & Correlation. Journal of Machine Learning Technologies 2, 37–63 (2011).
 39.
Shtulman, A. Scienceblind. Why our intuitive theories about the world are so often wrong. (Basic Books, NY, 2017).
Author information
Affiliations
OnAIR Ltd, Genoa, Italy
 Enrico Barelli
 & Ennio Ottaviani
Department of Mathematics, Università degli Studi di Genova, Genoa, Italy
 Ennio Ottaviani
Private Practice of Orthodontics, Roma, Italy
 Pietro Auconi
IMT School for Advanced Studies, Piazza San Francesco 19, 55100, Lucca, Italy
 Guido Caldarelli
Istituto dei Sistemi Complessi CNR, Unità Sapienza, Dip. Fisica, P.le A. Moro 2, 00185, Rome, Italy
 Guido Caldarelli
ECLT, San Marco 2940, 30124, Venice, Italy
 Guido Caldarelli
Department of Experimental and Clinical Medicine, Orthodontics, Università degli Studi di Firenze, Florence, Italy
 Veronica Giuntini
 & Lorenzo Franchi
Department of Orthodontics and Pediatric Dentistry School of Dentistry, University of Michigan, Ann Arbor, Michigan, USA
 James A. McNamara Jr.
 & Lorenzo Franchi
School of Medicine and Center for Human Growth and Development, The University of Michigan, Ann Arbor, Michigan, USA
 James A. McNamara Jr.
Contributions
All the authors contributed equally to the determination of the scientific questions and to the interpretation of the results of this analysis. L.F., V.G. and J.A.Mc. provided the data; E.B. and E.O. performed the numerical analysis and wrote the free software available to the readers. All the authors contributed equally to the writing and reviewing of the manuscript.
Competing Interests
The authors declare no competing interests.
Corresponding author
Correspondence to Guido Caldarelli.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.