Introduction

Kabuki syndrome (KS) is a rare genetic disorder, with an estimated prevalence of 1:86,000 to 1:32,0001,2,3. The typical KS face includes long palpebral fissures associated with eversion of the lateral third of the lower eyelid; long and heavy lashes giving the impression of made-up eyes; broad, arched and interrupted eyebrows; broad, depressed nasal tip; and prominent, cupped ears1,2,4. Extra-facial features include mild to moderate intellectual disability, visceral malformations, skeletal dysplasia and immunological manifestations5. KS has been described in all ethnic groups6,7.

More than 80% of KS patients have a pathogenic variant in the coding regions of KMT2D (KS type 1, KS1, OMIM147920), and around 10% of patients have a pathogenic variant in the KDM6A gene (KS type 2, KS2, OMIM300128)8,9,10,11,12.

Improving syndrome screening in clinical genetics is crucial for reducing diagnostic wandering. In France, the 7000 rare diseases identified to date affect 4.5% of the population; half of them affect children under the age of 5, and they account for 10% of deaths between 0 and 5 years. Around 50% of patients are not diagnosed, and for the remaining 50%, diagnostic wandering reaches an average of 5 years13. Diagnostic wandering is defined as the failure to identify the precise cause of a disease after all available investigations have been performed. Applications of Artificial Intelligence (AI) are increasing in healthcare14,15,16,17. The field of dysmorphology has been changed by these new methods, under the name of Next Generation Phenotyping (NGP)18. Publications comparing human performance to NGP are flourishing19,20,21,22, and some suggest that digital tools outperform human experts in diagnosis: Dudding-Byth et al.23 showed a better performance of NGP compared to clinicians in a group of ten genetic syndromes, not including KS; Rouxel et al.5 compared the performance of the DeepGestalt technology18 using the Face2Gene online tool (FDNA Inc. Boston, MA, USA) to the performance of clinicians trained in the recognition of KS1 and KS2.

The aim of this study was to develop an NGP model for the diagnosis of KS and for distinguishing KS1 from KS2. We trained and validated the model on a large national and international multi-center cohort of patients of all ages and ethnicities. The distinctive feature of this approach was the integration of lateral pictures, including the outline of the cranial vault and the position of the ears, and of the morphology of the external ear, in addition to frontal pictures.

Materials and methods

The study was approved by the Comité Éthique et Scientifique pour les Recherches, les Études et les Évaluations dans le domaine de la Santé (CESREES), №4570023bis, the Commission Nationale Informatique et Libertés (CNIL), №MLD/MFI/AR221900, the Institutional Review Board, Faculty of Medicine, Chulalongkorn University (IRB 264/62), and in accordance with the 1964 Helsinki declaration and its later amendments. Informed and written consents were obtained from the legal representatives of each child or from the patients themselves if they were of age.

Photographic dataset

We included most pictures from the photographic database of the Maxillofacial surgery and Plastic surgery department of Hôpital Necker—Enfants Malades (Assistance Publique—Hôpitaux de Paris), Paris, France. This database contains 594,000 photographs from 22,000 patients, and all pictures since 1995 were taken by a professional medical photographer using a Nikon D7000 device in standardized positions.

We included retrospectively and prospectively, from 1995 to 2023, all frontal and lateral pictures of patients diagnosed with KS. The photographs were not calibrated. All patients had genetic confirmation of KS (KMT2D or KDM6A). We excluded all photographs taken after any surgical procedure that could have modified the craniofacial morphology. Multiple photographs per patient corresponded to different ages of follow-up. Duplicates were excluded.

Controls were selected among patients admitted for lacerations, trauma, infection and various skin lesions, without any record of chronic conditions. More precisely, follow-up for any type of chronic disease was considered as an exclusion criterion. The reports were retrieved using the local data warehouse Dr Warehouse24. For each patient, the best lateral view was included.

Data from five other medical genetics departments were also included according to the same criteria: (1) Montpellier University Hospital (n = 32), (2) Grenoble University Hospital (n = 1), (3) Tours University Hospital (n = 1), (4) King Chulalongkorn Memorial Hospital Bangkok, Thailand (n = 8), and (5) Lausanne University Hospital, Lausanne, Switzerland (n = 1).

Validation set

For designs №1 and №2, we randomly selected a group of individuals corresponding to 10% of the number of patients with KS, and the equivalent number of control patients. These patients were removed from the training set. The two sets were therefore independent.
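The hold-out procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the patient identifiers, the fixed seed and the rounding rule are all assumptions. The key point is that sampling is done per patient rather than per photograph, so that no individual contributes to both sets.

```python
import random

def split_validation(ks_ids, control_ids, fraction=0.10, seed=0):
    """Hold out `fraction` of KS patients and an equal number of controls.

    Sampling is done on patient identifiers (hypothetical here), so all
    photographs of a held-out patient leave the training set together.
    """
    rng = random.Random(seed)
    n_val = round(len(ks_ids) * fraction)  # rounding rule is an assumption
    val_ks = set(rng.sample(sorted(ks_ids), n_val))
    val_ctrl = set(rng.sample(sorted(control_ids), n_val))
    train_ks = set(ks_ids) - val_ks
    train_ctrl = set(control_ids) - val_ctrl
    return (train_ks, train_ctrl), (val_ks, val_ctrl)

# Made-up identifiers, sized like the cohort (107 KS patients, 527 controls).
ks = [f"KS{i:03d}" for i in range(107)]
controls = [f"C{i:03d}" for i in range(527)]
(train_ks, train_ctrl), (val_ks, val_ctrl) = split_validation(ks, controls)
```

Splitting on identifiers rather than on images is what makes the two sets independent when several photographs exist per patient.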

Landmarking

We used three different templates based on 105 landmarks for the frontal views, 73 for the lateral views and 41 for the external ear pictures. We developed an automatic annotation model for each template following a pipeline including: (1) detection of the Region Of Interest (ROI) and (2) automatic placement of the landmarks.

For ROI detection, a Faster Region-based Convolutional Neural Network (RCNN) model was trained after data augmentation (images and their + 10° and − 10° rotations), with a learning rate of 0.001, a batch size of 4, a gamma of 0.05 and 2000 iterations, optimized and split into two stages: ROI detection and determination of profile laterality.

(1) ROI detection: Faster RCNN trained on 15,633 images, after data augmentation (images and their + 10° and − 10° rotations): 6186 frontal images (2062 × 3) and 9447 right and left profile images (3149 × 3). The batch size was 2, learning rate was 0.0025, and the maximum number of iterations was 2800.

(2) Determination of profile laterality: pre-trained ResNet50 network25 using the PyTorch library26. The training images included 1570 left profiles and 1579 right profiles. The batch size was 16, an Adam optimizer27 was used with a learning rate of 0.001, a step of 7, and a gamma of 0.1, trained over 25 epochs.
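The "step of 7" and "gamma of 0.1" are most naturally read as a step-decay learning-rate schedule (as provided, for example, by a step scheduler in PyTorch), although the text does not name the scheduler. Under that assumption, the learning rate is multiplied by gamma every 7 epochs. The helper below is a hypothetical, dependency-free sketch of that rule only.

```python
def step_lr(base_lr, step_size, gamma, epoch):
    """Learning rate at a given epoch under step decay (assumed schedule)."""
    return base_lr * gamma ** (epoch // step_size)

# Values quoted in the text: lr = 0.001, step = 7, gamma = 0.1, 25 epochs.
schedule = [step_lr(0.001, 7, 0.1, epoch) for epoch in range(25)]

# Print only the epochs where the rate changes.
for epoch in (0, 7, 14, 21):
    print(epoch, schedule[epoch])
```

With these values the rate is divided by ten at epochs 7, 14 and 21, so the last epochs fine-tune with a rate a thousand times smaller than the initial one.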

For the automatic placement of landmarks, we used a patch-based Active Appearance Model (AAM) using the menpo library on Python 3.728. We have previously reported the relevance of this approach29. We used two-scale landmarking: the model for frontal pictures was trained on 904 manually annotated photographs, with a first stage of dimensioning (diagonal = 150), a patch shape of [(15, 15), (23, 23)] and 50 iterations and a second stage without resizing, with a patch shape of [(20, 20), (30, 30)] and 10 new iterations. The model for profile pictures was trained on 1,439 manually annotated photographs, with a first stage of dimensioning (diagonal = 150), a patch shape of [(15, 15), (23, 23)] and 25 iterations and a second stage without resizing, with a patch shape of [(15, 15), (23, 23)] and 5 new iterations. The model for ears was trained on 1221 manually annotated photographs, with a first stage of dimensioning (diagonal = 100), a patch shape of [(15, 15), (23, 23)] and 50 iterations and a second stage without resizing, with a patch shape of [(20, 20), (30, 30)] and 20 new iterations. All three models used the Lucas Kanade optimizer30.

Each automatically annotated photograph was checked by two authors blinded to the diagnosis, QH and MD, and landmarks were manually re-positioned when necessary, using landmarker.io31. The Intraclass Correlation Coefficient (ICC) was computed between the raters. ICC values greater than 0.9 corresponded to excellent reliability of the manual annotation32.

Geometric morphometrics

We performed Generalized Procrustes Analysis (GPA)33 on all landmark clouds using the geomorph package on R34. Since the data were uncalibrated photographs, ROI sizes were not available: shape parameters only were assessed and not centroid sizes. Procrustes coordinates were processed using Principal Component Analysis (PCA) for dimension reduction. We retained the principal components explaining 99% of the total variance in cumulative sum. The last 1% was considered as negligible information.
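To make the two steps of this paragraph concrete, the sketch below superimposes one 2D landmark configuration onto a reference (translation, scaling to unit centroid size, optimal rotation) and then selects the number of principal components needed to reach 99% of cumulative variance. It is a simplified illustration under stated assumptions: the study used the geomorph package in R, full GPA iterates this pairwise alignment against an evolving mean shape, and the landmark coordinates and variances here are invented.

```python
import cmath
import math

def normalize(shape):
    """Center a 2D landmark set and scale it to unit centroid size."""
    pts = [complex(x, y) for x, y in shape]
    centroid = sum(pts) / len(pts)
    centered = [p - centroid for p in pts]
    size = math.sqrt(sum(abs(p) ** 2 for p in centered))
    return [p / size for p in centered]

def align(shape, reference):
    """Rotate the normalized shape onto the normalized reference (least squares)."""
    ref, mov = normalize(reference), normalize(shape)
    s = sum(r.conjugate() * m for r, m in zip(ref, mov))
    rotation = s.conjugate() / abs(s)  # unit complex number = optimal rotation
    return [rotation * m for m in mov], ref

def n_components(variances, threshold=0.99):
    """Smallest number of leading components reaching the cumulative variance share."""
    total, running = sum(variances), 0.0
    for k, v in enumerate(variances, start=1):
        running += v
        if running / total >= threshold:
            return k
    return len(variances)

# Invented example: the same shape rotated by 30 degrees, doubled and shifted.
reference = [(0, 0), (4, 0), (5, 3), (1, 2)]
spin = cmath.rect(2.0, math.radians(30))
moved = [complex(x, y) * spin + complex(10, -5) for x, y in reference]
moved = [(z.real, z.imag) for z in moved]

aligned, ref = align(moved, reference)
max_error = max(abs(a - r) for a, r in zip(aligned, ref))

# Invented component variances, sorted in decreasing order.
kept = n_components([6.0, 3.0, 0.95, 0.04, 0.01])
```

Because position, scale and orientation are removed, only shape differences remain, which is why centroid size is not needed for uncalibrated photographs.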

Texture extraction

We partitioned the frontal and profile pictures into key areas and applied textural feature extraction methods to each zone, which allowed us to check the results and determine which zone had contributed most to the diagnosis.

We defined 14 key areas that could potentially contribute to diagnosis: 11 on frontal views (right/left eyes, right/left eyebrows, glabella, forehead, nasal tip, philtrum, right/left cheeks, and chin) and 3 on lateral views (pre-auricular region, eye, and zygoma relief). Each zone was extracted automatically using the previously placed landmarks.

We used the Contrast Limited Adaptive Histogram Equalization (CLAHE) algorithm for histogram equalization, as previously reported before the use of feature extractors35,36. CLAHE enhanced contrast by evenly dispersing gray values37 and by reducing the influences of illumination during picture capture and of skin color. Kiflie et al. recommended CLAHE as a first choice equalization method38.

Gray-Level Co-occurrence Matrix (GLCM) methods, as proposed by Haralick39, are based on the estimation of the second-order joint conditional probability density functions, which characterize the spatial relationships between pixels. GLCM is commonly used in texture analysis40,41, for instance in radiomics on CT-scan or MRI images42,43,44 or for skin texture assessment45. In GLCM, the co-occurrence matrix contains information on entropy, homogeneity, contrast, energy and correlation between pixels. GLCM includes 28 features, taking into account the average and range for each item of information and for each zone, representing 28 × 14 = 392 textural features for each patient.
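As an illustration of what a co-occurrence matrix captures, the sketch below builds a symmetric, normalized GLCM for a single pixel offset on a tiny 4-level image and derives a few of the descriptors named above. It is a didactic example only: the image is invented, the function names are hypothetical, and the study's implementation, offsets and gray-level quantization are not specified in the text.

```python
import math

def glcm(image, levels, dx=1, dy=0):
    """Symmetric, normalized co-occurrence matrix for one pixel offset."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1  # pair read left to right
                counts[j][i] += 1  # same pair read right to left
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]

def texture_features(p):
    """A few classical descriptors computed from a normalized GLCM."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    return {
        "contrast": sum(v * (i - j) ** 2 for i, j, v in cells),
        "angular_second_moment": sum(v * v for _, _, v in cells),
        "homogeneity": sum(v / (1 + (i - j) ** 2) for i, j, v in cells),
        "entropy": -sum(v * math.log(v) for _, _, v in cells if v > 0),
    }

# Invented 4 x 4 image with gray levels 0..3.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
matrix = glcm(image, levels=4)
features = texture_features(matrix)
```

Computing the same descriptors for several offsets and summarizing them by their average and range is one common way to obtain direction-independent features per zone.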

Stratification using metadata

The textural features and the geometric principal components were combined for further analysis. To account for associated metadata (age, gender and ethnicity) and for the fact that we included more than one photograph per patient (that is, the non-independence of the data), a mixed model was fitted for each feature. The response variables were the features (geometric and textural), with age, gender and ethnicity considered as explanatory variables. A random effect of age per individual was introduced. The equation of the mixed model was:

$$\boldsymbol{Features}_{i,j} \sim \alpha + age.\beta_{1} + gender.\beta_{2} + ethnicity.\beta_{3} + age.\beta_{1,i} + \varepsilon_{i,j}$$

where \(age.\beta_{1,i}\) corresponded to a random slope for age per individual, and \(\varepsilon_{i,j}\) was a random error term. We did not include interaction terms between age and gender or between age and ethnicity, as they did not increase the likelihood of the model. Age, gender and ethnicity are significant factors in dysmorphology46,47.

The residuals of each feature were computed to consider potential biases linked to the metadata:

$$\boldsymbol{\varepsilon}_{i,j} = \boldsymbol{Features}_{i,j} - \left( \alpha + age.\beta_{1} + gender.\beta_{2} + ethnicity.\beta_{3} + age.\beta_{1,i} \right)$$
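The idea of residualizing a feature can be illustrated with a deliberately reduced example: an ordinary least-squares fit of one feature on age alone, followed by subtraction of the whole fitted predictor. This is not the mixed model used in the study (gender, ethnicity and the per-individual random slope are omitted), and the data and function names are invented.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and slope for one covariate."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta

def residuals(x, y):
    """Observed value minus the full fitted predictor (intercept + slope * x)."""
    alpha, beta = fit_line(x, y)
    return [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]

# Invented data: one feature measured at five ages (years).
ages = [1.0, 2.0, 3.0, 4.0, 5.0]
feature = [2.1, 3.9, 6.2, 7.8, 10.0]
res = residuals(ages, feature)
```

By construction the residuals sum to zero and are uncorrelated with age, so a downstream classifier sees the part of the feature that the covariate does not explain.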

Classification model

The inputs to the model were the residuals from the linear models described above, for each geometric or textural feature. We used eXtreme Gradient Boosting (XGBoost), a supervised machine learning classifier, for all the analyses48. We chose a tree-based booster, and the loss function to be minimized was the logistic loss for binary classification. We set several hyperparameters to improve the performance of the model: learning rate = 0.3, gamma = 0, maximum tree depth = 6. The model with the lowest error rate was chosen for analysis. We separated the dataset into a training set and a testing set, and a five-fold cross-validation was used to define the ideal number of iterations to avoid overfitting.

The chosen model with the ideal number of iterations was then used on the independent validation set to test performances, by plotting accuracy and AUC. The Receiver Operating Characteristics (ROC) curves were plotted in R using the plotROC package49. We used the DeepGestalt tool proposed by Face2Gene CLINIC on our validation set, to be able to compare its performance (accuracies).
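For readers less familiar with the two metrics, the sketch below computes accuracy at a fixed decision threshold and the AUC as the probability that a randomly chosen case receives a higher score than a randomly chosen control (ties counting for one half). The labels and scores are invented, and the 0.5 threshold is an assumption; the study itself produced its ROC curves in R.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Share of samples whose thresholded prediction matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

def auc(y_true, y_prob):
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented example: 1 = KS, 0 = control, with predicted probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

acc = accuracy(labels, scores)
area = auc(labels, scores)
```

Unlike accuracy, the AUC does not depend on the chosen threshold, which is why both are reported side by side.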

Uniform Manifold Approximation and Projection (UMAP) representations

The residuals \(\varepsilon_{i,j}\) were represented using UMAP, a nonlinear dimension reduction technique50, for visual clustering. We retained the residuals associated with features with a classification gain (in their cumulative sum) > 0.75 in the importance matrix associated with the XGBoost model. A k (local neighborhood size) value of 15 was used. A cosine metric was introduced to compute distances in high dimensional spaces: the effective minimal distance between embedded points was \(10^{ - 6}\). The three conditions of UMAP, namely uniform distribution, local constancy of the Riemannian metric and local connectivity were verified. UMAP analyses were performed using the package umap on R51 (Fig. 1).
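The feature-selection rule based on cumulative gain admits more than one reading. One possible interpretation, sketched below, ranks features by decreasing gain and keeps the leading ones until their cumulative share of the total gain exceeds 0.75. This reading, the function name and the gain values are all assumptions made for illustration.

```python
def select_by_cumulative_gain(gains, threshold=0.75):
    """Keep top-ranked features until their cumulative gain share exceeds threshold."""
    total = sum(gains.values())
    ranked = sorted(gains.items(), key=lambda item: item[1], reverse=True)
    kept, running = [], 0.0
    for name, gain in ranked:
        kept.append(name)
        running += gain / total
        if running > threshold:
            break
    return kept

# Invented importance values for hypothetical geometric and textural features.
gains = {
    "PC1": 0.40,
    "eyebrow_contrast": 0.25,
    "PC3": 0.15,
    "ear_entropy": 0.12,
    "PC7": 0.08,
}
selected = select_by_cumulative_gain(gains)
```

Restricting the embedding to the most informative features keeps the UMAP projection focused on the dimensions the classifier actually used.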

Figure 1

Analysis pipeline, from the initial photograph to diagnostic probability. ROI, Region Of Interest; AAM, active appearance model; Faster RCNN, Faster Region-based Convolutional Neural Network; CLAHE, Contrast Limited Adaptive Histogram Equalization; GLCM, Gray-Level Co-occurrence Matrix; XGBoost, eXtreme Gradient Boosting.

Classification designs

  1. Design №1, syndrome diagnosis support: KS was tested against controls in a binary classification.

  2. Design №2, genotype–phenotype correlations: KS1 and KS2 were tested in binary classifications.

  3. Design №3, genotype–phenotype correlations: KS1 Protein-Altering Variants (PAVs) and Protein-Truncating Variants (PTVs) were tested in binary classifications.

Ethics approval

This study was performed in line with the principles of the Declaration of Helsinki. Approval was granted by the CESREES (17/06/2021, 4570023).

Consent to participate

Written informed consent was obtained from the parents.

Consent to publish

The authors affirm that human research participants provided informed consent for publication of the images in Figs. 1, 4 and 7.

Results

Population description

We included 1448 frontal and lateral facial photographs taken between 1998 and 2023, corresponding to 634 patients. The mean age was 7.2 ± 4.2 years (range 0 to 40.2 years); 52% were girls. Ethnicity was 92% Caucasian, 6% African or Caribbean, and 3% Asian.

The control group comprised 1084 photographs, corresponding to 527 patients with a mean age of 7.0 ± 4.6 years. Fifty-four percent were girls and ethnicities were 93% Caucasian, 5% African/Caribbean, and 2% Asian.

The KS group comprised 364 photographs, corresponding to 107 patients with a mean age of 7.8 ± 6.7 years. Forty-two percent were girls and ethnicities were 85% Caucasian, 7% African/Caribbean, and 8% Asian. Seventy-eight percent of patients were KS1 (Table 1).

Table 1 Clinical description of the cohort.

Two patients had a genetically confirmed diagnosis of KS, but we had no information on the causal gene. We thus collected information on genetic variation for 105 KS individuals with 82 (78%) and 23 (22%) with variations in KMT2D (KS1) and KDM6A (KS2) respectively.

In the KS1 group, 74% of variants were PTVs, with 49% variants leading to a premature stop codon (24% nonsense, 24% frameshift) and 26% splice donor site variants. Eighteen percent were PAVs, with 17% missense variants and 1% in-frame indel.

In the KS2 group, 78% of variants were PTVs, with 43% variants leading to a premature stop codon (30% nonsense, 13% frameshift), 30% splice donor site variants and 4% large deletions. Nine percent were missense PAVs (Table 2).

Table 2 Molecular description of the cohort.

Design №1: KS vs controls

1. Phenotype

We confirmed the usual characteristics described in KS: high and arched eyebrows, long palpebral fissures, and large and prominent ears (Fig. 2).

Figure 2

Average shapes in KS and controls and comparisons after Procrustes superimposition of frontal views, profile views, and external ears for three age groups. Blue = controls, Dark red = KS.

2. Classification

We were able to distinguish KS vs controls in the independent validation group with an accuracy of 95.8% (78.9–99.9%, p < 0.001). AUCs were comparable in the training set (0.994) and in the validation set (0.993) (Fig. 3, Table 3).
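The interval quoted above is consistent with an exact (Clopper-Pearson) binomial confidence interval for 23 correct classifications out of 24, although the method is not named in the text. The sketch below reproduces it with a simple bisection on the binomial distribution; the function names are hypothetical and the choice of the exact method is an assumption.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def solve_increasing(f):
    """Root on [0, 1] of a function that increases from negative to positive."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided confidence interval for a binomial proportion k / n."""
    if k == 0:
        lower = 0.0
    else:
        # Lower bound: P(X >= k | p) equals alpha / 2.
        lower = solve_increasing(lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2)
    if k == n:
        upper = 1.0
    else:
        # Upper bound: P(X <= k | p) equals alpha / 2.
        upper = solve_increasing(lambda p: alpha / 2 - binom_cdf(k, n, p))
    return lower, upper

# 23 of the 24 validation patients correctly classified (95.8%).
lower, upper = clopper_pearson(23, 24)
print(f"{23 / 24:.1%} ({lower:.1%} to {upper:.1%})")
```

The width of the interval reflects the small size of the validation group: with 24 patients, a single additional error would shift the point estimate by about four percentage points.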

Figure 3

(A) Empirical ROC curves (training set) for KS with AUC in design №1. (B) ROC curves (validation set) for KS with AUC in design №1. AUC, area under the curve; KS, Kabuki Syndrome.

Table 3 Classification performances for design №1 (KS vs controls) in the validation group.

Ten out of eleven patients were correctly predicted as KS with our model, and this performance was the same using Face2Gene CLINIC (Supp. Table 1). In addition, we were able to predict all control patients (Fig. 4, Table 4).

Figure 4

Classification using design №1 for proband 3 of the validation set. (A) and (B) Frontal and profile views of proband 3. (C) UMAP representation of the training data according to the two groups, with positioning of proband 3. (D) Histogram of predictions by the model. This child was also detected as KS by Face2Gene CLINIC. KS, Kabuki Syndrome.

Table 4 Confusion matrix for design №1 (KS versus controls) in the validation group.

Design №2: KS1 vs KS2

1. Phenotype

KS2 individuals had a rounder face (HP:0000311), a shorter nose (HP:0003196), a thicker upper lip (HP:0000215), anteverted nostrils (HP:0000463), and a shorter midface (HP:0011800). There was no obvious difference in the eyebrows and eyes. The external ears were more elongated vertically in KS2 (HP:0400004), with a hypoplastic lobe (HP:0000385), and with a counter-clockwise rotation. The conch seemed more vertical in KS1 (Fig. 5).

Figure 5

Average shapes in KS1 and KS2 and comparisons after Procrustes superimposition of frontal views, lateral views, and external ears for three age groups. Orange = KS1, Dark red = KS2.

2. Classification

The model was able to distinguish KS1 from KS2 with an empirical AUC of 0.805 (0.729–0.880, p < 0.001) (Figs. 6, 7). This trend was found in the validation group, with an accuracy of 70% without reaching the significance threshold (Tables 5 and 6).

Figure 6

Empirical ROC curve (training set) for KS2 with AUC in design №2. AUC, Area Under the Curve; KS, Kabuki Syndrome.

Figure 7

Classification using design №2 for two probands of the training set. (A, B, E and F) Frontal and profile views of the two probands. (C and G) UMAP representations of the training data according to the two groups, with positioning of the probands. (D and H) Histograms of predictions by the model. The phenotype included a reduced height of the midface, a thicker upper lip, and a vertical elongation of the external ear in the KS2 group (E and F). KS, Kabuki Syndrome.

Table 5 Classification performances for design №2 (KS1 versus KS2) in the validation group.
Table 6 Confusion matrix for design №2 (KS1 versus KS2) in the validation group.

Design №3: PTV vs PAV in KS1

The model was unable to detect a difference in facial phenotype between KS1 patients with a PTV compared to KS1 patients with a PAV (0.555 [0.419–0.690], p = 0.786) (Fig. 8).

Figure 8

Empirical ROC curve (training set) for KS1 PAV with AUC in design №3. AUC, Area Under the Curve; KS, Kabuki Syndrome; PAV, protein-altering variant.

Discussion

The model we report distinguished KS from controls in the independent validation group with an accuracy of 95.8% (78.9–99.9%, p < 0.001). Only 1 patient out of 24 was classified as ‘control’ although she had KS (accuracy 96%). In the KS group, 10 out of 11 patients were correctly classified (accuracy 91%). Using the Face2Gene CLINIC tool on KS patients only (because DeepGestalt technology cannot recognize non-syndromic patients), 1 patient out of 11 could not be analyzed and therefore could not be classified as KS (accuracy 91%). Performances were therefore comparable. Interestingly, the patient not recognized by our model and by Face2Gene CLINIC was of African ethnicity, highlighting the lack of training data for non-Caucasian patients. The distribution of ethnic groups varies greatly from one center to another, which is why we believe it is important to encourage international collaborations in the field of Next Generation Phenotyping.

The model we report was also able to distinguish KS1 from KS2 with an empirical AUC of 0.805 (0.729–0.880, p < 0.001). Rouxel et al.5 showed that the Face2Gene RESEARCH tool distinguished KS1 from KS2 in a cohort of 66 patients with an AUC of 0.722 (p = 0.022). The same team showed a classification accuracy of 61% (20/33) by clinical genetics experts between KS1 and KS2. The performance of our model was at least comparable to that of Face2Gene RESEARCH and seemed to exceed that of clinical experts.

Rouxel et al.5 explained that KS1 patients had a longer face and nose, a thin upper lip vermilion and a longer midface in comparison to KS2 patients, who have a rounder face, a thicker vermilion and anteverted nostrils. Our study reports new phenotypic features not seen on frontal images alone for KS2, such as a particular morphology of the external ear, longer along the vertical axis and with counter-clockwise rotation.

Phenotype-genotype correlations have been reported in KS for extra-facial anomalies. Cardiovascular abnormalities, namely ventricular septal defects, coarctation of the aorta, atrial septal defects, bicuspid aortic valve, patent ductus arteriosus, and hypoplastic left heart syndrome52,53,54,55, are more prevalent in KS2 compared to KS11,56. Persistent hypoglycemia due to pituitary hormone deficiency, adrenal insufficiency, growth hormone deficiency and dysregulated insulin secretion by the pancreatic β-cells57,58 is also more frequent in KS210, possibly because the inhibition of KDM6A increases the release of insulin from pancreatic islet cells, as suggested by mouse models1,59. Urinary tract anomalies, such as horseshoe kidneys and renal hypoplasia, seem to be more frequent in KS1, and genital defects such as cryptorchidism and hypospadias could be more frequent in KS256,60,61.

Rouxel et al.5 underlined the lack of Asian patients in their evaluation, and proposed that larger series were needed to better define phenotypical differences between KS1 and KS2, and the general dependence of the phenotype on ethnicity6,12. The collaboration with an Asian clinical genetics center (Bangkok) is thus a strong point of this study.

The use of textural feature extraction allowed our model to account for typical KS characteristics not recognized by geometric analysis (Procrustes) alone. The lateral sparsening of the eyebrows and the heavy lashes giving the impression of made-up eyes were thus included in the classification.

Barry et al.1 reported a large meta-analysis including 152 articles and 1369 individuals with KS and assessed the prevalence of the different types of pathogenic variation per gene. The majority of KMT2D variants were truncating (non-sense 34%, frameshift 34%), then missense (23%) and finally splice site variants (9%). The majority of KDM6A variants were truncating (frameshift 36% > non-sense 27%), followed by splice site (20%), and missense (18%). We found similar results, with a higher prevalence of truncating non-sense variants for both genes. There was a higher prevalence of splice donor site variants, with 26% for KMT2D and 30% for KDM6A. Some authors report more severe clinical outcomes in patients with non-sense variants than in patients with a frameshift variant1. Faundes et al.56 found more severe neurodevelopmental anomalies in patients with protein-truncating mutations in the KS2 group. Shah et al.62 reported ophthalmological anomalies such as strabismus, blue sclerae, microphthalmia and refractive anomalies that were more severe in patients with a non-sense variant, and less frequent in patients with a frameshift variant. Our model did not find any significant difference in facial phenotype between PTV and PAV.

Conclusion

Here we report an automatic detection model for KS including the face, profiles and ears, with performances (AUC 0.993 and accuracy 95.8%) comparable to those of Face2Gene, on an independent validation set. These performances were achieved using an international cohort of 107 patients with a confirmed molecular diagnosis of KS. Using the same model, we were able to separate patients with KS1 (KMT2D) from KS2 (KDM6A), with an AUC of 0.805. These results appear at least comparable to those of Face2Gene and support the possibility of using a phenotype-first strategy to diagnose KS and detect its two causal genes.