Introduction

Making a diagnosis in a dysmorphic child requires a high degree of experience and expertise. Although some steps in finding the diagnosis are highly formalized, for example, database searches, others resist standardization. Among these are the physical examination and the examiner's evaluation of the overall impression of a patient. Imprecise and nonstandardized nomenclature, especially of facial features, poses a major difficulty for communication between clinical geneticists. Other, more subtle problems involve the very process of perception, which is subject to influence from many psychological factors, such as cross-influences of perceptions.1 In the wake of these problems, various attempts have been undertaken to characterize facial features on the basis of objective measurements; among them are photogrammetry and anthropometry, which are summarized elsewhere.2, 3 One recent study applies photogrammetry to 3D scans.4 The study shows that landmarks can be positioned reliably, an assumption underlying 2D- and 3D-based systems for syndrome classification. Here we would like to focus on computer-based techniques, which promise to help the clinician in the diagnostic process of syndrome identification by providing an automated analysis of the face. Physicians not specifically trained in dysmorphology could potentially benefit from following a formalized procedure to establish diagnostic hypotheses, which would still have to be evaluated and verified by a specialist. Two approaches have emerged that try to automate the procedures of face analysis. These explore either 2D images2 or 3D representations of faces.3, 5 We have contributed to the 2D approach and shall show extended results later. We now briefly describe the technical similarities and differences between the systems and refer to the original publications for details.
Both systems start by capturing a raw data set, which is acquired by either a standard 2D digital camera or a 3D surface scanner. Next, both systems identify landmarks of the face (nose tip, lip edges, etc.) with coordinates of points in the data sets (termed the correspondence). In the 2D approach, the coordinates of landmarks are enriched by a representation of the texture in a neighbourhood of the landmark, which is given by a Gabor wavelet transformation. In its current version, the 3D approach ignores the texture of the face. Rather, the 3D coordinates describing the surface of a face are used for discrimination purposes. Both approaches use landmarks to define corresponding regions in the face, which allows for a numeric representation of the face with numbers corresponding to well-defined features. Both methods seem to benefit from manual intervention in landmark placement.2, 6 Both systems therefore result in an objective description of the face, although quite different information is being used. An important question regarding automated procedures and their practical relevance is how well a sizeable number of syndromes can be dealt with simultaneously. Potentially, several hundred syndromic conditions with facial dysmorphisms could be relevant in clinical practice. To approach this question, we present an analysis of the classification of 10 syndromes and compare it with earlier results. Another important question concerns the decision process during classification. It is reassuring to see a computer using criteria that are similar to those of humans, and possibly additional criteria that might be implicit in the human decision. We address this question by visualizing the decision process.

Materials and methods

Probands

An extensive set of pictures of individuals representing 10 syndromes was collected. Written informed consent was given by parents of the probands or by the probands themselves. Details about the cohorts can be found in Table 1. On average, each group comprises about 12 individuals. The age of individuals varied from 1 to 40 years. Results of cytogenetic, molecular or biochemical analyses were obtained for all syndromes, except for Cornelia de Lange syndrome, for which no molecular test was available at the time of recruitment.

Table 1 Characterization of the data set

Picture acquisition conditions

Conditions for picture acquisition were standardized as much as possible. Three lighting sources providing soft light were used to illuminate probands. Ambient light was reduced to the extent possible, but could never be eliminated entirely as pictures were primarily taken at meetings of parent support groups in varying settings. A uniform background was used throughout. Standard digital cameras were used to take 2D still pictures (Nikon Coolpix 950, Nikon Coolpix 4500). Additionally, videos of probands were recorded using a Panasonic NV-MX350EG, and still pictures were later extracted from the videos. If a rotation of the face was recorded on the video, the extracted pictures usually allow for selection of a pose that is exactly rotated into a frontal position.

Picture standardization and selection

From the collection of pictures for each proband, one picture was selected for the final analyses. The criteria used were sharpness of the image, pose of the face and facial expression. Sharp pictures with a face rotated into a frontal position were selected. By subjective judgement, pictures were further chosen to show as little emotional expression as possible. Pictures were then converted to grayscale and cropped to a size of 256 by 256 pixels. For the probands included in the previous study, we have used the same pictures, which are depicted elsewhere.2 We give five examples of each group in Figure 1.
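
The conversion and cropping step can be sketched as follows. This is a minimal illustration and not the software used in the study: the text specifies only the grayscale conversion and the target size of 256 by 256 pixels, so the luminance weights (ITU-R BT.601) and the centre crop are assumptions, and the function names are hypothetical. In practice the crop window would presumably be positioned on the face rather than on the picture centre.

```python
def to_grayscale(rgb_image):
    """Convert a nested list of (r, g, b) tuples to a nested list of gray values."""
    # Assumed BT.601 luma weights; the text does not state the conversion used.
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


def center_crop(image, size=256):
    """Cut a size x size window out of the middle of a 2D nested list."""
    height, width = len(image), len(image[0])
    if height < size or width < size:
        raise ValueError("image is smaller than the requested crop")
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


# Synthetic 300 x 400 RGB picture, used only to exercise the functions.
picture = [[(x % 256, y % 256, 128) for x in range(400)] for y in range(300)]
face = center_crop(to_grayscale(picture))
```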

Figure 1

Examples of the facial photographs of the 10 syndromes investigated. Each row comprises one syndrome: microdeletion 22q11.2, Cri-du-chat, Cornelia de Lange, fragile X, Mucopolysaccharidosis III, Noonan, Prader–Willi, Smith–Lemli–Opitz, Sotos, Williams–Beuren.

Picture analysis

Pictures have been analysed using custom software. Picture analysis is a two-step process. In the first step, the face is located in the picture and nodes (points) are positioned on the face corresponding to a predefined pattern. Additionally, we have created a data set for which correspondence of nodes to picture positions was established manually (manual node correspondence). In the second step, a Gabor wavelet transformation was applied at each node.7 This transformation results in 40 coefficients per node, which can be used to locally approximate the picture texture in the neighbourhood of each node. Details of this process are given elsewhere.2, 7
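
To make the notion of 40 coefficients per node concrete, the following sketch computes a set of Gabor magnitudes (a "jet") at one pixel. It is an illustration under stated assumptions, not the custom software: the split into 5 frequencies and 8 orientations, the frequency spacing, the width sigma = 2*pi and the small window radius are assumptions chosen to give 40 values; the parametrization actually used is described in the cited publications. The function name gabor_jet is hypothetical.

```python
import cmath
import math


def gabor_jet(image, cx, cy, n_scales=5, n_orientations=8,
              sigma=2 * math.pi, radius=8):
    """Return n_scales * n_orientations Gabor magnitudes at pixel (cx, cy).

    image is a 2D nested list of gray values; pixels outside it are skipped.
    The small default radius keeps the demo fast but truncates the
    low-frequency kernels, so real use would need a larger window.
    """
    height, width = len(image), len(image[0])
    jet = []
    for nu in range(n_scales):
        k = math.pi * 2 ** (-(nu + 2) / 2)        # assumed frequency spacing
        for mu in range(n_orientations):
            phi = mu * math.pi / n_orientations   # assumed orientation spacing
            kx, ky = k * math.cos(phi), k * math.sin(phi)
            response = 0j
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    x, y = cx + dx, cy + dy
                    if not (0 <= x < width and 0 <= y < height):
                        continue
                    envelope = (k * k / sigma ** 2) * math.exp(
                        -k * k * (dx * dx + dy * dy) / (2 * sigma ** 2))
                    # Subtracting exp(-sigma^2 / 2) removes the DC part
                    # of the untruncated kernel.
                    wave = (cmath.exp(1j * (kx * dx + ky * dy))
                            - math.exp(-sigma ** 2 / 2))
                    response += image[y][x] * envelope * wave
            jet.append(abs(response))
    return jet


# Toy 32 x 32 picture with vertical stripes, used only to exercise the function.
stripes = [[255.0 if (x // 4) % 2 == 0 else 0.0 for x in range(32)]
           for _ in range(32)]
jet = gabor_jet(stripes, 16, 16)
```

For the striped toy picture, the kernel whose wave vector runs across the stripes at the matching frequency responds more strongly than the kernel of the same frequency oriented along them, which is the kind of local texture information the coefficients encode.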

Data set pruning

We have conducted the analysis based on data sets derived from both automatically and manually placed nodes. Additionally, we have constructed data sets for which the nodes on the rim of the face have been left out. The rationale behind this step is to reduce noise by removing potentially uninformative nodes, as hair is expected not to be a stable trait for many individuals and rim nodes also include information from the background. Combining both options, we conducted all analyses with four data sets. Unless noted otherwise, results are reported for manually labelled data with all nodes included.
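
The four data sets arise from combining the two node placement methods with the two node subsets. A minimal sketch, in which all names and the dictionary layout of a face (node identifier mapped to its list of coefficients) are hypothetical:

```python
from itertools import product

NODE_PLACEMENTS = ("manual", "automatic")
NODE_SETS = ("all nodes", "rim nodes excluded")

# Every combination of placement method and node subset: 2 x 2 = 4 data sets.
data_set_variants = list(product(NODE_PLACEMENTS, NODE_SETS))


def drop_rim_nodes(jets_by_node, rim_node_ids):
    """Return a copy of one face's node -> jet mapping without the rim nodes."""
    return {node: jet for node, jet in jets_by_node.items()
            if node not in rim_node_ids}
```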

Statistical analysis

The data set was analysed using several classification techniques (software package R, version 2.1.0; http://r-project.org). Specifically, these were linear discriminant analysis (LDA), support vector machines (SVM) and k-nearest neighbours (kNN). These techniques were compared to jet voting (JV), a technique used in a previous paper,2 which performs a nearest-neighbour classification at each node and classifies the syndrome by taking a majority vote over all nodes. To evaluate classification accuracy, we have performed cross-validation procedures. We used 10-fold cross-validation, the accuracy estimates of which were averaged over 20 runs. We have performed both simultaneous classification and pairwise classification of syndromes. The simultaneous classification serves to evaluate the problem of assigning a syndrome to an unknown face, that is, the problem of diagnosis. Pairwise comparisons of syndromes can be used to evaluate similarity of syndromes and to compare the performance achieved with the current data set with that of other data sets published thus far.
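
The jet voting rule and the repeated cross-validation scheme described above can be sketched as follows. This is an illustration, not the original analysis (which was carried out in R): the Euclidean distance between jets is an assumption, the folds are not stratified by syndrome, and the function names and toy data are hypothetical.

```python
import math
import random
from collections import Counter


def jet_voting(train_faces, train_labels, face):
    """Classify one face: 1-NN at each node, then a majority vote over nodes.

    A face is a list of jets (one list of coefficients per node).
    """
    votes = []
    for node, jet in enumerate(face):
        # Assumed similarity measure: Euclidean distance between jets.
        nearest = min(range(len(train_faces)),
                      key=lambda i: math.dist(train_faces[i][node], jet))
        votes.append(train_labels[nearest])
    return Counter(votes).most_common(1)[0][0]


def repeated_kfold_accuracy(faces, labels, classify, k=10, runs=20, seed=0):
    """Average k-fold cross-validation accuracy over several shuffled runs."""
    rng = random.Random(seed)
    indices = list(range(len(faces)))
    run_accuracies = []
    for _ in range(runs):
        rng.shuffle(indices)
        folds = [indices[f::k] for f in range(k)]
        correct = 0
        for fold in folds:
            held_out = set(fold)
            train = [i for i in indices if i not in held_out]
            train_faces = [faces[i] for i in train]
            train_labels = [labels[i] for i in train]
            correct += sum(
                classify(train_faces, train_labels, faces[i]) == labels[i]
                for i in fold)
        run_accuracies.append(correct / len(faces))
    return sum(run_accuracies) / runs


# Toy data: 2 nodes with 2 coefficients each, two well-separated groups.
faces = ([[[0.1 * i, 0.0], [0.0, 0.1 * i]] for i in range(10)]
         + [[[10 + 0.1 * i, 10.0], [10.0, 10 + 0.1 * i]] for i in range(10)])
labels = ["A"] * 10 + ["B"] * 10
accuracy = repeated_kfold_accuracy(faces, labels, jet_voting)
```

Because each node votes independently, this rule cannot exploit joint information across nodes, in contrast to a kNN classifier applied to the full feature vector.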

Before performing LDA, SVM and kNN, we performed a principal component analysis (PCA). PCA transforms a data set into a new coordinate system such that the variances of the new coordinates decrease with increasing rank. The PCA was used to reduce the number of covariates resulting from picture analysis. In the following analyses, we have used a contiguous block of principal components (PCs) starting with the first PC and including a variable number of subsequent PCs. The number of PCs included was chosen so as to optimize classification results. For SVM, we used polynomial kernel functions and optimized classification results by varying the degree of the polynomials. For kNN, the number of neighbours k was varied. We will refer to these procedures as model selection.
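
The selection of the number of leading PCs can be sketched as a simple search. The sketch assumes that the PC scores have already been computed and, for brevity, scores each candidate with a leave-one-out 1-nearest-neighbour classifier rather than the 10-fold cross-validated LDA, SVM or kNN used in the study; the function names and toy scores are hypothetical.

```python
import math


def loo_1nn_accuracy(points, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, point in enumerate(points):
        nearest = min((j for j in range(len(points)) if j != i),
                      key=lambda j: math.dist(points[j], point))
        correct += labels[nearest] == labels[i]
    return correct / len(points)


def select_n_components(pc_scores, labels, max_components):
    """Try the first 1..max_components PCs and keep the best-scoring count."""
    results = {}
    for m in range(1, max_components + 1):
        truncated = [row[:m] for row in pc_scores]  # contiguous block from PC 1
        results[m] = loo_1nn_accuracy(truncated, labels)
    best = max(results, key=results.get)            # ties: smallest m wins
    return best, results


# Toy scores: the first PC has the larger variance, but only the second
# PC separates the two groups.
pc_scores = [[0.0, 0.0], [1.0, 0.0], [30.0, 0.0], [31.0, 0.0],
             [0.2, 8.0], [1.2, 8.0], [30.2, 8.0], [31.2, 8.0]]
groups = ["A"] * 4 + ["B"] * 4
best_m, accuracy_by_m = select_n_components(pc_scores, groups, 2)
```

The toy example illustrates why PCs of lower variance can still matter for classification: the variance ordering of PCA is unrelated to how well a component separates the classes.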

To assess the validity of the classifiers resulting from the statistical procedures, we investigated the linear discriminant (LD) functions resulting from LDA. We have used the LD functions to produce diagrams that display the importance of individual wavelet coefficients in the classification decision. Details are given in Appendix A. These visualizations can be used to exclude artificial classification results, which might, for example, stem from slightly differing acquisition conditions. Additionally, criteria of classifiers can be compared with clinical characteristics of syndromes.
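
One plausible way to obtain such importance values is sketched below; the procedure actually used is the one given in Appendix A and may differ. The sketch assumes that the LD function was fitted on PC scores, so that its weights can be mapped back to the original wavelet coefficients through the PCA loadings, and that coefficients are stored node by node. All names are hypothetical.

```python
def back_project(loadings, ld_weights):
    """Map LD weights from PC space back to the original coefficients.

    loadings[i][j] is the loading of original coefficient i on PC j.
    """
    return [sum(l * w for l, w in zip(row, ld_weights)) for row in loadings]


def node_importance(coefficient_weights, coefficients_per_node=40):
    """Group absolute weights by node: one list of magnitudes per node."""
    magnitudes = [abs(w) for w in coefficient_weights]
    return [magnitudes[start:start + coefficients_per_node]
            for start in range(0, len(magnitudes), coefficients_per_node)]


# Toy example: 2 nodes with 2 coefficients each, LD function on 2 PCs.
toy_loadings = [[1, 0], [0, 1], [0.5, 0.5], [0, 0]]
toy_ld_weights = [2, -4]
toy_importance = node_importance(back_project(toy_loadings, toy_ld_weights),
                                 coefficients_per_node=2)
```

Each inner list could then be drawn as a box at the corresponding graph node, with darker shades for larger magnitudes.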

We provide a supplementary document to describe statistical methods in an informal way.

Results

Simultaneous classification

Results for simultaneous classification decisions are reported in Table 2, which shows accuracies after model selection. Overall accuracies are shown as well as a breakdown of accuracies for individual syndromes. The best overall accuracy of 76% was achieved by LDA. Both SVM and kNN performed worse at 70 and 68%, respectively. Model selection revealed that the simplest choice of classifiers, with a polynomial degree of 1 for SVM and a single neighbour (k=1) for kNN, yielded this result. This reflects the high complexity of the data set, which seems to dictate robust, that is, simple classifiers. JV performed considerably worse at 55%. This result is not surprising as JV does not make use of joint information at different nodes, whereas kNN does. It therefore seems that JV does not have the potential to handle a large number of syndromes. Figure 2 shows an example result of the automatic graph-matching procedure, and Figure 3 shows an example of the model selection procedure for LDA with results for individual syndromes. In general, with 20 included PCs classification rates are better than 70% and change slowly from that point on. However, the joint accuracy is optimal for 36 PCs, indicating that subtle information is contained in PCs in the range from 20 to 40. This diagram also suggests that classifiers could be optimized for syndromes individually and later be combined.8 One apparent exception from the good results for most syndromes is Cri-du-chat syndrome, for which performance dropped to 28%.

Table 2 Classification accuracies of four classification methods with different data sets
Figure 2

Example result of the automatic procedure matching a graph in a given picture.

Figure 3

Discrimination accuracy graphs for individual syndromes using LDA illustrating the model selection procedure. On the x-axis, the number of PCs used is given and the y-axis shows classification accuracies. Syndromes are abbreviated as follows: Microdeletion 22q11.2 (2), Cri-du-chat (5), Cornelia de Lange (C), fragile X (F), Mucopolysaccharidosis III (M), Noonan (N), Prader–Willi (P), Smith–Lemli–Opitz (S), Sotos (So) and Williams–Beuren (W).

Using a completely automatically analysed data set, the overall accuracy for LDA drops to 52% (Table 2, column 2). This is largely owing to failures of the automatic process to localize the face in some of the pictures, which results in strong noise in the data set.

Data sets with excluded rim nodes performed very similarly to the corresponding full data sets. We therefore do not give any detailed results, but report the overall accuracies. For the manual data set, performance was as follows: LDA 75.8% (40 PCs), SVM 68.7% (degree=1, 18 PCs), kNN 62.2% (k=1, 19 PCs) and JV 55%. For the automatically labelled data set, we obtained: LDA 52.8% (12 PCs), SVM 52.4% (degree=1, 11 PCs), kNN 48.6% (k=1, 10 PCs) and JV 50%.

Pairwise classification

Table 3 shows results for pairwise comparisons of syndromic conditions. These results were achieved using the first seven PCs after selecting the pair of syndromes. Most pairwise comparisons allow for accuracy well above 90%. The difficulty of discriminating Cri-du-chat syndrome in the joint analysis is reflected by the fact that comparisons with Noonan (79%), Smith–Lemli–Opitz (82%) and Sotos syndromes (81%) are at the lower end of the accuracies achieved for pairwise comparisons. One notable exception is the comparison of Sotos and Noonan syndromes, for which accuracy is 31%, indicating that the facial features discriminating these syndromes could not be learned by the classifier. This example is further scrutinized by graphical means below.

Table 3 Pairwise classification accuracy of syndrome discrimination using LDA (seven PCs)

Visualization

In general, it is important to verify classification results by scrutinizing decision rules, as these rules might pick up characteristics in the data set that are not associated with the classification goal, for example, background information in the case of face classification. We have therefore opted to visualize classification rules resulting from a pairwise LDA classification as described above. Figure 4 shows examples from these comparisons, which summarize the important features that can be learned from these visualizations. Figure 4a demonstrates that the decision rule to distinguish fragile X from Cornelia de Lange syndrome mainly relies on the eye, eyebrow and lower nose region. Additionally, features detected at the lower edges of the ears seem to be important. In general, the pairwise comparisons often highlight the eye region for fragile X and the eyebrow region for Cornelia de Lange syndrome, which consequently overlap when this pair is visualized. These two signals allow for a perfect discrimination of these two syndromes. Figure 4b demonstrates the comparison of Cornelia de Lange with Mucopolysaccharidosis type III. Signals are still located in the eyebrow region. The eyes, however, do not play a role in the decision between Cornelia de Lange and Mucopolysaccharidosis III. In general, the decision rule seems to integrate information from the entire face, sparing the hair and chin region. Discrimination between Smith–Lemli–Opitz and Noonan, depicted in Figure 4c, shows that information from nodes on each side of the nose as well as the mouth region is most important. The examples shown so far correspond well with clinical expectation. The worst pairwise classification rate, that of Sotos and Noonan, which is shown in Figure 4d, however, visualizes a classifier that does not seem to have learned relevant clinical traits. In particular, the fact that the weights are far from symmetric about the vertical axis suggests that the data set did not contain enough information to achieve a reasonable discrimination between Sotos and Noonan syndrome.

Figure 4

Visualization of pairwise discrimination rules derived from LDA. The boxes at the graph nodes show the collection of coefficients extracted there, and dark colors represent a strong influence of a node/coefficient on the classification decision. For details see text. (a) Fragile X vs Cornelia de Lange, (b) Mucopolysaccharidosis III vs Cornelia de Lange, (c) Noonan vs Microdeletion 22q11.2 and (d) Sotos vs Noonan.

Discussion

Classification results presented in this study are promising with respect to the problem that was posed initially: Can a computer be helpful in analysing faces if more than a few syndromes are involved? However, there are several caveats in drawing further conclusions, which we shall discuss. It is surprising how few examples per class are sufficient to maintain a stable distinction between syndromes even when the number of syndromes is sizeable. Also, the classifiers seem to learn features similar to those described by clinicians in most cases. As two experienced clinicians reviewed the pictures (GG-K, DW), the data set might arguably be biased towards ‘typical’ appearances. Unless the computer is instructed as to which examples are typical and which are not, it will treat all examples equally in the learning process, which might hamper accuracy for unseen examples. In conclusion, larger numbers of examples per class seem desirable but not essential to achieve accurate classification decisions even for a large number of syndromes. For real-world application, additional challenges such as ethnicity, age and facial expression have to be accounted for. Whereas ethnicity and age can be handled by enriching the study with appropriate probands, the influence of facial expression seems to be best accounted for by an explicit model, on account of its tremendous flexibility. It has to be noted that the description of facial expression is an open research problem, which is tackled independently.9

One specific result to be discussed is that for Cri-du-chat syndrome. An explanation could be the evolution of the facial phenotype in Cri-du-chat syndrome, which shows marked differences in facial features between individuals of different ages. This is also in accord with the often difficult diagnosis of Cri-du-chat syndrome on clinical evaluation alone.10 For the other syndromes, it is interesting to see that classification results are generally robust against age variation. Keeping in mind the results for Cri-du-chat syndrome, it is certainly desirable to explicitly account for age, say by regression analysis, but such an analysis has to be postponed until larger data sets are available. We have therefore opted to include probands irrespective of age, as age ranges were roughly similar. Another point concerning the learning algorithms is that the methods had to be changed as compared with the earlier study, which was extended here.2 This, however, is expected in light of the fact that a larger data set contains more information that might be best explored by different methods. As syndrome numbers grow, we expect the statistical methodology to be changed again to sustain reasonable classification accuracy. At the moment LDA, a very simple and robust method, performs best, which is reassuring with respect to the validity of the results. Validity was verified by graphical means, which again supported our findings. If classifiers become more complex (eg nonlinear), visualization might become more difficult; however, one possibility is to use subsets of data sets in the future and establish graphical validation as described here. The large discrepancy in classification results between manually and automatically obtained node correspondences seems to indicate that manual steps cannot be excluded entirely from any facial analysis software that intends to extract as much information as possible. However, finding landmarks in the face was not the primary goal of this study. There are several ways to improve automatic processing of data, including optimizing our current methods11 or using additional heuristic methods to localize the face in a given picture (Kalina, in preparation).

Compared with the previous study, accuracy dropped slightly from 80 to 75%. Taking into account the number of syndromes chosen, the relative accuracy (dividing accuracy by a priori accuracy) increased from 4 to 7.5, a fact that is promising with respect to the further extension of this study. Further improvements seem to be possible by integrating spatial information from coordinates of landmarks and side views. Preliminary results indicate that classification can indeed be improved using this information.
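
The relative accuracy quoted above is a short calculation. The sketch assumes equal a priori probabilities for all syndromes, so that the a priori accuracy is 1 divided by the number of classes; the figure of five classes for the previous study is inferred from the stated values (80% divided by 4 gives 20%) and is not stated in this section.

```python
def relative_accuracy(accuracy, n_classes):
    """Accuracy divided by the a priori (chance) accuracy 1 / n_classes."""
    # Dividing by 1 / n_classes is the same as multiplying by n_classes.
    return accuracy * n_classes


previous = relative_accuracy(0.80, 5)   # five classes is an inferred assumption
current = relative_accuracy(0.75, 10)   # 10 syndromes in the present study
```

Under these assumptions the two values come out at approximately 4 and 7.5, matching the figures given in the text.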

Running the statistical analysis employed here on the previously used data set2 results in a classification accuracy of >90%.

The most recent paper describing 3D methodology demonstrates an accuracy of 89% for a joint discrimination of five syndromes.5 Pairwise comparisons seem to be similar for both approaches. It should be noted that the analyses are not directly comparable, as both the statistical analyses and the sample sizes differ somewhat. On account of sample size within syndromes, results for the 3D approach should be more accurate. As texture information is captured simultaneously with spatial information in 3D analysis, it would be very interesting to see how the combination of 3D and texture information would perform. Whereas capturing of 3D information results in a richer data set and allows for excellent visualization as demonstrated recently,5 2D analysis has several advantages in practical use: equipment is cheap and easy to handle. It has to be noted that neither 2D nor 3D methods have direct applicability in clinical practice yet, as the number of syndromes is still very small. However, recent results demonstrate that automated technologies have the potential to amend and enrich the process of finding the diagnosis. Finally, integration of clinical information seems to be critical to establish usable databases. In clinical practice, the relative importance of face and other clinical information varies from syndrome to syndrome and from patient to patient. It is our goal to contribute to this integration.