Personalized quantification of facial normality: a machine learning approach

What is a normal face? A fundamental task for the facial reconstructive surgeon is to answer that question as it pertains to any given individual. Accordingly, it would be important to be able to place the facial appearance of a patient with congenital or acquired deformity numerically along their own continuum of normality, and to measure any surgical changes against such a personalized benchmark. This has not previously been possible. We have solved this problem by designing a computerized model that produces realistic, normalized versions of any given facial image, and objectively measures the perceptual distance between the raw and normalized facial image pair. The model is able to faithfully predict human scoring of facial normality. We believe this work represents a paradigm shift in the assessment of the human face, holding great promise for development as an objective tool for surgical planning, patient education, and as a means for clinical outcome measurement.

. General signal flow and high level architecture of the study model demonstrating the four key steps: preprocessing, normalization, feature extraction, and prediction. As a first step, preprocessing is applied to the original image to obtain a cropped, aligned and masked image. Then, the preprocessed image is normalized by preserving its unique characteristics and filtering out any structural anomalies. As a next step, features are extracted by measuring the distances between normalized and preprocessed images and latents both in the image space and in the latent space. While the features dx perc , dx str , and dx psnr denote the perceptual distance, structural distance, and PSNR distance in image space, dw bc and dw corr denote Bray-Curtis distance and correlation distance in latent space, respectively. After retrieving the representative features, human rating is predicted ( ŷ ) using the MLP regressor with only one hidden layer having 6 neurons. Original image, reference 37  Image preprocessing. Preprocessing was the first step of our model in which a face within an image was detected, centered, cropped, aligned, and masked so that the main components of the face were located in a predetermined orientation 25,38,39 . We labeled the output of this preprocessing phase x raw .
Image normalization. The second step of our model was normalization, in which anomalous elements in x raw following the image preprocessing step were corrected without altering the "normal" features of the face. The output of this step was labeled x nrm . We formalized the normalization procedure as a two-objective optimization problem, considering both similarity and averageness losses. While the objective of similarity loss was to preserve the distinctive elements of the face, averageness loss drove the correction of any structural aberrancies. Optimization was carried out within the latent space W of the StyleGAN's generator G in order to find a latent vector w * in W , which yielded the best results in terms of similarity and averageness. We represented the similarity loss by L S , and the averageness loss by L A . The normalization procedure is formally represented by the optimization problem: where sim and avg are weighting constants that we manipulated in order to perform an optimal trade-off balance between the relative importance of the similarity and averageness losses, and w avg denotes the latent vector of the population average. This optimization is represented schematically in Fig. 2a,b.
Because the success of the normalization operation depended primarily on filtering out the anomalous features of any given image, while preserving the normal elements, proper selection of L S and L A played a vital role in generating a realistic x nrm from a given x raw . In increasing order of algorithmic complexity, we tested the following three methods for L S : pixel loss ( L S pix ), structural loss ( L S str ) 27 and perceptual loss ( L S perc ) 26 . In similar fashion, to measure L A we tested mean absolute error ( L A mae ), mean squared error ( L A mse ), and mean exponential error L A mee . Taken together in various combinations, these six methods of loss measurement provide nine different options for normalization. In addition, an unlimited amount of fine tuning of the system was available by manipulation of sim and avg . For each of the nine combinations of loss functions, we adjusted the parameters so as to generate the best normalized output possible as determined by investigators' visual appraisal (Fig. 3). Using this approach, L S pix was found to generate realistic human images, but we noted that certain basic facial features such as gender, age, and pose were misaligned. Similarly, L S str , generated relatively blurry images. Conversely, L S perc produced highly realistic images without degrading the main features of a given face. Having determined the optimal L S , we then examined the effects of the three different L A on the normalization process and observed that L A mse provided the best result. Another factor critical to the normalization operation was the initial value of the latent vector w, which we assigned empirically. Due to the fact that W is multi-dimensional in nature, simply choosing a random w may not be advantageous because the initial value may fall very far from-and never converge towards-x nrm . Therefore, we chose w avg as our initial assignment for w, and commenced the iteration process from that point. The number of iterations was fixed at 500 since it was observed that no improvement was achieved beyond that point.
In Fig. 4, the iterative progression of our normalization procedure is demonstrated for nine raw facial images depicting individuals with: incomplete cleft lip, complete cleft lip, craniofacial microsomia, Treacher Collins syndrome, craniofacial dysostosis, and left facial palsy. For purposes of concision, only representative steps are shown (rounded integer iterations of ln(i) for i = 1 − 500 ). Notice that at the first iteration of normalization, basic features of the given images such as skin color and pose are set. Then as the iterative process continues, secondary features such as age and gender become aligned. Ultimately, the more distinctive elements of the face are seen to emerge. It is critical to ensure that the latent vector w always resides along the "normal" manifold within the latent space, and does not stray off which could derail the normalization process. Such a condition was controlled within our algorithm by restricting each member of w i between −1 and 1. If during SGD w i escaped that interval, it was replaced by a random number generated within that range. In Fig. 5, 16 representative pairs of x raw and x nrm are illustrated. It can be appreciated that the normalization process effectively remedied the abnormal features of the faces, while strictly preserving all key remaining details. Image feature extraction. Because the objective of our study was to accurately predict the extent of abnormality within a facial image, it was necessary to extract salient facial features in both the image and latent spaces in an effective manner. In order to achieve this within the image space X , distances were measured by employing various techniques including LPIPS ( dx perc ), MSSIM ( dx str ), PSNR ( dx psnr ), mean absolute error ( dx mae ), mean squared error ( dx mse ), root mean squared error ( dx rmse ), and log hyperbolic cosine ( dx lcosh ).
The path to ascertaining distance measures within the latent space was more challenging. While the normalization procedure yielded the latent vector w nrm , it did not yield w raw . To obtain the latter, we repeated the normalization procedure for each raw image by setting avg = 0 to find w raw . That is, L A was discarded, and only L S was used in the normalization operation when implemented to determine w raw .
After obtaining both w nrm and w raw , distances between w raw and w nrm were computed using various distance measures including Bray-Curtis 29 ( dw bc ), Canberra 58 ( dw can ), Chebyshev 59 ( dw che ), Manhattan 60 ( dw man ), correlation ( dw corr ), cosine ( dw cos ), Euclidean ( dw euc ), Mahalanobis 61 ( dw mah ), and Minkowski 62 ( dw mink ). In order to verify the suitability of our selected distance measurements, we tested them against human facial ratings on a training dataset. Figure 6a shows a correlation matrix representing the correlation between our extracted features (first 7 in X , and last 9 in W ), and the correlation between those same 16 features and the human ratings that we  (1). In (a) modeling of the iterative normalization procedure in latent space W is depicted. The dashed curved line represents the "normal" manifold in W , and w * is found along that curve by iterative updates. At each iteration i, the next latent vector w i+1 is found by adding the gradient vector of similarity loss ∇L S and the gradient vector of averageness loss ∇L A to the current latent vector w. Note that L S is calculated in image space X , and gradients are back-propagated to the w. In contrast, L A is calculated in the W space, and thus gradients are calculated directly. These two conflicting gradients were calibrated in a manner that w * satisfies both similarity and averageness objectives upon completion of the normalization operation. In (b) trade-off between image similarity loss ( L S ) and latent averageness loss ( L A ) is illustrated. We obtained x nrm (green) for each x raw (red) using our normalization algorithm. The distances between the images separated by red lines reflect the distance in image space between x raw and x nrm measured with L S perc . Similarly, the distances between the images separated by green lines reflect the distance in latent space between w raw and w avg measured with L A mse . Note that while L S attempts to produce images similar to x raw , the objective of L A is to converge upon the population average.
All numbers above are multiplied by 100 for better visualization. Images A,B,C, above, correspond to references [40][41][42] , respectively. Images within red and blue circles, above, derive from the StyleGAN face generator 25 . All faces within green circles, above, are transformations created by the study encoder.
Scientific Reports | (2020) 10:21375 | https://doi.org/10.1038/s41598-020-78180-x www.nature.com/scientificreports/ collected. Because PSNR measures the similarity between images, dx psnr predictably correlated in a positive manner with the human ratings. Conversely, other features measuring dissimilarity were found to correlate negatively. As a next step, we applied the extremely randomized trees algorithm (ERT) 63 to our training dataset to select the best representative features out of those 16. Similar to the random forests algorithm 64 , ERT is an ensemble learning technique which reduces overfitting and improves efficiency of the model by producing a multitude of individual decision trees and averages them to obtain a final prediction. We fitted 1000 randomized decision trees on various sub-samples of the training data in an effort to enhance the prediction accuracy and reduce the over-fitting by using ERT. The relative importance of the features used to predict the human scores can be seen in Fig. 6b. We set the relative importance threshold at 5% as a minimum criterion to include a feature in our model. Therefore, we selected dx perc , dx str , dx psnr , dw bc , dw can , dw corr , and dw cos as our critical features of interest. We eliminated dw can and dw cos since they were so strongly correlated with dw bc (0.99) and dw corr (0.98), respectively. This yielded 5 key predictive features that we used for the remainder of our analysis.
Prediction of image scores. The final step in developing our facial normality assessment model was to establish its ability to reliably predict human ratings. In order to achieve this, we tested a number of regressors including Linear, Huber, Support Vector, Ridge, Lasso, and Multi Layer Perceptron (MLP) Regressor models. As part of our protocol, we collected scores ( 1 = abnormal ; 7 = normal ) from 80 human raters (average age = 33.96 , 60% male) on 150 private facial images for the purpose of training our regression models. In addition, 50 raters (average age = 27.5 , 46% male) also scored a separate group of 60 open-source images in order to test the efficacy of our system. Hyper parameters of each regression model were optimized by training on 80% of the training data, and our models were validated on the remaining 20% percent of the training data. Each optimized regression model was then tested on our previously unseen public test data.
In order to determine which of the six candidate regression models was the best predictor, we compared each with the human ratings using Pearson correlation (R, higher is better) and MAE, lower is better). In Fig. 7a the relative success of the six tested regression models is shown. The R values of models are quite comparable, falling between 0.87 and 0.90. However, the MAE of the MLP regressor was much lower when compared to other models (0.57). That is, using the MLP our computer model was able to, on average, predict the human rating of facial normality within 0.57 on a scale of 1-7, and with a correlation of R = 0.9.
It was observed that the quality of our normalization output was variable due to the stochastic nature of the process. Therefore, we investigated the effect of generating multiple (K) versions of x nrm using the K-nearest neighbor algorithm 65 (Fig. 7b). We began empirically at K = 9 , and measured the dx perc between x raw and each rendition of x nrm . For example, when K = 5 we selected five x nrm ; i.e., the 5-nearest neighbors of x raw in terms of their perceptual distances dx perc . Ultimate feature determination was achieved by averaging the features of these 5-nearest x nrm versions. We observed that increasing K up to a value of 5 resulted in a corresponding increase in R. Similarly for MAE, the effect of increasing K was beneficial until K = 5 , beyond which diminishing returns were noticed. . Effects of L S and L A in the normalization procedure. We randomly chose three different raw images in order to run our normalization algorithm to test three different L S and three L A functions. The top row represents the raw images as a reference. The remaining matrix of images below demonstrates the effect of the different similarity ( L S ) and averageness loss ( L A ) tools. By examining this matrix it is visually apparent that the optimal combination of L S and L A derives from the simultaneous use of L S perc and L A mse . The three images depicted in the top row, left to right, correspond to references 40,42,43 , respectively. All remaining images are transformations created by the study encoder.
Scientific Reports | (2020) 10:21375 | https://doi.org/10.1038/s41598-020-78180-x www.nature.com/scientificreports/ We also investigated the mean absolute deviation (MAD) of human scores for the training and test data. We plotted in Fig. 8a,b the MAD of the mean human score for each image ( |y i −ȳ| ) versus the mean human score ( ȳ ) for each image in both datasets. It was observed that the statistical dispersion of human scores was relatively less on the extremes of the rating scale (when ȳ < 3 , MAD was 0.80 for training data and 0.89 for test data; and when ȳ > 5 , MAD was 0.68 for training data and 0.52 for test data). In contrast, human scores demonstrated greater variance in the mid-range (when 3 <ȳ < 5 , MAD was 1.19 for training data, and 1.25 for test data). This finding suggests that the judgment of our human raters was more in agreement for the more normal and abnormal images, while there was slightly more discrepancy in the assessment of the more intermediate faces.
Similarly, we plotted the (MAE) of the machine scores |ŷ −ȳ| with respect to human score ( ȳ) for training and test data in Fig. 8c,d, respectively. Distribution of MAE errors measured 0.77 when ȳ < 3 , 0.61 when 3 <ȳ < 5 , and 0.54 when 3 <ȳ < 5 for training data. In contrast, MAE values were measured as 0.53 when ȳ < 3 , 0.56 Figure 4. Iterative progression of our normalization technique. Each row represents the progression of an individual image towards its "normalized" version. The first column is x raw . Subsequent 7 columns denote the iterations rounded to the nearest integer natural logarithm ( ln(i) ) for each iteration i between 0 and 500, respectively. The final column represents the optimal version out of 500 iterations, namely x nrm . References for images in column x raw : row 1 44  www.nature.com/scientificreports/ when 3 <ȳ < 5 , and 0.56 when 3 <ȳ < 5 for test data. This suggests that our machine scoring paralleled the human scoring across the full spectrum from abnormal to normal images (Fig. 8e,f).

Discussion
Greater than 100 billion unique human forms have thus far been produced on Earth 66 . While tremendous phenotypic differentiation exists, arguably the most important layer of individuality is one's distinguishing facial features. With each new recombination, novel variations in facial characteristics such as size, proportion, color, shape, projection, pattern of animation, etc., come forth. This understanding lends some context for the surgeon aiming to reconstruct a deformed face. A facial reconstructive surgeon's objective is not to transform a face so that it matches a population norm: rather, it is to eliminate anomalies while preserving intact all distinctive facial features unique to a given patient. This objective was mirrored by the success of the first phase of the current project: the development of an automated system to normalize the abnormal components of raw facial images.
The subsequent phase of the study was then able to effectively assign a distance measurement between raw and normalized images that closely reflects human evaluation.
In designing a machine learning system to address our stated aims, either a classification or a regression approach could have been undertaken. In a preliminary phase of our work, we tested a classification approach. We collected 2000 internet images representing a wide range of known congenital and acquired facial deformities, and graded them into 5 distinct levels of deformity by visual appraisal. We then trained a convolutional neural network (CNN) to classify the images. What proved problematic to our goal, however, was the enormous variation in age, gender, race, diagnosis, and other characteristics existing within each class. Indeed, faces within each category looked markedly dissimilar from one another. To apply a classification approach to the clinical setting, in which surgeons and their patients place a high premium on the detection of very subtle facial differences, a fine-grained level of measurement is essential. In order to recognize exquisite gradations within a system through the use of a deep learning model, a neural network must be trained using a set of clearly differentiated classes. Thus, our challenge was to assemble a large image database and label it finely on the basis of distinguishing facial characteristics. However, it is difficult for human raters to narrowly classify facial images by features such as race or severity of deformity because the gradations can be indistinct and continuous. In terms of the classification of race, for example, there is inevitable ambiguity due to the historical admixture of human populations. Confidently labeling an individual by skin color, or as Western European versus Central American versus  Similarly, attempting to construct a broad gradient of facial deformity by human rating of images is problematic. Unless vast numbers of training set images depicting normal and deformed faces of all different types can be obtained -and with a relative distribution matching that of the general population-a CNN will be unable to reliably classify new images contained within a test dataset with the discriminating detail required for the clinical setting. The rarity (some as infrequent as < 1/10 6 live births) and variety (dozens) of each of the many different facial conditions encountered clinically makes assembling and labeling an adequate image set prohibitive. It is understandable, therefore, that the results of our classification model were poor. For each of the 5 designated classes, the sensitivity (true positive rate) was 0, 0.55, 0.47, 0.48, 0.88; and the specificity (true negative rate) was 1, 0.87, 0.88, 0.87, 0.91, respectively 67 .
Having uncovered the shortcomings of a classification approach, we then attempted a protocol in which we utilized the LPIPS perceptual similarity measure to compare images depicting deformity with an average male and female face (i.e., mean anchors) 68 . Training a computer model to interpret facial images in a manner aligned with human perception, however, represents a major challenge. Superimposed upon the neurophysiologic components of visual perception are multifactorial sources of human subjectivity, and an innate attraction to certain regions of a face more than others [11][12][13][14] . Numerous perceptual distance algorithms-LPIPS amongst them-have been introduced in an effort to meet these fundamental challenges [26][27][28][29] . We chose to use the recently introduced LPIPS measure because it has been purported to account for some of the nuances of human perception. We tested this distance measure by comparing 25 pairs of pre-and post-reconstructive facial images with their respective gender anchor images. Unfortunately, this approach of measuring the distance between raw images and an average facial standard grossly under-represented the magnitude of difference between pre-and post-operative images relative to the visual interpretation of the investigators. The mean difference between before and after images was only 2.69%. Moreover, post-operative images were labeled as improved (i.e., more normal) only 76% of the time. Our impression was that this method did not provide adequate sensitivity to fine facial changes because the overall digital difference between any given raw image and an average anchor is likely to be extensive. That is, the significance of any particular structural anomaly will be diluted out within the wider backdrop of existing difference between raw and average images. Consider, for example, an average male anchor who appears to be a 30 year old of a given race, when assessing two males with similar modest facial deformities. If one is a 20 year old of the same race as the anchor, and the other an 8 year old of a different race, the comparison may very well be influenced more by race and age than the subtleties of the facial deformity. Therefore, using a normal anchor as a benchmark to expose outlier regions of anatomy is only feasible if it can actually match gender, race, age and all other facial features excluding the area of deformity for every image. This is an unrealistic strategy, as explained above. Furthermore, a similarity measure such as LPIPS is agnostic in how it interprets an image, assessing a landscape of pixels holistically, rather than assigning differential values to certain regions of a face. The significance of anomaly detection and correction using this approach will not be afforded the attention it would otherwise receive from a discriminating human eye.
In considering the study objective further, the process of human facial perception was considered. Within the brain there exists a linked neural network involving various regions in the temporal and occipital lobes that is dedicated to the critical and highly evolved task of interpreting faces 69 . As early as the first few days of life, neonates are able to recognize and mimic faces 70,71 , and the impact of facial perception on social interaction is profound throughout one's life 72 . Over time, and with a vast experience of visual cognition, humans become www.nature.com/scientificreports/ extremely adept at recognizing and distinguishing large numbers of faces, and subliminally detecting and evaluating almost imperceptible changes within a face. Humans are also able to instantly imagine the elimination of those differences. In fact, this is the process that a facial reconstructive surgeon goes through during clinical assessment of a patient: detect an anomaly, evaluate the severity, and imagine its correction. This progression is implemented automatically, and in a unique manner, for each individual face encountered. We determined that a similar approach might enable us to achieve our objective. That is, a comparison protocol between any raw image and it's own normalized analogue (rather than a gender average) might allow for a more discerning extraction of divergent features. This step required the introduction into our protocol of a convolutional neural network face generator (StyleGAN) which delivered a vast potential for creating new faces. The digital data contained in this GAN can be loosely compared to the 20,000 protein-encoding genes in the human genome that permit the formation of an almost boundless number of recombinant human forms 73 . We then designed an algorithm that would iteratively transform any raw facial image-along with information deriving from our generator-into a normalized analogue, by eliminating anomalies while maintaining all unique facial details. By revealing the aberrant regions of the face through the normalization process, we were then able to extract distance features. The normalization step was fundamental to this process; one is unable to define abnormal without establishing a complementary definition of normal. Once discerning the key anomalous features within our raw images, we then trained an MLP model that was able to predict human scores with an average error of 0.57 on a 1-7 Likert scale, and with a Pearson correlation of 0.90 between our machine:human scores. To further validate our model, we then tested our design on the same 25 pairs of pre-and post-operative images that we used in the preceding trial. Our final model performed at a far superior level: mean difference detected between before and after images 48.87% (vs. 2.69%); and post-operative images were labeled as improved (i.e., more normal) 96% of the time (vs. 76%).
As stated above, a facial reconstructive surgeon learns to automatically detect an anomaly, evaluate its severity, and imagine its correction. It can be argued that our model functions so effectively by applying a similar progression, though in a different sequence. First, it normalizes a facial image by filtering out anomalous anatomic features using both a state-of-the-art GAN and an image similarity tool. Then, each raw image is compared to its normalized version using several candidate distance measures from different domains, thereby uncovering and quantifying the anomalous facial elements. Finally, implementation of an optimal regressor integrates rawnormalized distances into a predictive model of human normality ratings.
Certain limitations of our design were noticed. Because our stated goal is to discern granular changes in a face, the system is dependent on the availability of input images of adequate resolution (in general, higher than 256 × 256 ). Image orientation is also critical to the system since the face detection algorithm that we employ hinges on capturing key structural elements in the face (i.e., eyes, mouth, nose). Fortunately, within the clinical setting, images will routinely be of adequate quality to satisfy the model's requirements. Another possible limitation in our study is that our regression model was trained using human ratings of facial normality. It is appreciated that human appraisal can be influenced by a variety of cognitive biases [74][75][76][77] . For example, it is plausible that individuals may rate children's faces with deformity more favorably than adult faces. Race and gender may also impact human ratings in ways that one would not expect an objective machine system to respond. Thus, with training based on human ratings, we have "built" human bias into our computer model. One could argue that this represents either a limitation or a strength of our design.
The use of the face generator, while fundamental to our study approach, also represents a limitation. StyleGAN was trained on a dataset of "70,000 high-quality PNG images at 1024 × 1024 resolution... [containing] considerable variation in terms of age, ethnicity and image background" 25 . While no further details about the dataset are available, its diversity is truly borne out by the ability of our model to consistently and reliably normalize a broad variety of faces across a spectrum of gender, age and racial profiles. We surmise, however, that StyleGAN contains a preponderance of female images because we noted that the computed population average appeared to be female in gender (Fig. 2b). Thus, it was necessary for our design architecture to overcome that bias over the course of the iterative process. Fortunately, it appears to have done so both effectively and reliably (Fig. 4).
Currently, clinical outcomes of surgical interventions are considered primarily through a subjective assessment by patient and surgeon. Over the past decade, there has been increasing attention paid to the development of patient-reported (as opposed to provider-reported) outcome measures (PROMs). While there is no denying the importance of patient/parent attitudes regarding treatment outcome, PROMs are not free of human bias 78,79 , and may not tell the entire story regarding outcome. A PROM doesn't likely get at the essence of what others may see in a patient's pre-and post-operative face. Other available outcome measurement techniques that do gauge structural parameters of a face are not benchmarked against a given patient's own theoretical facial norm, and are not easily adapted for use in the clinical setting. This is where our innovative model promises to fill a crucial gap. We are currently working towards making the technology applicable for handheld smartphone devices. By offering a convenient and truly objective means of measuring both the degree of facial deformity and the extent of improvement achieved through surgical intervention, this approach would represent a fundamental advance in the field through the translation of machine learning to medicine.

Methods
Nineteen raw facial images were used in this study, all open-source obtained from the internet, and licensed for re-use through the creative commons (all sources included in References). An additional 125 images displayed in this article are newly created transformations of those original 19 images, generated by our novel computer model. A given image x raw ∈ R n×n×3 is modeled as superposition of the images x nrm and x a where x raw , x nrm , and x a denote the raw image, the normalized image, and the anomaly of the given image, respectively. To calculate the anomaly of a given facial image, x a should be quantified and measured objectively by a function. However, Scientific Reports | (2020) 10:21375 | https://doi.org/10.1038/s41598-020-78180-x www.nature.com/scientificreports/ it is not possible to directly quantify x a since it is very dependent on x raw . Further, to the best of our knowledge, there does not exist in the literature a direct metric to put x a on a scale. Short of being able to directly quantify x a , we therefore adopted an indirect approach. The initial step of preprocessing involved the detection and alignment of images. Then, we semantically filtered out x a from x raw to arrive at x nrm . After obtaining x nrm , we then tested a variety of distance measures to measure the difference between x raw and x nrm . This distance reflects the impact of any given facial anomaly, x a . As a last step, we built a regression model to predict the extent of facial anomaly represented by the extracted distances in the previous step. These steps are briefly described below.
Preprocessing. We first detected the face in the given image using dlib's face detection algorithm 38 . Next, the specific points of the face such as chin, eyes, eyebrows, nose, nostrils, and mouth were detected using dlib's 68 point landmark detector 38 . Then, we cropped, resized, transformed, and aligned the face by calculating auxiliary vectors from eye to eye and eye to mouth 25 . A face mask was also applied to the aligned image by filling the convex polygon of the outermost points of the face using opencv 39 . At the last step of the preprocessing, the masked area (convex polygon) was enlarged 50 pixels to its border to create some space for the normalization algorithm (refer to Fig. 1). Since StyleGAN produces images of size 1024 × 1024 × 3 , the same size was preserved.
Image normalization. The normalization procedure involved two simultaneous and conflicting objectives working in balance. The first objective was similarity, which is measured in X between x raw and x nrm , and was responsible for preserving the "normal" features of the individual. The second objective was averageness, which was measured in the latent space between the optimization variable w and the average vector of the population w avg , and was responsible for eliminating the abnormal elements of the facial image. The aim of the normalization procedure is to come up with a candidate latent vector which satisfies both goals. However, every face is unique, and it is practically impossible to build a facial database that could provide these two criteria simultaneously for any given face. Thus, candidate faces should be dynamically generated on the fly. Generative neural networks such as variational autoencoders (VAEs) 80 and GANs 81 are two major generative models which can generate these candidate faces. StyleGAN 25 is currently the state of the art GAN model, and can generate highly realistic faces in very high resolution. Therefore, we decided to use StyleGAN's generator as our face source. For any given latent vector w ∈ R 18×512 , StyleGAN can generate its corresponding face x ∈ R 1024×1024×3 by using its generator G. To both decrease the complexity of the optimization and to produce realistic images, we used a R 1×512 latent vector and tiled it 18 times across the layers of generator's input.
In Algorithm 1 the steps of the proposed normalization method are described. The optimization variable w and optimization results w * are initialized as population average w avg , and the best loss value L * is initialized to infinity. After the initialization phase, loss value L and its gradient ∇L with respect to w are calculated and the w value is updated at each iteration. Moreover, if the current loss L is less than the best loss L * , the best loss L * and the best latent w * are saved for later use. The described loop is iterated for n=500 times, and w nrm = w * and corresponding image x nrm = G(w * ) are returned as outputs of the algorithm. We used L S perc as L S and L A mse as L A with constants sim = 10 and avg = 100 , respectively. Feature extraction. A number of different conceptual approaches were utilized in an effort to quantitate a difference between images. Among the most effective were PSNR, MSSIM and LPIPS.
PSNR is a very basic tool to measure the image difference since the images being compared are almost the same except the area which includes the abnormality. PSNR represents the ratio between the maximum possible value of a signal and power of noise (distortion) 28 and is evaluated as: (2) dx psnr (x, y) = 20 log 10 (MAX I ) − 10 log 10 1 H l W l h,w ||(x hw , y hw )|| 2 , Scientific Reports | (2020) 10:21375 | https://doi.org/10.1038/s41598-020-78180-x www.nature.com/scientificreports/ where H and W stand for the height and weight of the image, respectively, h and w denote the index variables along the H and W directions, and MAX I represents the maximum possible value of the pixels. MAX I is set to 255 since each pixel is represented by one byte. In addition to PSNR, we also employed the MSSIM which quantifies the image degradation between two pairs of images 82 in a sliding window manner. Given two images x and y, the SSIM is expressed as: where µ x and µ y stand for averages of the current sliding window, σ 2 x and σ 2 y denote their variance, σ xy is the covariance of x and y, and C 1 and C 2 represent some constants. We kept the length of the window, C 1 and C 2 at their default values, 11, 0.01 and 0.03, respectively. MSSIM is the advanced version of SSIM and incorporates image details at different resolutions obtained by applying low pass filtering and down sampling 27 . Since MSSIM values are limited to [ −1, 1 ] and it is employed in our design as a structural distance measurement tool instead of similarity, we used the dissimilarity version of it, namely, which produces the value 0 for exactly same image pairs and 1 for exactly different image pairs.
It has been verified in 26 that features obtained from the hidden layers of the Deep Convolutional Neural Networks such as SqueezeNet, AlexNet, and VGG are exceptionally good to measure the perceptual difference between two images. Therefore, we employed LPIPS to quantify the distance between two given images. The distance between two images is calculated via: where l denotes the layer of the network, H l and W l stand for the height and width of the layer l, h and w represent the current indices of H and W, w l contains the optimized weights for layer l, and ⊙ denotes the vector multiplication operation. At each employed layer, the first step is to extract and normalize the features along all pixels. Then each layer is multiplied by a layer specific constant w l . The final score is calculated via L 2 distance and summed along l layers. In this paper we used the VGG network and its conv1_2 , conv2_2 , conv3_2 , conv4_2 and conv5_2 layers as perceptual measure 83 .
In addition to these three powerful image similarity functions, we also obtained various basic features by comparing images pixel by pixel. Namely, we calculated mean absolute error ( dx mae ), mean squared error ( dx mse ), root mean squared error ( dx rmse ), and logarithm of the hyperbolic cosine error ( dx lcosh ) between x raw and x nrm along their pixels i via the Eqs. (6)-(9), respectively 84 , (see Table 1).
Similar to the image space X , we obtained several features in the latent space W by measuring the distance between w raw and w nrm . The most basic metrics are the L p norms for p = 1 , 2, and 3 labeled with Manhattan (10), Euclidean (11), and Minkowski (12) distances, respectively 62 . In addition to these three basic metrics, we calculated other dissimilarity measures from diversified areas. Bray-Curtis distance (13) is generally used in ecology to compute the population differences of species at two different locations 29 . Canberra distance (14) is another measure mostly used to detect intrusions and to compare ranked lists 58 . Chebyshev distance (15) quantifies the difference between two vectors as the maximum absolute difference between the entries of the two vectors 59 . Correlation distance (16) measures the difference between two vectors as the ratio of their centered dot product to their Euclidean distances. Cosine distance (17) simply measures the angle between the vectors. Mahalanobis distance (18) measures the distance between two vectors extracted from the same distribution by taking into account the correlation between components and rescaling each of them to unit variance 61 . All defining distance measures are tabulated and discussed in Table 1 for easy reference.
Feature selection. Having extracted a number of possible promising features, we further analyzed the correlation and importance of the features to select the most representative ones to predict human scoring. In this regard, first, the feature correlation matrix was depicted as a heatmap plot to visually assess the extracted features (Fig. 6a). Then, to quantify the relative importance of these features, we applied the ERT 63 regression model.
As an ensemble method 85 , ERT aims to improve robustness and generalization capability over a single predictor by building numerous base individuals and calculating the average of each predictor as the final output. By averaging the predictions of these individual trees, it reduces the variance of the final prediction and hence improves model's efficiency and reduces overfitting 86,87 .
The relative importance of a given feature was associated to its depth within the randomized decision tree. Features contributing to the final score prediction for a higher proportion of input data generally reside at the lower level of the tree 88 . Therefore, the relative importance of a feature could be estimated from the expected proportion of samples they contributed to 89 .
After running 1000 ERT, the relative importance of each the 16 features was obtained (Fig. 6b). A 5% relative importance threshold was set as criterion to include a feature for human scoring prediction. As a consequence, the number of features was reduced from 16 to 7. We further removed dw can and dw cos from the selected feature set since they were highly correlated feature pairs ( ≥ 0.98 ) with dw bc and dw corr , respectively.  90 . In addition to the least squares optimization, Ridge and Lasso Regressors manage the variance in the target variable by penalizing the magnitude of the predictor weights via a regularization (shrinkage) parameter α . While Ridge Regression applies an L 2 regularization onto the predictor weights to minimize the impact of irrelevant features, Lasso regression seeks to minimize the number of nonzero predictor weights through L 1 regularization which helps to select only the relevant features. Huber loss function differs from the above-mentioned regression models and assumes a quadratic penalty for small residues and a linear penalty for large residues. Therefore, it decreases the sensitivity to random errors in the target variable and increases robustness 90 . As a regression counterpart of the Support Vector Classifier, Support Vector Regression (SVR) tries to predict the hyperplane fitting the target variable by maximizing the margin and keeping the error within a threshold 91 . Therefore, only the support vectors residing in the margin contribute to the decision boundary and determine the error tolerance of the fitted hyperplane. Although SVR is a non-parametric technique, it is still affected by outliers, because of the possibility of selecting outliers as support vectors. MLP is a feed-forward type artificial neural network consisting of one input layer, one or more hidden layers, and one output layer. MLP can model any linear or nonlinear data by utilizing the nonlinear activation function in its hidden layers and output layer. MLP is trained by using a backpropagation algorithm which iterates backwards (+) Very low complexity (−) Semi-metric, pixel-wise method, highly noise sensitive, inefficient (+) Metric, very low complexity, better than dx mse (−) Pixel-wise method, noise sensitive, inefficient dx lcosh (x, y) = i log(cosh(x i − y i )) (9) (+) Low complexity, better than dx mae , robust to noise (−) Semi-metric, pixel-wise method, inefficient (+) Very low complexity, robust to noise, efficient (+) Metric, very low complexity, robust to noise, efficient (16) (+) Low complexity, robust to noise, better than dw cos , efficient www.nature.com/scientificreports/ the errors from the output layer to the lower layers, and feed-forwards the weight updates from the input layer to the higher layers 92 . All the implementation was carried out in Python 3.6 using sklearn 89 , scipy 93 , keras 94 and tensorflow 95 Python libraries on Intel i9-8950 HK CPU 2.90GHz with NVIDIA GeForce RTX 2070 GPU. The model hyper-parameters were optimized using the tree-structured Parzen estimator (TPE) 96 algorithm using hyperopt 97 . TPE is a Bayesian sequential model-based optimization which uses previous trials to explore new set parameters in the search space in a tree-like fashion. We first determined the parameters of each regression model and optimized them with a 5-fold cross validation technique on our private dataset. The linear model does not present any hyperparameter. However, Huber's hyper-parameter ǫ was found to be 1.1. SVR's tree fundamental hyper-parameters were optimized to a linear kernel, C = 0.1 , and γ = 0.01 . The α values of the Ridge and Lasso regressors were set to 10.0 and 0.001, respectively 89 . For the MLP model, the hidden layers and neurons were found to consist of only one hidden layer with six neurons. In addition, the activation function was optimized to exponential linear unit for each layer; loss function was found as MSE; and batch size and epoch count were determined to be 20 and 300, respectively 94 .
Collecting human ratings. The Sidra Medicine Institutional Review Board was consulted prior to publication of this study regarding the collection of human ratings of facial images. Because (1) human raters were Figure 9. Box and whisker plots of human ratings for the training data (80 raters, 150 images, (a) and test data (50 raters, 60 images, (b). The first, second (median) and third quartiles ( Q 25 , Q 50 , Q 75 ) for training data: x raw = (1.62, 3.27, 4.38) ; x nrm = (4.69, 6.10, 6.54) ; x rnd = (6.40, 6.59, 6.77) . For test data: x raw = (2.58, 3.34, 5.12) ; x rnd = (6.74, 6.85, 6.89) . Note outlier ratings at the lower extremes of both x rnd plots are highlighted. In (c) are displayed some representative images from the test data, along with their corresponding mean human ratings and corresponding rating percentiles within the test dataset. Images in the top row are duplicates of raw images exhibited and referenced earlier in this article, and images in the bottom row are transformations created by the study encoder.
Scientific Reports | (2020) 10:21375 | https://doi.org/10.1038/s41598-020-78180-x www.nature.com/scientificreports/ recruited from the community at large via word-of-mouth, (2) rating was performed remotely online via password-protected, single-use links, and (3) no personal identifying information was obtained from participants, it was advised that formal consideration of the protocol was not warranted. We obtained the human ratings using two different surveys with distinct faces to create a private training dataset and a public test dataset. For the private training dataset, 80 volunteers aged 18-65 rated 150 images (50 raw with deformities, 50 normalized, 50 randomly generated by the Style GAN). Then, we generated 80 uniquely different surveys, each containing 36 images (12 images from each of the three groups). Raw-normalized image pairs were intentionally placed into different surveys to prevent possible rating bias. For the public dataset, 50 volunteers aged 18-65 rated 60 images (30 raw, 30 random generated by the Style GAN). All images were rated on a 1-7 Likert scale (1 most deformed, 7 most normal). Normality rating distributions are illustrated in Fig. 9a,b for the training and test data, respectively. For the test data, surveyors rated 60 images in five different randomly shuffled orders. Some representative images from the test dataset, along with their corresponding mean human ratings, are shown in Fig. 9c.

Data availability
Data supporting the findings of this study are divided into two groups, published data and restricted data. Published data-including open-source images with corresponding normality ratings generated by our model-are available from the corresponding author upon request. Restricted data contains information that could compromise research participant privacy/consent and are not publicly available.