Study on emotion recognition bias in different regional groups

Human-machine communication can be substantially enhanced by the inclusion of high-quality real-time recognition of spontaneous human emotional expressions. However, successful recognition of such expressions can be negatively impacted by factors such as sudden variations of lighting, or intentional obfuscation. Reliable recognition can be more substantively impeded due to the observation that the presentation and meaning of emotional expressions can vary significantly based on the culture of the expressor and the environment within which the emotions are expressed. As an example, an emotion recognition model trained on a regionally-specific database collected from North America might fail to recognize standard emotional expressions from another region, such as East Asia. To address the problem of regional and cultural bias in emotion recognition from facial expressions, we propose a meta-model that fuses multiple emotional cues and features. The proposed approach integrates image features, action level units, micro-expressions and macro-expressions into a multi-cues emotion model (MCAM). Each of the facial attributes incorporated into the model represents a specific category: fine-grained content-independent features, facial muscle movements, short-term facial expressions and high-level facial expressions. The results of the proposed meta-classifier (MCAM) approach show that a) the successful classification of regional facial expressions is based on non-sympathetic features b) learning the emotional facial expressions of some regional groups can confound the successful recognition of emotional expressions of other regional groups unless it is done from scratch and c) the identification of certain facial cues and features of the data-sets that serve to preclude the design of the perfect unbiased classifier. As a result of these observations we posit that to learn certain regional emotional expressions, other regional expressions first have to be “forgotten”.


Background
There are numerous works on emotion recognition.Each of them has different approaches and often use different modalities.In 25 the authors try to recognize emotions by speech signals, while authors in 26 recognize emotions from text, and in 27 the authors recognize emotions via brain signals.There are also approaches that combine different modalities resulting in multi-modal systems 21,28,29 .In this paper we focus only on emotion recognition from facial expressions.There are several distinct emotional cues used for image-based emotion recognition.In addition there are four super-categories of general emotional attribute descriptors that we consider as separate emotional cue.
Macro-Expressions.The most common approach used to classify emotions is to assign labels to a single facial expression.Approaches based on this model extract image-level features and directly map them to emotional labels.While the macro-emotions are not emotional attributes or cues but rather represent categories of facial expressions, an empirical mapping from images to categories is considered a specific type of holistic cue.For instance, using a deep neural network 14 introduced the FER2013 dataset and an end-to-end learning approach for image level emotional labelling.Similarly, more recent approaches have exploited this approach such as in 30 .Current works are generally based on optimizing deep-learning models (especially convolutional neural networks).For example 31 , increased network depth by adding an inception-residual block 32 replaced the last softmax layer with a linear SVM and optimized the loss from the L2-SVM (DLSVM) instead of the cross-entropy loss.Most common macro-expressions labels are: M a = {angry, scared, happy, sad, surprised, neutral}.
Micro-expressions are considered as an indicator of the true mental state of a person due to their spontaneous and subtle nature 33,34 .Micro-expressions are considered to be precursors of macro-expressions and they have been studied experimentally only in the last 10 years.There are two steps for micro-expression recognition.In the first step, spatial features are extracted from a set of video-frames.In the second step, temporal features are combined with the spatial features to form a micro-expression 35,36 .These two steps were recently combined in spatio-temporal features using 3D CNN 34 .The labels for micro-emotion recognition are M i = {positive, negative, surprised} as well as an extended micro-emotion label-set M i+ = {positive, negative, surprised, neutral}.
An action unit (AU) is a component of the Facial Action Coding System (FACS) that represents the movement of specific muscles of a face 1 .To recognize facial expressions belonging to a specific emotion, all active AUs are detected and then a combination of the detected AUs is classified into a specific emotion.Initial works used AU extraction models based on appearance (texture information of the image), geometry (shape properties of a face), and hybrid features 37 .However, newer models use deep learning methods [38][39][40] for AU classification.There is a total of 64 AU specifying observable changes on the face obtained from 41 .From these we use eleven AUs that are used in standard emotional recognition approaches M u = {1, 2, 4, 5, 6, 7, 12, 15, 20, 23, 26} .For example, if a face in an image is detected to have AUs 6 and 12 active, it indicates that the person is expressing happiness 30 .
Image Features are extracted as image agnostic descriptors using a pre-trained convolutional neural network.The image features can be extracted through a variety of existing networks or a specific network can be trained specifically on a target dataset 42 .Usually, the image features resulting from an input image are further used for classification and are considered to be one of the most descriptive.Image features are also one of the most common approaches when generating Macro-expressions (cf.above).While these features are not completely content agnostic, their descriptive power has been shown to be almost universal 43 .Therefore we use the image features as general descriptors and we intend to use them to fill the gap in recognition that other emotional specific markers are not able to cover.The features are extracted using a VGG11 model, pre trained on ImageNet.

Proposed method
The principle behind the proposed methodology is to determine if the existing detectors of emotional cues can be combined in order to bridge data bias and regional bias.For this purpose we consider four different emotional cues, each detecting a different aspect of the facial emotional expression.In particular we use a context-independent set of features, a set of physiological markers, short term, involuntary and low intensity facial expression features and finally a set of macro-emotional labels.The overall schema of the used components for the facial attributes/emotional cues is shown in Fig. 1.
Going bottom-up, the the image features F are considered, region, task and context-independent low-level features.These features were obtained as a result of training on a general dataset that contains a variety of objects, animals and humans.They are expected to be less biased towards any regional data.We selected these features in order to get a feature set as diverse as possible; a feature set independent from the target topic can capture facial attributes that more specialized feature sets can repress due to their low statistical significance.
The second from the bottom are the Action Units U .Mapping of muscular motion through observable facial changes provides a set of coordinates of facial markers that we consider emotion and individual expression independent.While the AU are face-specific markers, we consider them to be region-independent because the muscle activity measurement is a face-specific feature that occurs in any human and is not related to emotional expression only.The action units are used to calculate individual emotional expressions by considering only a subset of facial muscles because the motion of all facial muscles is not necessary to determine muscles' spatial offset characterizing an emotional expression.Therefore the AU can be seen as specific facial-like features without being biased towards any facial type or social behavior.
Third are the features obtained from video recording of emotional expressions I .Micro-expressions are con- sidered to be involuntary, small magnitude facial expressions occurring in time.By definition micro-expressions are simpler (less various), have smaller magnitude and can occur spontaneously or as a precursor to full macroemotional expressions.Therefore, we use the micro-expressions in order to detect emotional expressions of smaller magnitude and variety.In addition, the micro-expression detector is trained on short video sequences: the principle when classifying images with a sequence detector is to classify facial features that match a specific frame index in a sequential emotional expression.Therefore the resulting classification contains information on trends of how this emotional expression image evolves over time.Also, because the micro-expressions are not full fledged emotional expressions, we consider the features extracted from the image to be mapped to only a sub-set of macro-emotional labels.Finally, because the intensity of micro-expressions is in general quite low, we consider the result of micro-expression detection complementary to the detection of other emotions and facial expressions.
The last facial attributes are the macro expressions A .The macro expressions represent the standard facial expression labels such as happy, angry or sad.Unlike the other emotional cues, the macro expressions while being a set of emotional cues they are also the target of facial emotional expression detection: in our experiments the final target is the classification into the macro-expression labels.The macro-expressions are however also detected by a dedicated detector and the result of the detection is similar to all the other emotional cues.The features used for determining the macro-emotions are obtained from training on an image dataset of facial emotional expressions.As such they are very specific to the human face, to emotional expressions and should be the most sensitive to regional bias.
For most of the emotional cues used we define a classifier.Each classifier takes as input a specific emotional cue and outputs a label of recognized emotional expression.The features, detectors, mappings and the presence of a classifier of all used algorithms are shown in Table 1.The column of Table 1 shows the features extracted from the input data, the second column shows the detector that provides this information and the third column shows the mapping of the detector from the input image to the emotional cues.The fourth column indicates if in the default usage a classifier is used to generate the emotional cues.For instance in the case of image features F the emotional cues M f are equivalent to F (the change of nomenclature is introduced for a unified representation).However the facial features from image A are mapped to emotional label M a by a classifier.
Our method starts by first taking an image (frames), process it with the different emotion cue detectors from Table 1.The emotional cues from all detectors are combined and are classified into a label l ∈ M a .The classifica- tion is performed using an attention module that predicts the emotional label.
The architecture of the proposed multi cues emotion model MCAM is illustrated in Fig. 2. We use an action units detector (Fig. 2a), a macro-expression detector (Fig. 2b), a micro-expression detector (Fig. 2c and d), and image features extractor (Fig. 2e).We combine them using an attention-based network (Fig. 2f).

Models. Each of the models used in this work is a well-known algorithm. The MCAM attention network
A is a multi-layered perceptron.It contains three hidden layers, each with ten neurons.The size of input layer varies according to the configuration.The rectified linear unit is used as an activation function.The learning rate is 0.001.The image features detector D f is a pre-trained VGG11 CNN which outputs the image features  Table 1.The detectors of emotional cues used in our approach.Significant values are in bold.

Feature
Detector Mapping Classifier Vol.:(0123456789) The output vector of D f contain 2752 values.The AU were obtained from D u that uses OpenFace 2.0 toolbox.The M u are extracted from the detected face and are classified using an SVM 41 .Intensities of 11 AUs that are responsible for 6 basic emotions were used in our model.Micro-expressions cues M i and M i+ are obtained from a 3D CNN trained on a sequence of images taken in a very short time span (trained on SMIC and SMIC+ respectively).We use single image as a sequence to obtain an emotional label M i = D i (I) from the 3D CNN trained and evaluated on sequence of different images.In addition, as the 3D CNN trained on the SMIC dataset showed low accuracies for datasets FER13 and CK+, and the facial expression datasets contained a neutral face, the 3D CNN was fine-tuned with the CK+ dataset 24 and re-classified for four facial expressions M i+ = {positive, negative, surprised, neutral} .This detector is referred to as M i+ = D i+ (I) .In addition M i and M i+ were evaluated on agglomerated categories: {angry, scared, sad, } → {negative} , {happy} → {positive} , {surprised} → {surprised} and {neutral} → {neutral} was used only for M i+ .
The macro expressions M a were obtained from a D a based on the model proposed in 42 .It performs the map- ping M a = D a (I) .The result contains labels from angry, scared, happy, sad, surprised, neutral.

The attention network.
The attention network assigns weights to each of its inputs in order to strengthen the impact of different input modalities.While the attention computational mechanism is well known we follow the approaches of some of the more recent works 44,45 .The concept of the model computational attention is that successive attention models are able to extract the most important verbal tokens and provide the state-of-the-art accuracy in the translation tasks.We use the concept of attention by combining the emotional cues and image features and train the attention network for the final classification.In particular our model computes the soft weight for each of the possible emotional labels.The method is therefore an instance of soft algorithm selection where the output of the attention network is the confidence score distributed over each of the available target emotions.
All together the cues from M u , M a , M i , M i+ and M f are used as input to the attention multi-layer percep- tron (Fig. 2f).We define S = {M a , M i , M i+ , M u , M f } as the set of all emotional cues.We will refer to a subset of S as follows.A subset S −i refers to the emotional cues that are missing from the subset, therefore in this case We also define that a subset S f ,i contains only the {M f , M i } emotional cues.The introduction of this positive and negative notation is convenient for indicating the components of the original set S more concisely.
The MCAM attention model represents a class of parametric models specified in terms of the input emotional cues.In order to represent the various configuration we will use A subscripted by the indices of used emotional cues set.The default model A is used to represent the MCAM model that uses all studied emotional cues defined in the S set.An instance specified by A −i+,−u indicates an attention model that takes as input the set Training.The combination of the different emotional cues is tested on a series of cross-dataset evaluation tests: we train the emotional cues on one region-specific dataset and evaluate them on a different region-specific dataset.Specifically, we combine the regionally pre-trained, region-agnostic and task-agnostic detectors to determine how the region-specific methods can be manipulated into a universal, region-independent emotional expression detectors.As the used emotional cues range from holistic emotional expressions labels to low-level image features, the proposed methodology can be described on two different levels.First we analyze how micro and macro-expression detectors, trained on regional data, are affected by the regional specificity.This means that we consider combinations of regionally trained facial expression detectors and determine how they generalize to other region-biased expressions for which they have not been trained.Second, we study how image features and the action units can help to regionally-debias the regionally trained emotional expression detectors.While the image features are used in the macro and micro emotions detectors they are subject to bias due to them being used as an input to a region-specific classifier.Using these features as-is is therefore a method to bridge a possible gap in representation of the facial information that might be lost during the learning.
Three sets of experiments are used to determine the emotional cues interference and bias.At first, we use pre-trained emotional cues detectors and learn the attention using an MLP.This experiment is then extended into learning attention over individual detectors: we learn the attention using a single detector at a time.The last approach is focused on fine-tuning the individual detectors and then learning the attention.
Global and local adaptation.We distinguish two types of learning approaches: globally and locally adapted features.In the global adaptation mode (GAM) we train a network initiated from scratch on the target dataset while in the local adaptation mode (LAM) we only fine-tune a model already trained on a different dataset.For instance, in GAM mode, to classify the JAFFE data, we train the attention module on the train set of JAFFE and evaluate the MCAM on the eval subset.In the LAM mode, we first fine-tune the existing emotional cues on the training dataset of the target regional data.Then we train the attention module on the same training set and evaluate the whole MCAM on the eval dataset.
The image features detector D f and the action units detector D u are not learned or fine-tuned but are instead used pre-trained and out-of-the box.The macro emotional expression detector D a was a VGGnet based CNN (macro emotional labels), pre-trained on FER2013.The micro and micro+ emotional expression detectors D i and D i+ respectively, were both trained on SMIC.The main difference between D i and D i+ is that D i+ was also fine tuned on CK+.
The two approaches GAM and LAM thus result in detectors that are either trained on FER2013 and SMIC with the attention network trained on the target dataset or in detectors fine-tuned on the target dataset with the attention trained on the same target dataset.
for additional details on models, training and testing procedures please consult the appendices.

Datasets.
For training the MCAM model, we separate data into train and test splits with 80% and 20% ratios.
In total we consider five datasets: FER2013, SMIC, JAFFE, TFED and TFEID.The two datasets FER2013 and SMIC were used only for the training the macro-emotions detector D a and micro-expression detectors D i /D i+ .

Accuracy of the baseline models.
In this section we provide experimental results for the state-of-theart methods, A , A −f and the individual D i , D i+ and D a emotional-cue detectors.The general concept of these experiments is to simply verify the accuracy of the detectors that provide the emotional expressions labels as output by default.The summary of the results is shown in Table 2.The last three columns show the accuracy of macro emotion recognition using directly D a , D i and D i+ .The individual detectors D a , D i and D i+ are not trained on datasets that were used for testing, except the D i+ (fine tuned on CK+).The MCAM models A −f and A were LAM trained on each of the datasets respectively.We do not report the accuracy for the action units and image features because D u and D f do not provide labels for emotional expressions by default.The second column provides the state-ofthe-art accuracy for the given datasets, while the third and fourth columns show our method's results for A −f and A, respectively.The values in each column indicate the accuracy on the test set: the testing of the learned model was performed on 20% of each used dataset.Cells colored in blue show where our method matched or improved the result of the SOTA method (second column).
The first observation is that the accuracy of the single emotional modality detectors (columns five, six and seven in Table 2) is much worse and always worse than when they are used together in the studied MCAM based approaches.It can be seen that on all three of the datasets (JAFFE, TFED, TFEID) all three detectors performed by at least 15% of accuracy below the SOTA.Also D i performed poorly on CK+ which is not regionally different from the training set of D i and D i+ , SMIC.Interestingly D a performed best on CK+ as well as did D i+ .D i+ was LAM trained on CK+ and therefore it is expected that the results would be acceptable.However, D a was only trained on FER2013 and still the accuracy on CK+ was acceptable.We observe that the detectors D a , D i and D i+ have show performance degradation with respect to regionally biased datasets.However the MCAM model showed very steady improvement using only LAM training.This means that while the information for classification of emotional expressions from different regions is available in the emotional cue detectors it is not properly used for classification on a dataset-by-dataset basis.
The second observation is that for CK+, TFED and TFEID, our proposed method was able to outperform the current SOTA approaches without being fine-tuned.Also note that in the case of JAFFE the classification using MCAM increases the result by more than 40% in accuracy when compared with the highest accuracy of emotional detector used (here D a for JAFFE ).For TFEID the increase is more than 25%, for CK+ it is 8% and for TFED the increase is up to 20%.
Observe that the MCAM A −f outperformed the A model: removing the image features increased the accuracy.Because the training data was large enough we consider that this difference is not due to the lack of training samples.Rather we cautiously conclude that the M f features might be causing interference in the classification process.
In addition, one can stipulate that even though the base models were not trained on any of the test sets under evaluation, and individual emotional detectors have a quite low accuracy, together (with AU emotional cues) they provide enough information to extract the most salient emotional expressions.One could argue that the training of the MCAM models is in a sense similar to a learning transfer on each of the individual component emotional recognition models.While such an assumption is valid, our goal here is different: instead of adapting each model to a particular dataset, we are interested as to whether the results of the originally trained models can be simply re-weighted such that decisions on new datasets can be obtained from a mixture of individual models' positive and negative results.This approach can be thus classified as a sensitivity and discrimination testing of existing methods.
Finally, our initial hypothesis that using macro expressions, action units, micro-expressions and/or M f features would allow us to bridge the gap of emotional expression in all of the selected datasets was not fully achieved: our method did not outperform the SOTA for the JAFFE dataset.

Single cue classifiers.
In order to understand the results from the composition of emotional cues, we decided to analyze MCAM models that use single emotional detectors.For this purpose we took every individual detector and used its outputs as inputs to the attention network A .The attention network was then LAM trained to the dataset under evaluation.In addition we also performed two experiments with GAM training using the D a and D i .
Table 3 shows the summary of the experiments.The first column shows the dataset under evaluation, columns two to six show the different LAM trained single detector MCAM models and the last two columns show the GAM trained A a and A i .
Unexpectedly, two of the classification tasks obtained perfect (overfitted) results.In particular the JAFFE and the CK+ dataset was perfectly classified when the A f MCAM model was used.was used In addition very high accuracy of classification was also achieved for the TFEID dataset with the A f model.The interesting result is that while JAFFE, TFEID and CK+ have their accuracies increased over the SOTA results from Table 2 the results for TFED show a decrease in accuracy.In addition the result of A a and A i GAM trained outperform their LAM trained counter parts.

GAM.
Next, to verify if better performance could be achieved by re-weighting the results of the emotional detectors we performed a set of GAM training experiments.For this purpose the D a and the D i /D i+ models were fine-tuned on the datasets used for evaluation and then cross-evaluated against the MCAM model A −f .Table 4 shows the datasets used for the GAM training as rows and the datasets used for evaluation as columns.The first observation is that GAM learning is not always a symmetric process.For instance, the model GAM trained with JAFFE has much better results on TFED than the model GAM trained with TFED on JAFFE.The improvement from GAM training is observed in a symmetric manner for TFED/TFEID, CK+/TFEID, JAFFE/TFEID and CK+/JAFFE.Moreover, with the refining of the feature detectors the accuracy on the JAFFE dataset reached, but did not improve beyond the SOTA performance.
Therefore we evaluated the GAM trained A (Table 5).The first interesting observation is that the addition of the image features extracted from using D f increased the performance of all models in a non-linear manner.Interestingly, the GAM training from any of the datasets resulted in 100% accuracy when evaluating the accuracy Table 3. Evaluation of MCAM models with individual emotional cues.Significant values are in bold.on CK+ and TFEID.For JAFFE, all GAm training using datasets beside CK+ resulted in an increase in accuracy.However, for TFED the GAM training did not result in maximum accuracy when compared to the results from Table 4 and Table 3, but the accuracy is increased with respect to the SOTA (Table 2).Also observe, that for JAFFE 100% accuracy was not achieved as is the case when only A f (Table 3).This could further confirm our initial observation that there is a general interference between the learned macro-emotions, micro-expressions and action units and the image features.While the combination and the GAM training works well for TFEID and CK+, its effect is still positive (improving over the SOTA) on the JAFFE dataset but it reduces the accuracy for TFED (for A ).This is very interesting because the simple operation of adding M f features acts on one hand as improving the emotion detection due to pixel level information and on the other hand as pixel level noise reducing the accuracy of classification.Interestingly TFED had the highest accuracy with single emotional detectors given by the D u and by the D a (Table 3).However it also had one of the lowest classification accuracies when only the A i /A i+ models were used: it can be assumed that the finer grain features or cues are be interfering with the more distinctive emotional cues and therefore using too much fine-grain information is diminishing the performance on the TFED dataset.

GAM vs LAM.
The final set of experiments was performed in order to better understand the effect of the GAM training on the individual datasets.For this purpose a set of ablation experiments were conducted.The experiments evaluate the accuracy of the A −i and A models that are component-wise GAM trained and evaluated on the same dataset.Figures 3a to d show the GAM training ablation study for each of the datasets.Each graph shows the GAM training using all four datasets (one line per fine-tuning dataset) and each data point shows GAM trained models.The first, third and fifth data points show the A with different components algorithm GAM trained while the second, fourth and sixth data point shows the A −f under same GAM trained conditions.Similarly to the previous experiments, the only GAM trained component models are D a and D i+ .
Figure 3a shows the evaluation on JAFFE.The main observation is that JAFFE is most sensitive to the addition of the M f pre-trained image features, therefore confirming the previous observations.In addition, when M f features are not used the GAM training that works the best is not self GAM training (JAFFE training, JAFFE evaluating) but rather the CK+ GAM trainining (black line, second datapoint).Finally, as can also be seen in Table 5 (column JAFFE), when the M a and M i+ algorithms are GAM trained and are used in A , the GAM train- ing on JAFFE, TFED and TFEID results in the same classification accuracy.
A similar trend can be observed for the evaluation of the ablation on the TFEID dataset (Fig. 3b).The main difference is that in this case, the self-tuning (TFEID tuning, TFEID evaluation) results always in highest accuracy independently on the ablation model evaluated, while the fine tuning using JAFFE in general results in the lowest or one of the lowest obtained accuracy.In addition, when only the M i+ or the M a and M i+ are fine-tuned, each shows the same accuracy improvement (data point three and five).
In the case of TFED (Fig. 3c) the results are different from the two previous cases.The first observation is that the best result is obtained only when all components of emotional detectors have been GAM trained (sixth data point).In addition, unlike in the previous two ablation cases, the best result is when the A −f (not A ) is used with all components detectors GAM trained.The second observation is that the self GAM training (GAM trained on TFED, evaluated on TFED) has resulted in very similar accuracy improvement profile as the GAM training on TFEID.The only considerable different case is the first data point, where only D a has been GAM trained and is used in A a .In addition when only the D i+ was GAM trained, the best GAM training is the one performed on TFEID and not the self GAM training.The final case is the evaluation of the CK+ dataset (Fig. 3d).The accuracy changes due to GAM training of individual component detectors is similar to the cases of JAFFE and TFEID: the addition of M f features results in observably increased accuracy of classification.The main difference is that in average the accuracy on the CK+ dataset is higher than on any other evaluated dataset.The reason for this difference is most probably due to the fact that both the D a and D i+ have been trained on FER13 and SMIC respectively.Both these datasets are closer in the data bias to CK+ than to any other evaluated dataset.FER13 and SMIC datasets contain mostly Caucasian and African-American individuals with few Asian individuals.In addition the D i+ was GAM trained on CK+ .
In addition, while self GAM training also generates in general the highest obtained accuracy, when only the D i+ is GAM training, the highest accuracy is given by the TFED GAM training and not self GAM training by CK+.

Discussion
The most important observation made in this work is the excluding nature of certain features.The M f image level features improved two (CK+, TFEID) out of the four datasets up to 100% accuracy.In addition M f image features also allowed us to attain 100% for JAFFE but only if no other emotional cues were used.This effect empirically indicates that for JAFFE the M f features interfere with all other emotional cues trained on other regional datasets.Such interference is also observable on TFED where M f features actually reduce the accuracy of recognition.
Combining these observations can lead to a model for emotional recognition: in order to recognize emotional expressions for various regional groups, certain emotional cues used to identify emotions in some regional group are preventing the efficient recognition of emotion in another regional group.This would indicate that the emotion recognition bias is not due to lack of effective emotional cues but rather due to the lack of effective forgetting of pre-learned emotional cues.For instance in JAFFE only the fine-grain image-level features seem to be effective.This would indicate that emotional expression in JAFFE are the least expressive.On the other hand the emotional expression recognition accuracy in TFED is reduced by introducing the image-level features: this would indicate that emotions in TFED are closer to the pre-learned model on FER2013 than is JAFFE.
Finally one can summarize the expressed emotions to a relational graph.TFEID and CK+ represent emotional models that are effectively described by all emotional cues, JAFFE is exclusive to only image-level features and TFED is exclusive to explicitly not use image-level features.
The perfect accuracy of classification on the various test sets, shows a possibility of overfitting.Therefore the results provided in this paper can be seen as an approximate evaluation.The datasets did not have the same number of images and there were only six output categories.With a relatively small number of images the classifier  might not have enough of parameters or has too many (and thus memorizes) to properly learn the image.In addition the empirical experiments in this work showed bias between different regional based datasets.However we did not mix datasets and we did not analyze them individually.Therefore the origin of the bias can be the data content (different regional groups), collection method (local, random, etc), data set quality (noise, resolution, etc) or the lack of diversity within the collected data.
While the overfitting of the classification is a negative result, it shows which emotional cues and image features are necessary to perfectly embed or memorize the emotional expressions.Interestingly the overfitting happens under different combinations of emotional cues for the different datasets.Looking at the results, the most overfitting occurs when the VGG-11 features are used.This would indicate that the number of the features is enough to embed the whole dataset entirely.However because this trend is not absolute we also detected some interference between features and emotional cues.

Conclusion
In this paper we empirically showed that there is an observable bias when one attempts to generalize the current SOTA emotion recognition methodology.We were able to bridge the bias between some datasets but we also showed that some datasets are not directly compatible: for high accuracy emotion recognition they use mutually interfering features.In addition, the distinctive and occasionally contradictory nature of regionally-specific emotional expression can serve to obstruct and confound the transfer of learning of emotional recognition based on the facial expressions of one region to that of another, thus preventing the achievement of a "perfect" classifier.As we determined, it is sometimes better to "forget" what was previously learned, and proceed from first-principles.In such cases, it is our hypothesis that a fusion-scoring approach across multiple categories may nevertheless partially overcome this obstacle to improve (but not perfect) the emotion recognition outcomes.
As a future work we plan to investigate the weight of individual emotional cues in emotion detection in different ethnic datasets and develop a generative model for emotion expression generation.

Figure 1 .
Figure 1.The four kinds of features used in our proposed facial emotional expression detector.

Figure 2 .
Figure 2. Overview of the proposed MCAM framework.
Fine-tuning of component models for CK+ data set

Figure 3 .
Figure 3. Ablation experiments on all four datasets evaluation samples.

Table 2 .
Accuracies in percentage for the four datasets used for testing MCAM approach.Significant values are in bold.

Table 4 .
Accuracy of fine-tuned A −f on base models in percentage.Significant values are in bold.

Table 5 .
Accuracy of fine-tuned A on base models in percentage.Significant values are in bold.