Abstract
Deep learning techniques have proven effective for the facial emotion recognition (FER) problem. However, they demand a significant amount of labelled data, which is often unavailable due to privacy and ethical concerns. In this paper, we present a novel approach for addressing the FER problem using multi-source transfer learning. The proposed method leverages the knowledge from multiple data sources of similar domains to inform the model on a related task. The approach optimizes the aggregate multivariate correlation among the source tasks trained on the source datasets, thus controlling the transfer of information to the target task. The hypothesis is validated on benchmark datasets for facial emotion recognition and image classification tasks, and the results demonstrate the effectiveness of the proposed method in capturing the group correlation among features, as well as its robustness to negative transfer and strong performance in few-shot multi-source adaptation. With respect to the state-of-the-art methods MCW and DECISION, our approach shows an improvement of 7% and \(\sim\)15%, respectively.
Introduction
Facial expressions play a crucial role in social communication as they are good indicators of the emotional state and intentions of humans^{1}. Understanding one’s emotional state is pivotal in providing the responses one intends for their actions. With the advances in deep learning and sensor technologies, there has been increased attention to capturing a person’s emotional state from their facial expression^{2,3}. Automatic facial emotion recognition has seen wide application in scenarios where data acquisition opportunities are limited, such as among individuals with intellectual disabilities and children with autism^{4}, as well as in robotics^{5,6}, entertainment^{7,8,9}, assistive systems^{10} and more.
Recent studies have achieved tremendous progress in facial emotion recognition using deep learning techniques in situations where a large amount of annotated data is available^{11,12}. FER datasets usually have a limited number of data samples and differ in emotion labels, poses and conditions of data collection, which limits their compatibility with deep learning frameworks. Procuring large amounts of facial expression data is also challenging given the privacy concerns related to sharing identifying facial images, as well as time and resource constraints. However, pre-trained classifiers are available for identifying the emotion labels in these data, and each of these networks carries the knowledge gained from its collected set of facial emotion patterns. To address the challenge of data scarcity, recent research has employed transfer learning techniques to relay the knowledge captured from one domain to another.
With the availability of multiple datasets, Multi-Source Domain Adaptation (MSDA)^{13} has gained interest, wherein multiple labelled source domains are used to transfer the learnt knowledge to the target domain. The generalizability of multi-source transfer learning in providing a broader view of the target domain has been demonstrated in prior works^{14,15,16,17}. One common approach for multi-source domain adaptation is to align the source and target feature representations and reduce the classification loss on the source data^{18,19}. Another approach is to encourage agreement across source-target predictions rather than working on the feature representations^{20}. Similar to its success in domains like text, we hypothesise that in FER as well, the domain information provided by multiple source tasks can be leveraged to capture the underlying characteristics of emotion expression in humans across domains, irrespective of variations in poses, cultures, ethnicity and condition mismatch.
All the above methods assume access to the source data for adapting the source knowledge to the target domain. In practical scenarios, owing to privacy, security, and management reasons, only a trained source model is available, where access to the source data, as well as control over the source training, is restricted. In this work, we explore the multi-source domain adaptation (MSDA) setting where only multiple pre-trained source models are available for supervising the domain adaptation, while the source datasets are not accessible. Recent research addresses this by adapting a single source^{21} or multiple sources, as in the DECISION method^{22}, to the target domain without access to the source data, meanwhile assuming that sufficient target data is accessible. But in practical scenarios like facial emotion recognition in children with autism, medical data and so on, acquiring sufficient training data is challenging owing to the distinctive nature of the cohort. Inspired by prior work on Maximal Correlation Weighting (MCW)^{14}, we assume a few-shot setting where a few labelled target samples are available for supervising the adaptation. To summarise, we aim to address the FER problem by constructing a multi-source domain adaptation problem where the source datasets are unavailable, while we have access to a small target dataset with few samples.
To address the above-defined problem, we leverage the maximal correlation approach^{14}, where the features generated by the pre-trained networks are represented as maximal correlation functions. We particularly look into the multivariate correlation^{23} of the source features with the target domain, thereby capturing the complex association between the high-dimensional source features and the target. To sum up, the main contributions of this work are:

We propose a multi-source domain adaptation approach for facial emotion recognition by leveraging multivariate maximal correlation analysis, using a few labelled target samples without access to source data.

We evaluate our approach on the FER task by conducting extensive experiments on benchmark FER datasets. Experiments show that our approach consistently improves the results over the best single-source model. Further, our approach outperforms the state-of-the-art methods MCW^{14} and DECISION^{22} across multiple datasets.

We also show the ability of the approach to generalize to domains outside FER by performing a general image classification task with the CIFAR-100 dataset.
This paper is organized as follows: In Section "Related work", we present the related work. In Section "Preliminaries", we introduce the preliminary concepts used in the proposed work, and in Section "Proposed model for multi-source transfer learning using multivariate correlation analysis (MSTL-MCA)", we develop the proposed method for multi-source transfer learning. In Section "Experimental setup", we describe the experimental details on the facial expression datasets as well as on standard image datasets; we discuss the results in Section "Results and analysis", and Section "Conclusion" summarises the work.
Related work
Recent research focuses on deep learning techniques for automatic facial emotion recognition. This section discusses various deep learning approaches present in the literature for facial emotion recognition.
Transfer learning in facial emotion recognition
FER has witnessed a breakthrough with the advent of deep learning techniques, which eliminated the tedious preprocessing phase and provided end-to-end solutions from the input visual information to the emotion recognition. An end-to-end learning framework based on deep region and multi-label learning was proposed for the detection of facial action units in^{24}. Another approach shows that combining multiple networks yields better performance in automatic facial emotion recognition: CNN-LSTM and C3D networks were used in conjunction to simultaneously model video appearance and motion^{25}. The method in^{26} is robust to variations in expression intensity by learning spatio-temporal feature representations for FER. In^{27}, a face detection and face alignment deep neural network with inception layers is used to address the FER problem. Research also shows that preprocessing the images before feeding them to deep neural networks improves classifier performance: preprocessing image data before it is fed into a convolutional neural network (CNN) has been shown to have a positive effect on the learning process^{28}. In that work, rather than feeding raw input, the data were preprocessed to extract expression-specific features from a face image, which were then fed into a convolutional neural network for emotion recognition. A convolutional neural network with an attention mechanism (ACNN) has been shown to be efficient in perceiving occluded regions of the face and has been used to recognize facial emotions in the wild in the presence of occlusions^{29}.
One characteristic of these deep learning methods is the need for a large amount of data to train the deep neural network architecture. Training a deep learning framework on the relatively small FER datasets leads to overfitting. Access to such large collections of data is often challenging, especially in applications involving children. Further, annotating data for FER is an extremely time-consuming and resource-intensive process.
To mitigate this, several studies propose transfer learning techniques, where knowledge gained from models pre-trained on similar large datasets is transferred to the domain-specific learning task. Knyazev et al.^{30} proposed an ensemble of industry-level face recognition networks pre-trained on large facial emotion databases such as FER2013 for emotion recognition. Aly et al.^{31} proposed a multi-stage Progressive Transfer Learning method by fine-tuning the AlexNet convolutional network and demonstrated the FER performance on the VT-KFER and 300W datasets. Ngo et al.^{32} demonstrated a transfer learning approach using the SE-ResNet-50 model pre-trained on the VGGFace2 database, along with a novel cluster loss function, to transfer the high-level features learned by the network to FER. These methods leverage a single-source transfer learning approach, where the source networks are trained on data from a single domain.
Multisource domain adaptation
With the availability of a large number of datasets, even if each has limited data samples, it is an intuitive step to take advantage of the diverse information captured by the different sources. Multi-source transfer learning has been explored widely in text classification^{33}, pattern recognition in EEG signals^{34}, speech recognition^{35}, etc. One approach to multi-source transfer learning relies on the assumption that the target task can be represented as a weighted combination of the source tasks^{36}. One common approach to learning these combination weights is latent space transformation, which learns a common function across the different source tasks by optimizing the overall loss function. Guo et al.^{37} proposed a supervised multi-source domain adaptation method by establishing a set of distance measures to add to the loss function to be minimized for better domain adaptation. Zhao et al.^{38} used the \({\mathscr {H}}\)-divergence to measure the distance between two domain distributions, while the authors of^{13} employed adversarial methods using a GAN loss to generate domain-independent feature representations. Meta-learning models such as MAML^{39} have been developed, which can serve as a starting point for learning a good model fine-tuned to a target task using only a few local gradient steps. Using the performance gap as a measure of divergence between source and target distributions, together with instance weighting, Wang et al.^{40} proposed a boosting approach for transfer learning that exploits the label information in the target domain.
Many MSDA approaches train domain-specific classifiers and learn a weighted ensemble of these source classifiers for the target prediction^{13,14,41,42}. These methods expect access to the target dataset for learning a rule to combine the source classifiers. Guo et al.^{37} used a point-set distance metric and a meta-learning approach to combine the source models for target prediction^{41}. Yue et al.^{18} exploited domain-invariant and class-discriminative features augmented with an alignment loss for MSDA. Ahmed et al.^{22} addressed the MSDA problem without accessing the source data by employing Information Maximisation (IM) and a pseudo-labeling strategy. Their approach demanded sufficient target data for training the ensemble source network, which may not be practical in many applications. Lee et al.^{14} introduced a multi-source transfer learning method for image classification, which also addresses the data privacy concerns of transfer learning methods. In this method, the knowledge gained by multiple source networks is transferred to the target task without access to source samples. Considering the pre-trained source networks as black boxes, they used bivariate maximal correlation analysis to train the ensemble of source networks, and a weighted combination of features extracted from the source networks was used to build the target classifier. This approach treats the features as independent of each other and leaves out the group correlation among the features within each source while combining the source networks. Recent studies show evidence of better generalization in multi-source transfer learning compared to single-source transfer learning in FER applications^{43}.
The review shows that there have been limited studies exploring the possibilities of multi-source transfer learning in automatic FER. Existing studies either work on single-source transfer learning (e.g.^{44}) or require access to the source datasets for domain adaptation (e.g.^{31}). Inspired by the success of multi-source transfer learning in other domains like NLP (e.g.^{45,46}), and considering the challenge of data scarcity in FER, we propose a multi-source transfer learning approach to train a target classifier from a weighted ensemble of pre-trained source networks trained on different source datasets. We utilize the features extracted from the different pre-trained source networks and construct a classifier for the target FER task. Our method for aggregating the features from pre-trained networks relies on the hypothesis that in real-world situations, features may exhibit a weak correlation with the target class when considered individually, but can generate a strong correlation when considered together^{47}. To address this, we propose the use of multivariate maximal correlation to determine the weights of the source networks contributing to the target classification task. We leverage the Alternating Conditional Expectation (ACE) based method^{23}, which captures the non-linear association among random variables in a multivariate setting. We further perform few-shot training with target samples to learn the target classifier, demonstrating the effectiveness of our method in scenarios with limited training data.
Preliminaries
Multivariate maximal correlation analysis
Maximal correlation was first introduced and developed by Hirschfeld^{48}, Gebelein^{49}, and Rényi^{50} as a measure of the non-linear association between two random variables \(X_1\) and \(X_2\). It measures the strength of association between two random variables and characterises the non-linear transformations of the variables. We analyse the multivariate correlation of the features on the target classifiers and build an effective and computationally efficient approach for multi-source transfer learning.
Definition 1
(Maximal Correlation) Given two jointly distributed random variables \(X, Y \in {\mathscr {X}}\) with positive variance, the maximal correlation of (X, Y) is defined as:

$$\rho (X,Y) = \sup _{f,g} \mathop {{\mathbb {E}}}[f(X)g(Y)] \quad \text {s.t. } \mathop {{\mathbb {E}}}[f(X)]=\mathop {{\mathbb {E}}}[g(Y)]=0,\; \mathop {{\mathbb {E}}}[f(X)^2]=\mathop {{\mathbb {E}}}[g(Y)^2]=1,$$

where the expectations are with respect to the joint distribution \(P_{X,Y}\). The optimizing pair \((f^*,g^*)\) are referred to as the maximal correlation functions.
Maximal correlation is equal to the second-largest singular value of a scaled joint probability distribution matrix, and when the variables are discrete, the singular vectors of this matrix characterize the optimal transformations of the variables^{23}. Given \(f^* = \{f_1,f_2,\dots \}\) and \(g^* = \{g_1,g_2,\dots \}\) with the associated singular values \(\rho _1,\rho _2,\dots \), the joint probability distribution \(P_{X,Y}\) is given by^{51}:

$$P_{X,Y}(x,y) = P_X(x)\,P_Y(y)\Big (1 + \sum _{k \ge 1} \rho _k f_k(x) g_k(y)\Big ),$$

where the maximal correlation functions satisfy the orthonormality conditions \(\mathop {{\mathbb {E}}}[f_k(X)f_l(X)] = \mathop {{\mathbb {E}}}[g_k(Y)g_l(Y)] = \delta _{kl}\).
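The SVD characterization above can be checked directly for discrete variables. The following minimal NumPy sketch (the 2×2 joint pmf is a hypothetical example, not from the paper) computes the maximal correlation as the second singular value of the scaled joint distribution matrix:

```python
import numpy as np

def maximal_correlation(P_xy):
    """HGR maximal correlation of two discrete random variables,
    computed as the second-largest singular value of the scaled
    joint probability distribution matrix."""
    P_x = P_xy.sum(axis=1)                  # marginal of X
    P_y = P_xy.sum(axis=0)                  # marginal of Y
    Q = P_xy / np.sqrt(np.outer(P_x, P_y))  # scaled joint distribution
    s = np.linalg.svd(Q, compute_uv=False)  # singular values, descending
    # The largest singular value is always 1 (constant functions);
    # the second one equals rho(X, Y).
    return s[1]

# Symmetric binary example with P(X = Y) = 0.8
P = np.array([[0.4, 0.1],
              [0.1, 0.4]])
print(round(maximal_correlation(P), 4))  # 0.6
```

For this symmetric binary pmf the result coincides with the ordinary Pearson correlation (0.6), while for an independent pair the second singular value is zero.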
In a system of continuous random variables, most correlation measures consider only the pairwise relationship between the variables. In real-world datasets, data instances are represented as high-dimensional multivariate random variables \((X_1, X_2,\dots ,X_d)\). Extending Definition 1 to multivariate random variables, the maximal correlation among a real-valued multivariate random variable \(X = \{X_i\}^d_{i=1}\) can be given as

$$\rho (X_1,\dots ,X_d) = \sup _{\{f_i\}} \sum _{i \ne i'} \mathop {{\mathbb {E}}}[f_i(X_i) f_{i'}(X_{i'})] \quad \text {s.t. } \mathop {{\mathbb {E}}}[f_i(X_i)]=0,\; \mathop {{\mathbb {E}}}[f_i(X_i)^2]=1.$$
Using bivariate measures to capture multivariate relationships may not be efficient in capturing the association among the variables^{52}. Methods like the Maximal Information Coefficient (MIC)^{53} and Canonical Correlation Analysis (CCA)^{54} consider either only two dimensions or only linear correlations. In real-world scenarios, a feature may correlate weakly with the target class when considered individually, but a group of features taken together can lead to a strong correlation^{47}. Further, it is computationally expensive to evaluate all the pairwise relations: the computation of maximal correlation for multivariate data becomes an optimization problem with complexity quadratic in the dimension, i.e. \(O(n^2)\), where n is the feature dimension. In this approach, for n-dimensional data, each \(X_i\) is paired with the n-1 other elements to find the correlation among the random elements, and solving the maximal correlation means optimizing these n(n-1)/2 transformation functions. Multivariate maximal correlation analysis solves this by considering the group correlation among the features^{52}. Maximal correlation eliminates assumptions on the data distribution and captures non-linear relations.
A computationally efficient method for addressing multivariate maximal correlation, based on the Alternating Conditional Expectation (ACE) framework, was proposed in^{23}. It determines a single transformation function for each random variable, thereby reducing the computational complexity of computing the multivariate maximal correlation. This approach optimizes the correlation functions by maximizing the aggregate inner products between the transformed variables. Given a system of continuous random variables, it infers non-linear transformation functions assigned to each variable, represented as vertices of a graph, such that the aggregate pairwise correlations over the graph G are maximized. The ACE-based approach for computing multivariate maximal correlation is given in Algorithm 1.
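The alternating update at the heart of ACE — repeatedly setting each transformation to the conditional expectation of the other and renormalizing — can be illustrated in the bivariate discrete case. This is a simplified sketch of the idea, not the paper's Algorithm 1:

```python
import numpy as np

def ace_bivariate(P_xy, iters=50):
    """Bivariate ACE: alternately set f(x) = E[g(Y)|X=x] and
    g(y) = E[f(X)|Y=y], renormalizing to zero mean / unit variance
    under the marginals. Returns f, g and the correlation E[f(X)g(Y)]."""
    P_x, P_y = P_xy.sum(axis=1), P_xy.sum(axis=0)
    P_y_given_x = P_xy / P_x[:, None]        # conditional pmf of Y given X
    P_x_given_y = (P_xy / P_y[None, :]).T    # conditional pmf of X given Y

    def normalize(h, p):                     # zero mean, unit variance under p
        h = h - (p * h).sum()
        return h / np.sqrt((p * h ** 2).sum())

    g = normalize(np.arange(len(P_y), dtype=float), P_y)  # initial guess
    for _ in range(iters):
        f = normalize(P_y_given_x @ g, P_x)  # f(x) = E[g(Y) | X = x]
        g = normalize(P_x_given_y @ f, P_y)  # g(y) = E[f(X) | Y = y]
    rho = f @ P_xy @ g                       # E[f(X) g(Y)]
    return f, g, rho

P = np.array([[0.4, 0.1],
              [0.1, 0.4]])
_, _, rho = ace_bivariate(P)
print(round(rho, 4))  # 0.6
```

The iteration behaves like a power method on the scaled joint distribution matrix, so the returned correlation matches the SVD-based value for this pmf.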
Definition 2
Let \(G = (V,E)\) be a graph with vertices \(V = \{1,2,\dots ,n\}\) and edges \(E \subseteq \{(i,i'):i,i' \in V,i \ne i'\}\). The multivariate maximal correlation of \((X_1, X_2,\dots ,X_n)\) given G is

$$\rho (X_1,\dots ,X_n; G) = \sup _{\{f_i\}_{i=1}^n} \sum _{(i,i') \in E} \mathop {{\mathbb {E}}}[f_i(X_i) f_{i'}(X_{i'})]$$

such that \(\mathop {{\mathbb {E}}}[f_i(X_i)]=0 \text { and } \mathop {{\mathbb {E}}}[f_i(X_i)^2]=1,\; \forall \, 1 \le i \le n\).
Proposed model for multi-source transfer learning using multivariate correlation analysis (MSTL-MCA)
Problem Setting: We formulate facial emotion recognition with scarce data as a multi-source domain adaptation (MSDA) problem, in which there are N labelled source domains and one target domain with few labelled samples. Let the input space be \({\mathscr {X}}\), and let the classification be among M categories. We represent the pre-trained source models as \(\{\theta _S^j\}_{j=1}^N\), where the \(j^{th}\) model \(\theta _S^j: {\mathscr {X}}\rightarrow \mathop {{\mathbb {R}}}^{M}\) is a classifier learnt from the source dataset \(D_S^j = \{x_{S_j}^i,y_{S_j}^i\}_{i=1}^{N_j}\) with \(N_j\) data points. \(x_{S_j}^i\) denotes the \(i^{th}\) source image in source \(S_j\), and \(y_{S_j}^i\) denotes the corresponding label. Given a target dataset \(D_T = \{x_T^i,y_{T}^i\}_{i=1}^{N_T}\) with few samples, the problem we address is to learn a classifier \(\theta _T:{\mathscr {X}}\rightarrow \mathop {{\mathbb {R}}}^{M}\) using the ensemble of pre-trained source classifiers, without access to the source datasets. The data points are facial expression images represented by \((x_1,y_1),\dots ,(x_n,y_n)\), where \((x,y)\in {\mathscr {X}}\times \{1,2,\dots ,M\}\), the feature \(x \in \mathop {{\mathbb {R}}}^d\) is sampled from the input space \({\mathscr {X}}\), and the label \(y \in \{1,2,\dots ,M\}\). In the absence of source training data, we leverage the knowledge learned by N pre-trained networks trained on similar but different source datasets and learn the classifier \(\theta _T\), which has a low classification error on the target dataset. A high-level overview of the proposed architecture is given in Fig. 1.
We represent each source model \(\theta _S^i\) as the composition of two transformations:

(1) A feature mapping \(f_S^i: {\mathscr {X}}\rightarrow \mathop {{\mathbb {R}}}^{d_i}\) transforming the input vector into a feature vector of length \(d_i\), where \(d_i\) is the feature dimension of source i.

(2) A classifier \(h_S^i: \mathop {{\mathbb {R}}}^{d_i}\rightarrow \mathop {{\mathbb {R}}}^M\) mapping the feature vector to the output label \(Y^{s_i}\). This forms the hypothesis function.

Thus \(\theta _S^i = h_S^i \circ f_S^i\).
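As a toy illustration of this split (the weights and the tanh feature map here are hypothetical stand-ins, not the actual pre-trained networks), a source model can be wrapped so that its feature extractor and classifier head are separately accessible:

```python
import numpy as np

class SourceModel:
    """Toy source model split as theta = h o f: a feature extractor
    f (X -> R^d) followed by a classifier head h (R^d -> R^M)."""
    def __init__(self, W_feat, W_cls):
        self.W_feat = W_feat   # shape (input_dim, d)
        self.W_cls = W_cls     # shape (d, M)

    def f(self, x):
        """Feature map: input vector -> d-dimensional feature vector."""
        return np.tanh(x @ self.W_feat)

    def h(self, z):
        """Classifier head: feature vector -> M class scores."""
        return z @ self.W_cls

    def __call__(self, x):
        return self.h(self.f(x))   # theta(x) = h(f(x))

rng = np.random.default_rng(0)
model = SourceModel(rng.normal(size=(8, 4)), rng.normal(size=(4, 6)))
x = rng.normal(size=8)
assert model(x).shape == (6,)                      # M = 6 class scores
assert np.allclose(model(x), model.h(model.f(x)))  # composition h o f
```

Only the feature map `f` of each source is used downstream; the head `h` is discarded when building the target classifier.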
For ease of exposition, we consider the feature length to be the same for all the source tasks, i.e. \(d_i=d_j,\ \forall i,j =1,2,\dots ,N\). To build the target classifier, given N source tasks with feature functions \(\{f_S^i\}_{i=1}^N\), we optimize the respective hypothesis functions \(\{g_S^i\}_{i=1}^N\) such that the aggregate maximal correlation of the functions \(f_S^i\) and \(g_S^i\), given by

$$\sum _{i=1}^{N} \mathop {{\mathbb {E}}}_{{\hat{P}}_{X,Y}^T}\big [f_S^i(X)^\top g_S^i(Y)\big ],$$

is maximized, where \({{\hat{P}}_{X,Y}^T}\) is the empirical joint distribution of the target data.
For each source, the optimal correlation function \(g_S^i\) corresponding to the feature function \(f_S^i\), and the corresponding correlation coefficient, could be computed^{14} as

$$g_S^i(y) = \mathop {{\mathbb {E}}}_{{\hat{P}}_{X,Y}^T}\big [f_S^i(X) \mid Y = y\big ], \qquad \rho _S^i = \mathop {{\mathbb {E}}}_{{\hat{P}}_{X,Y}^T}\big [f_S^i(X)^\top g_S^i(Y)\big ].$$
When considering high-dimensional image data, it is interesting to analyze the group correlation of the multivariate data rather than the pairwise correlation between the individual features \(f_S^j(X)\) and \(g_S^j(Y)\). Multivariate correlation analysis may reveal hidden complex interactions affecting the classification task^{23}. Hence, we leverage the multivariate correlation among the group of features extracted by the feature extraction layer to compute the function \(g_S^j(Y)\). In this direction, we apply network maximal correlation, the ACE-based multivariate maximal correlation approach given in Definition 2, which characterizes the multivariate non-linear association between random variables.
We train the ensemble of source classifiers on target samples, optimizing \(g_S^i\) to maximize the aggregate maximal correlation given in Eq. (6).
The correlation value for each pair \((f_S^i,g_S^i)\) gives the strength of association between the functions. Since we consider the group correlation of the features with the target, \(\rho ^i_S\) represents the combined weighted contribution of the feature functions of each source network to the ensemble classifier for the target domain.
Finally, the prediction of the target label on the test data is given by

$$\hat{y} = \arg \max _{y \in \{1,\dots ,M\}} \sum _{i=1}^{N} \rho _S^i \, f_S^i(x)^\top g_S^i(y),$$

where \(\rho _S^i\) is the multivariate correlation coefficient computed for source i.
The procedure for the NMC-based multi-source learning is given in Algorithm 2.
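Under our reading of this procedure, the few-shot weighting and prediction steps can be sketched as follows. This is a simplified NumPy sketch, not the paper's Algorithm 2: it assumes the source features are already extracted and normalized, estimates each \(g_S^i\) by class-conditional means, and uses the resulting empirical correlation as the source weight:

```python
import numpy as np

def fit_ensemble(source_feats, labels, n_classes):
    """For each source i, estimate g_i(y) = E[f_i(X) | Y = y] from the
    few labelled target samples, and the coefficient rho_i as the
    empirical correlation E[f_i(X)^T g_i(Y)]."""
    gs, rhos = [], []
    for F in source_feats:                        # F: (n_samples, d)
        g = np.stack([F[labels == y].mean(axis=0)
                      for y in range(n_classes)])  # (n_classes, d)
        rho = np.mean([F[n] @ g[labels[n]] for n in range(len(labels))])
        gs.append(g)
        rhos.append(max(rho, 0.0))                # guard against negative transfer
    return gs, rhos

def predict(source_feats, gs, rhos):
    """Label prediction: argmax_y of the rho-weighted sum of f_i(x)^T g_i(y)."""
    scores = sum(r * (F @ g.T) for F, g, r in zip(source_feats, gs, rhos))
    return scores.argmax(axis=1)

# Synthetic check: source 1 is informative, source 2 is pure noise.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 10)              # 3 classes, 10 shots each
F1 = np.eye(3)[labels] + 0.1 * rng.normal(size=(30, 3))
F2 = rng.normal(size=(30, 3))
gs, rhos = fit_ensemble([F1, F2], labels, n_classes=3)
preds = predict([F1, F2], gs, rhos)
assert (preds == labels).mean() > 0.9
assert rhos[0] > rhos[1]   # informative source receives the larger weight
```

The synthetic check shows the intended behaviour: the uninformative source receives a small coefficient, so it contributes little to the ensemble prediction.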
Experimental setup
Task and datasets
Facial Emotion Recognition To understand the performance of our approach, we designed a set of experiments on the FER task using four FER datasets: FER2013, RAF-DB, JAFFE, and CAFE, under different source-target settings. Further, we investigated the efficiency of the approach on a novel FER dataset, the Child Facial Emotion Dataset (CFED), curated by the authors. The dataset details are given below:

FER2013 dataset^{55}: The Facial Expression Recognition 2013 (FER2013) dataset is provided by Kaggle and was introduced at the International Conference on Machine Learning (ICML) in 2013^{56}. The dataset contains 35,887 images, each categorized into one of 7 emotion categories. The images in the dataset are registered, so the face appears at the center of each image.

JAFFE dataset^{57}: The Japanese Female Facial Expression (JAFFE) dataset consists of 213 images of different facial expressions from 10 different Japanese female subjects.

RAF-DB dataset^{58}: The RAF-DB dataset has 29,672 real-world images labelled with 7 basic emotions and 12 compound emotions.

CAFE dataset^{59}: The CAFE set features facial expression data of a racially and ethnically diverse group of 2- to 8-year-old children posing six emotional facial expressions and a neutral expression. The CAFE dataset consists of facial expression data from 90 female and 64 male children of varying ethnicities.

CFED dataset: The Child Facial Emotion Dataset (CFED) was collected, annotated, and prepared by our research group. There are limited annotated facial datasets for child facial emotion expression, especially in the global south, where active research in child emotion recognition is limited. The CFED dataset was collected by searching YouTube for child videos under the Creative Commons Licence, which allows the use of the videos for research. The manually retrieved video frames with expressed emotions were annotated by the research team. The dataset consists of 606 images of children of Indian ethnicity representing 6 emotion classes: Anger, Fear, Happy, Neutral, Sadness, and Surprise.
For our experiments, we used the six emotion classes Anger, Fear, Happy, Neutral, Sadness and Surprise from the FER datasets: FER2013 (F), RAF-DB (R), JAFFE (J), and CAFE (C). Each domain has 600 labelled samples for training, i.e. 100 from each class label, and the testing set has 60 samples, i.e. 10 from each class label. Samples from each FER dataset are shown in Fig. 2.
Image classification We further considered an image classification task to demonstrate the generalizability of the approach. For this, we conducted experiments on the benchmark image dataset CIFAR-100. We followed the specific experiment setting proposed by Lee et al.^{14}.

CIFAR-100: The CIFAR-100 dataset has 100 classes containing 600 images each, with 500 training images and 100 testing images per class. For our experiment, we considered ten different source tasks, each consisting of 2 non-overlapping classes. All images were resized to 32×32, and the pixel values were normalized to zero mean and unit variance.
For our experiments, we randomly selected 10 non-overlapping class categories for the source tasks. For training, each source dataset had 500 labeled samples per class. Samples from the CIFAR-100 dataset are shown in Fig. 3.
Experiments
We investigated the performance of the MSTL-MCA approach on multiple FER datasets. We compare our approach against two different baselines commonly followed in the literature. The first is the best single-source adaptation among the sources (best-SS)^{37}, and the other is a unified multi-source model (uni-MS), where all the source data are combined to form a single source and single-source domain adaptation is performed on the target data^{60}. There is very limited work on multi-source domain adaptation without access to the source data. We compared our approach with Lee et al.^{14}, who assume feature independence while performing maximal correlation-based source weighting for multi-source domain adaptation. A further comparison was made with DECISION^{22}, where unsupervised multi-source domain adaptation is addressed using an Information Maximisation loss and clustering-based pseudo-labelling. We conduct experiments by selecting the target dataset in a round-robin fashion from among the set of FER datasets, keeping all the other datasets as source datasets. We extended our experiments by evaluating the method with the novel CFED dataset as the target and the other standard datasets as the source tasks.
We further conducted experiments on image classification with the CIFAR-100 dataset, following the same experiment setting^{14} for the comparison. We considered ten different binary classification tasks as sources, each consisting of 2 non-overlapping classes. The source tasks were trained with 500 samples each to generate the source network weights.
Implementation details
Pre-training In our FER experiments, we constructed 6-way (anger, sad, happy, surprise, neutral, fear) emotion classification on the different FER datasets as the source tasks. The disgust class was discarded as it was not present in all the FER datasets considered. It is important to note that the source data samples were used only for pre-training the source tasks and not for training the target classifier: they create the base knowledge of the source tasks. This distinction highlights the separation between the pre-training and training stages, and the potential benefit of using pre-trained networks for feature extraction. In real-world scenarios, the assumption is that these pre-trained source networks are available for feature extraction but are not trainable. Similarly, for the image classification task, we constructed binary classification tasks from CIFAR-100 classes, selecting 10 non-overlapping pairs of classes for the source tasks.
All images were resized to 32×32, and the pixel values were normalized to zero mean and unit variance. We used the ResNet-18 architecture^{61}, similar to^{21,22}, for pre-training the source tasks, with parameters tuned for the specific dataset considered. We repeated all the experiments with the LeNet architecture as well, which is a simple convolutional neural network architecture. Cross-entropy loss was used as the loss function, and the stochastic gradient descent (SGD) optimizer was used with a learning rate of 0.001, momentum of 0.9, and 100 epochs. These networks were treated as black-box pre-trained networks for the multivariate correlation analysis: features are extracted from the pre-trained networks and further trained on target samples to compute the correlation functions and coefficients for each set of features, which are then used for the classification of the target test samples. Here, black box implies that we do not have control over the training of the source networks and can only extract features from them.
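The input normalization described above (zero mean and unit variance) can be sketched as follows. This is a per-image NumPy sketch under our assumptions; the actual pipeline may instead normalize per channel or with dataset-level statistics:

```python
import numpy as np

def standardize(images):
    """Normalize each image to zero mean and unit variance.
    `images` has shape (N, H, W, C) with raw pixel values."""
    imgs = images.astype(np.float32)
    mean = imgs.mean(axis=(1, 2, 3), keepdims=True)  # per-image mean
    std = imgs.std(axis=(1, 2, 3), keepdims=True)    # per-image std
    return (imgs - mean) / (std + 1e-8)              # avoid division by zero

batch = np.random.default_rng(0).integers(0, 256, size=(4, 32, 32, 3))
out = standardize(batch)
assert out.shape == (4, 32, 32, 3)
assert np.allclose(out.mean(axis=(1, 2, 3)), 0.0, atol=1e-4)
assert np.allclose(out.std(axis=(1, 2, 3)), 1.0, atol=1e-3)
```

Normalizing inputs this way keeps the feature scales comparable across the different source datasets before the correlation analysis.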
Training Once the pre-trained networks are available, the ensemble network is trained with 1, 5, 10, and 20 samples from the target task, during which the multivariate correlation functions and the correlation coefficients are computed. Training was done on Ubuntu Server 20.04 LTS with an Nvidia RTX 3090 GPU. We used the PyTorch framework for all our implementations. To ensure reproducibility and to support open source, the code and the CFED dataset will be made available on request.
Results and analysis
Facial emotion recognition
Our approach focuses on multi-source domain adaptation without the need for source data, while also addressing the challenge of limited target data, where only a small number of target samples are available for training. It should be emphasized that in this method, the source data is utilized solely for pre-training the source models. Most recent studies in multi-source domain adaptation, to the best of our knowledge, require labeled data from both source and target domains, as well as a mechanism for learning domain-invariant representations. For a fair evaluation, we compared our method with MCW^{14}, which, like our approach, addresses source-free multi-source domain adaptation. Additionally, we compared our supervised approach with the DECISION^{22} algorithm, which also tackles the problem of multi-source domain adaptation, even though it is an unsupervised approach.
We report our results on the FER datasets in Table 1. Our method consistently performs better across the different dataset settings and tasks. We observe a mean improvement of \(\sim 12\%\) over the best single-source performance and of \(\sim 15\%\) over uniMS (Table 1, Average column). Further, in cases of negative transfer, indicated by (\(^*\)), our approach performs better, showing that it is robust to negative transfer. Negative transfer occurs when knowledge transferred from a less related source adversely affects target performance; it is evident in cases where the best single-source model outperforms the unified multi-source model. Compared to the MCW method, MSTL-MCA yields an improvement of 3.74%. This signifies that group correlation among the features captures the differentiating features in multi-source adaptation, and hence the classification accuracy is higher. Although DECISION is an unsupervised algorithm, it addresses multi-source adaptation in a similar setting; compared with DECISION, we obtained an improved performance of \(\sim 11\%\).
Further, even on the newly curated CFED dataset, our proposed approach confirms its efficiency with similar performance trends. The results for the CFED dataset, reported for 20 shots, are given in Table 2. Relative to the best-performing baseline, MCW, our method shows an improvement of 7%, and of \(\sim 15\%\) over DECISION. We also ran the experiments for different numbers of shots; the results, given in Table 3, show that the proposed method performs better in few-shot settings. This analysis illustrates that our algorithm's performance improves significantly up to 20 shots, after which it gradually converges: at this point, the jointly trained model has received enough samples to learn its parameters, and adding more samples no longer yields significant knowledge gains.
Maximum correlation analysis To study the effect of multivariate maximal correlation in regulating the flow of knowledge from the source to the target task, we conducted a correlation analysis between source and target pairs. We considered CAFE, FER2013, RAF-DB, and JAFFE as the source datasets and CFED as the target dataset, and computed the correlation coefficient corresponding to each source task over 20 runs. We then compared it with the correlation weighting of the sources computed by MCW^{14} under the same settings. The correlation coefficients for the different source tasks using MSTL-MCA and MCW are given in Fig. 4. The correlation weighting of our approach for each source is clustered closely around the median, whereas the weights learned by MCW vary more with the input samples under consideration. This shows that our approach produces more reliable and accountable results by consistently focusing on the relevant source knowledge across runs, and accounts for its ability to outperform state-of-the-art methods, as seen in Table 2.
For further analysis, we removed the source task with the highest correlation value given by our algorithm, i.e. JAFFE (J), and computed the accuracy for the adaptation task. The average accuracy dropped to 41.99%, a relative drop of \(\sim 7\%\). Likewise, removing the task with the lowest weight given by our algorithm, CAFE (C), while keeping the other tasks dropped the accuracy to 44.23%, a relative drop of \(\sim 2\%\). From this we infer that removing highly correlated sources leads to a significant drop in accuracy, showing that source tasks with high correlation contribute more to learning the target classifier. We also compared the effect of multivariate correlation in the classification task against the MCW^{14} approach, which uses binary correlation weighting. The results in Table 2 show that multivariate group correlation captures the relevant source knowledge in a consistent and reliable way, eventually leading to better performance.
Statistical Analysis To further validate our results, we performed statistical analysis. For the null hypothesis, we assume that our proposed model performs similarly to the other algorithms, and we consider the average accuracy of all algorithms. We drew 20 different samples for all classifiers on the CAFE dataset and then performed the Kruskal-Wallis H-test (also called the one-way ANOVA on ranks) and the Friedman test. The Kruskal-Wallis H statistic was 65.38, which is statistically significant with a very small \(p = 9.33 \times 10^{-13}\). Similarly, the Friedman test gave a statistic of 72.37 with \(p = 3.28 \times 10^{-14}\). As the p-value is very small in both tests and \(p < 0.05\), we can safely reject the null hypothesis. Hence, we can infer that the performances of the algorithms are not equivalent.
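Both tests can be run with `scipy.stats`. The sketch below shows the mechanics on synthetic accuracy samples (placeholders, not the paper's measurements, and three classifiers instead of six for brevity):

```python
import numpy as np
from scipy.stats import kruskal, friedmanchisquare

rng = np.random.default_rng(0)
# Hypothetical accuracies: 20 runs for each of three classifiers.
acc_a = rng.normal(0.55, 0.02, 20)
acc_b = rng.normal(0.48, 0.02, 20)
acc_c = rng.normal(0.40, 0.02, 20)

# Kruskal-Wallis treats the runs as independent samples;
# the Friedman test treats each run as a repeated measurement.
h_stat, p_kw = kruskal(acc_a, acc_b, acc_c)
f_stat, p_fr = friedmanchisquare(acc_a, acc_b, acc_c)

# Reject the null hypothesis (all classifiers perform alike) when p < 0.05.
reject = (p_kw < 0.05) and (p_fr < 0.05)
```

With clearly separated accuracies, as here, both p-values are far below 0.05 and the null hypothesis is rejected.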
Given that the null hypothesis was rejected, we have two scenarios for a post-hoc test^{62}: (1) the Nemenyi post-hoc test, which compares all algorithms with each other, and (2) the Bonferroni-Dunn post-hoc test, which compares all algorithms against a control algorithm (i.e., the proposed method). Both post-hoc tests were performed with significance levels \(\alpha = 0.05\) and \(\alpha = 0.1\), as suggested in^{62}.
To perform both post-hoc tests, we calculated the average rank of each algorithm, as shown in Table 4. The average (or fractional) rank denotes the algorithm's performance: a lower-ranked algorithm performs better than a higher-ranked one. It is calculated by taking the mean of the ordinal rankings, which are obtained by simply ordering the accuracies of the respective algorithms on each run. The results in Table 4 show that our proposed method has a lower rank than the other methods and hence outperforms them.
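The fractional ranking above can be computed with `scipy.stats.rankdata`. The accuracy table below is a small hypothetical example, not the paper's data:

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical accuracies: rows are runs, columns are algorithms.
acc = np.array([
    [0.62, 0.55, 0.48],
    [0.60, 0.57, 0.50],
    [0.64, 0.64, 0.51],
])

# Rank within each run, with rank 1 for the highest accuracy; rankdata
# assigns rank 1 to the smallest value, so rank the negated accuracies.
# Ties receive fractional (average) ranks, e.g. (1 + 2) / 2 = 1.5.
ranks = np.vstack([rankdata(-row) for row in acc])
avg_rank = ranks.mean(axis=0)  # lower average rank = better algorithm
```

In the third run the first two algorithms tie, so each gets rank 1.5; averaging over runs gives the fractional ranks reported in a table like Table 4.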
We then computed the critical differences (CD) for the Nemenyi and Bonferroni-Dunn tests, plotted in Fig. 5. In the CD diagram, algorithms with similar performance are grouped together, and the lowest (best) ranked algorithms appear on the right side of the graph. Figure 5 shows the graphical representation of the classification accuracies for our problem across the six methods. The results reveal that UNIMS, BESTSS, and DECISION^{22} perform significantly worse than MSTL-MCA (the proposed method) and MCW^{14}. Further, the MSTL-MCA variants (both LeNet and ResNet-18) have the lowest ranks of all, implying that MSTL-MCA outperforms the other approaches.
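For the Nemenyi test, Demšar^{62} defines the critical difference as \(CD = q_\alpha \sqrt{k(k+1)/(6N)}\) for \(k\) algorithms compared over \(N\) samples: two algorithms differ significantly when their average ranks differ by more than this value. A sketch, where the critical value \(q_{0.05} \approx 2.850\) for \(k = 6\) is taken from Demšar's tables:

```python
import math

def critical_difference(q_alpha, k, n):
    """Nemenyi critical difference for k algorithms over n samples
    (Demsar, 2006): q_alpha * sqrt(k * (k + 1) / (6 * n))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))

# k = 6 methods, n = 20 runs, q_0.05 ~ 2.850 (from Demsar's table).
cd = critical_difference(2.850, 6, 20)
```

Methods whose average ranks lie within `cd` of each other are joined into one group in the CD diagram of Fig. 5.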
Image classification
The results of multi-source adaptation for image classification on the CIFAR-100 dataset are given in Table 5. Our method shows a similar performance on the image classification task as on the FER task. It performs better, with an improvement of \(\sim 6\%\) over the state-of-the-art method MCW, and shows comparable results with DECISION.
Elbow point analysis We performed an elbow point analysis on the CIFAR-100 dataset to find the optimal k-value for the k-shot learning approach we used. As Fig. 6 shows, after 20 shots the rate of growth in accuracy with respect to the number of shots is significantly lower. We can therefore place the elbow point, or knee of the curve, at 20 shots for the CIFAR-100 dataset; even with a small number of samples, the algorithm is capable of training the classifier. This shows that our approach is useful in applications, including FER, where large training datasets are unavailable.
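One simple way to locate such an elbow point is to flag the first shot count after which the marginal accuracy gain per additional shot falls below a threshold. This is a generic knee heuristic, not the authors' procedure, and the shot counts and accuracies below are illustrative placeholders:

```python
def elbow_point(shots, accs, min_gain=0.005):
    """Return the first shot count after which the per-shot accuracy
    gain drops below `min_gain` (a simple knee heuristic)."""
    for i in range(1, len(shots)):
        gain = (accs[i] - accs[i - 1]) / (shots[i] - shots[i - 1])
        if gain < min_gain:
            return shots[i - 1]
    return shots[-1]

# Illustrative curve: accuracy gains flatten out after 20 shots.
k = elbow_point([1, 5, 10, 20, 30], [0.30, 0.40, 0.46, 0.52, 0.525])
```

On this curve the gain from 20 to 30 shots is only 0.0005 per shot, so the heuristic returns 20, matching the knee visible in a plot like Fig. 6.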
Maximum correlation analysis We conducted a maximal correlation analysis on the CIFAR-100 dataset with the same settings given in Section "Experimental setup". The weights for the source tasks on the CIFAR-100 dataset are given in Fig. 7. As in the FER task, the correlation weighting of our approach is consistent across the different runs, as indicated by the lower spread of the weights.
Conclusion
In this work, we proposed a multi-source transfer learning approach that leverages the multivariate maximal correlation of features extracted from an ensemble of source networks to build a target classifier with unseen classes. We measure the multivariate nonlinear association among the features of the source networks using Network Maximal Correlation and optimize the aggregate multivariate maximal correlation over the source tasks to learn the target classifier. The results show that capturing the group correlation of the features with the output, as proposed, significantly improves the learning of the target classifier.
We demonstrated the efficacy of our approach in facial emotion recognition using benchmark datasets, and verified its performance on the novel CFED dataset curated from YouTube images. We investigated the proposed method on a cross-cultural target classification task by using the different FER datasets as sources and our CFED dataset, which consists of facial emotion data of children of Indian ethnicity and has limited samples, as the target. We then performed an image classification task using a standard image dataset, CIFAR-100. We have also shown, through k-shot learning experiments with fewer than ten shots, that the proposed method performs convincingly well even on smaller target datasets.
The proposed method combines the knowledge from multiple source networks in an effective and computationally efficient manner and can be leveraged where training data is limited. Further, since the knowledge gained by the source classifiers is leveraged to build the target classifier without direct access to the input data, the approach ensures improved data privacy, which is paramount for facial emotion expression data. The proposed method can also be generalized to other domains when applying transfer learning. The performance of the approach with heterogeneous source tasks carrying multimodal information can be explored in future work.
Data availability
The image datasets used are publicly available, except for the CFED dataset. Owing to privacy concerns, the facial expression data (CFED) cannot be made publicly available. However, to ensure the transparency and reproducibility of the research, interested researchers may contact the corresponding author (jainendra@iiitd.ac.in) to discuss potential access to a sanitized version of the dataset in compliance with the applicable confidentiality regulations and ethics requirements. The links to access the public datasets are given in Table 6.
References
Ekman, P. Facial expression and emotion. Am. Psychol. 48, 384 (1993).
Esmaeili, V., Mohassel Feghhi, M. & Shahdi, S. O. A comprehensive survey on facial microexpression: Approaches and databases. Multimed. Tools Appl. 81, 40089–40134 (2022).
Schoneveld, L., Othmani, A. & Abdelkawy, H. Leveraging recent advances in deep learning for audiovisual emotion recognition. Pattern Recogn. Lett. 146, 1–7 (2021).
Shanok, N. A., Jones, N. A. & Lucas, N. N. The nature of facial emotion recognition impairments in children on the autism spectrum. Child Psychiatry Hum. Dev. 50, 661–667 (2019).
Bennett, C. C. & Šabanović, S. Deriving minimal features for humanlike facial expressions in robotic faces. Int. J. Soc. Robot. 6, 367–381 (2014).
Breazeal, C. Emotion and sociable humanoid robots. Int. J. Hum Comput Stud. 59, 119–155 (2003).
Hyde, J., Carter, E. J., Kiesler, S. & Hodgins, J. K. Assessing naturalness and emotional intensity: A perceptual study of animated facial motion. In Proceedings of the ACM Symposium on Applied Perception, 15–22 (2014).
Zhan, C., Li, W., Ogunbona, P. & Safaei, F. A real-time facial expression recognition system for online games. Int. J. Comput. Games Technol. https://doi.org/10.1155/2008/542918 (2008).
Kaur, A., Ghosh, B., Singh, N. D. & Dhall, A. Domain adaptation based topic modeling techniques for engagement estimation in the wild. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), 1–6, https://doi.org/10.1109/FG.2019.8756511 (2019).
Assari, M. A. & Rahmati, M. Driver drowsiness detection using face expression recognition. In 2011 IEEE International Conference on Signal and Image Processing Applications (ICSIPA), 337–341 (IEEE, 2011).
Li, S. & Deng, W. Deep facial expression recognition: A survey. IEEE Trans. Affect. Comput. 13(3), 1195–1215 (2020).
Debnath, T. et al. Four-layer ConvNet to facial emotion recognition with minimal epochs and the significance of data diversity. Sci. Rep. 12, 6991 (2022).
Xu, R., Chen, Z., Zuo, W., Yan, J. & Lin, L. Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3964–3973 (2018).
Lee, J., Sattigeri, P. & Wornell, G. Learning new tricks from old dogs: Multi-source transfer learning from pre-trained networks. Adv. Neural. Inf. Process. Syst. 32, 4370–4380 (2019).
Nguyen, V.-A., Nguyen, T., Le, T., Tran, Q. H. & Phung, D. STEM: An approach to multi-source domain adaptation with guarantees. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9352–9363 (2021).
Venkat, N. et al. Your classifier can secretly suffice multi-source domain adaptation. Adv. Neural. Inf. Process. Syst. 33, 4647–4659 (2020).
Wen, J., Greiner, R. & Schuurmans, D. Domain aggregation networks for multi-source domain adaptation. In International Conference on Machine Learning, 10214–10224 (PMLR, 2020).
Yue, X., Zheng, Z., Das, H. P., Keutzer, K. & Vincentelli, A. S. Multi-source few-shot domain adaptation. arXiv preprint arXiv:2109.12391 (2021).
Zhao, S. et al. Multi-source distilling domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, 12975–12983 (2020).
Amosy, O. & Chechik, G. Coupled training for multi-source domain adaptation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 420–429 (2022).
Liang, J., Hu, D. & Feng, J. Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation. In International Conference on Machine Learning, 6028–6039 (PMLR, 2020).
Ahmed, S. M., Raychaudhuri, D. S., Paul, S., Oymak, S. & Roy-Chowdhury, A. K. Unsupervised multi-source domain adaptation without access to source data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10103–10112 (2021).
Feizi, S., Makhdoumi, A., Duffy, K., Kellis, M. & Medard, M. Network maximal correlation. IEEE Trans. Netw. Sci. Eng. 4, 229–247 (2017).
Zhao, K., Chu, W.-S. & Zhang, H. Deep region and multi-label learning for facial action unit detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3391–3399 (2016).
Fan, Y., Lu, X., Li, D. & Liu, Y. Video-based emotion recognition using CNN-RNN and C3D hybrid networks. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, 445–450 (2016).
Kim, D. H., Baddar, W. J., Jang, J. & Ro, Y. M. Multi-objective based spatio-temporal feature representation learning robust to expression intensity variations for facial expression recognition. IEEE Trans. Affect. Comput. 10, 223–236 (2017).
Mollahosseini, A., Chan, D. & Mahoor, M. H. Going deeper in facial expression recognition using deep neural networks. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), 1–10 (IEEE, 2016).
Lopes, A. T., De Aguiar, E., De Souza, A. F. & Oliveira-Santos, T. Facial expression recognition with convolutional neural networks: Coping with few data and the training sample order. Pattern Recogn. 61, 610–628 (2017).
Li, Y., Zeng, J., Shan, S. & Chen, X. Occlusion aware facial expression recognition using CNN with attention mechanism. IEEE Trans. Image Process. 28, 2439–2450 (2018).
Knyazev, B., Shvetsov, R., Efremova, N. & Kuharenko, A. Convolutional neural networks pretrained on large face recognition datasets for emotion classification from video. arXiv preprint arXiv:1711.04598 (2017).
Aly, S. F. & Abbott, A. L. Facial emotion recognition with varying poses and/or partial occlusion using multi-stage progressive transfer learning. In Scandinavian Conference on Image Analysis, 101–112 (Springer, 2019).
Ngo, T. Q. & Yoon, S. Facial expression recognition on static images. In International Conference on Future Data and Security Engineering, 640–647 (Springer, 2019).
Gupta, R. & Ratinov, L.A. Text categorization with knowledge transfer from heterogeneous data sources. In AAAI, 842–847 (2008).
Christodoulidis, S., Anthimopoulos, M., Ebner, L., Christe, A. & Mougiakakou, S. Multi-source transfer learning with convolutional neural networks for lung pattern analysis. IEEE J. Biomed. Health Inform. 21, 76–84. https://doi.org/10.1109/JBHI.2016.2636929 (2017).
Meyer, J. Multi-task and transfer learning in low-resource speech recognition. Ph.D. thesis, The University of Arizona (2019).
Mansour, Y., Mohri, M. & Rostamizadeh, A. Domain adaptation with multiple sources. Adv. Neural Inf. Process. Syst. 21 (2008).
Guo, H., Pasunuru, R. & Bansal, M. Multi-source domain adaptation for text classification via DistanceNet-Bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, 7830–7838 (2020).
Zhao, H. et al. Adversarial multiple source domain adaptation. Adv. Neural. Inf. Process. Syst. 31, 8559–8570 (2018).
Finn, C., Abbeel, P. & Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 1126–1135 (PMLR, 2017).
Wang, B., Mendez, J., Cai, M. & Eaton, E. Transfer learning via minimizing the performance gap between domains. In Advances in Neural Information Processing Systems Vol. 32 (eds Wallach, H. et al.) (Curran Associates Inc., 2019).
Shah, D. J. Multi-source domain adaptation with mixture of experts. Ph.D. thesis, Massachusetts Institute of Technology (2019).
Qin, X., Wang, J., Chen, Y., Lu, W. & Jiang, X. Domain generalization for activity recognition via adaptive feature fusion. ACM Trans. Intell. Syst. Technol. https://doi.org/10.1145/3552434 (2022).
Kong, Y. S., Suresh, V., Soh, J. & Ong, D. C. A systematic evaluation of domain adaptation in facial expression recognition. arXiv preprint arXiv:2106.15453 (2021).
Bisogni, C., Castiglione, A., Hossain, S., Narducci, F. & Umer, S. Impact of deep learning approaches on facial expression recognition in healthcare industries. IEEE Trans. Ind. Inf. 18, 5619–5627. https://doi.org/10.1109/TII.2022.3141400 (2022).
Nguyen, C. V., Le, K. H., Tran, A. M., Pham, Q. H. & Nguyen, B. T. Learning for amalgamation: A multi-source transfer learning framework for sentiment classification. Inf. Sci. 590, 1–14. https://doi.org/10.1016/j.ins.2021.12.059 (2022).
Fu, Y. & Liu, Y. Contrastive transformer based domain adaptation for multi-source cross-domain sentiment classification. Knowl.-Based Syst. 245, 108649. https://doi.org/10.1016/j.knosys.2022.108649 (2022).
Zhang, X., Pan, F., Wang, W. & Nobel, A. Mining nonredundant high order correlations in binary data. In Proceedings of the VLDB Endowment International Conference on Very Large Data Bases, vol. 1, 1178 (NIH Public Access, 2008).
Hirschfeld, H. O. A connection between correlation and contingency. In Mathematical Proceedings of the Cambridge Philosophical Society, vol. 31, 520–524 (Cambridge University Press, 1935).
Gebelein, H. Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung. ZAMM J. Appl. Math. Mech./Zeitschrift für Angewandte Mathematik und Mechanik 21, 364–379 (1941).
Rényi, A. On measures of dependence. Acta Mathematica Academiae Scientiarum Hungarica 10, 441–451 (1959).
Huang, S.-L., Makur, A., Zheng, L. & Wornell, G. W. An information-theoretic approach to universal feature selection in high-dimensional inference. In 2017 IEEE International Symposium on Information Theory (ISIT), 1336–1340 (IEEE, 2017).
Nguyen, H. V., Müller, E., Vreeken, J., Efros, P. & Böhm, K. Multivariate maximal correlation analysis. In International Conference on Machine Learning, 775–783 (PMLR, 2014).
Reshef, D. N. et al. Detecting novel associations in large data sets. Science 334, 1518–1524 (2011).
Hotelling, H. Relations between two sets of variates. In Breakthroughs in statistics, 162–190 (Springer, 1992).
Goodfellow, I. J. et al. Challenges in representation learning: A report on three machine learning contests. In International Conference on Neural Information Processing, 117–124 (Springer, 2013).
ICML’13: Proceedings of the 30th International Conference on International Conference on Machine Learning  Volume 28 (JMLR.org, 2013).
Lyons, M. J., Kamachi, M. & Gyoba, J. Coding facial expressions with Gabor wavelets (IVC special issue). arXiv preprint arXiv:2009.05938 (2020).
Li, S., Deng, W. & Du, J. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2584–2593 (IEEE, 2017).
LoBue, V. & Thrasher, C. The child affective facial expression (cafe) set: Validity and reliability from untrained adults. Front. Psychol. 5, 1532 (2015).
Zhao, H. et al. Multiple source domain adaptation with adversarial learning (2018).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778 (2016).
Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006).
Acknowledgements
This research work is funded by the Startup Research Grant (Ref. ID.: SRG/2020/002454) of the Science and Engineering Research Board, Government of India and is partly supported by the Infosys Center for AI and the Center for Design and New Media (A TCS Foundation Initiative supported by Tata Consultancy Services) at IIITDelhi. We acknowledge Shubhangi Butta and other HMI Lab members for their valuable feedback and insights throughout this research.
Author information
Authors and Affiliations
Contributions
A.B. was responsible for conceptualizing the method, conceiving and conducting the experiments, documenting and analyzing the results, and preparing the manuscript. A.S. and P.B. implemented the method and conducted the experiments. A.B. and J.S. worked together on problem formulation, and J.S. supervised the experiments and revised the manuscript. All authors participated in reviewing the manuscript.
Corresponding author
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
B, A., Sarkar, A., Behera, P.R. et al. Multi-source transfer learning for facial emotion recognition using multivariate correlation analysis. Sci Rep 13, 21004 (2023). https://doi.org/10.1038/s41598-023-48250-x