Context aware semantic adaptation network for cross domain implicit sentiment classification

Cross-domain sentiment classification can be decomposed into two steps: extracting the text representation and reducing the domain discrepancy. Existing methods mostly focus on learning domain-invariant information and rarely consider domain-specific semantic information, which could also help cross-domain sentiment classification. In addition, traditional adversarial-based models merely align the global distribution and ignore maximizing the class-specific decision boundaries. To solve these problems, we propose a context-aware semantic adaptation (CASA) network for cross-domain implicit sentiment classification (ISC). CASA can provide more semantic relationships and an accurate understanding of the emotion-changing process for ISC tasks lacking explicit emotion words. (1) To obtain inter- and intrasentence semantic associations, our model builds a context-aware heterogeneous graph (CAHG), which aggregates the intrasentence dependency information and the intersentence node interaction information, followed by an attention mechanism that retains high-level domain-specific features. (2) Moreover, we construct a new multigrain discriminator (MGD) to effectively reduce the interdomain distribution discrepancy and improve intradomain class discrimination. Experimental results demonstrate the effectiveness of the different modules compared with existing models on the Chinese implicit emotion dataset and four public explicit datasets.

www.nature.com/scientificreports/
Prior studies on cross-domain sentiment classification have mainly focused on learning domain-invariant features whose distribution is similar in the source and target domains 4. These methods attempt to reduce the discrepancy between domain-specific latent feature representations. Inspired by this idea, most existing adversarial-based methods, e.g., the domain adversarial neural network (DANN) 5, reduce feature differences by fooling a domain discriminator and have achieved promising results 6,7. However, to achieve explicit-to-implicit positive transfer, these methods still have two major inherent drawbacks that need to be addressed:
• Existing studies mostly focus on learning domain-invariant information (e.g., '喜欢 (like)', '坏 (bad)', '差 (weak)') and rarely consider the use of domain-specific semantic information (e.g., '蛋炒饭 (the rice fried with eggs)', '抵抗力 (resistance)'), which is also helpful for cross-domain implicit sentiment classification 7,8. Figure 1 shows that when source domain-specific words appear in the target domain, the semantic knowledge learned from the source domain helps the target domain classification.
• Traditional adversarial-based models merely minimize the marginal distribution discrepancy between the two domains and ignore maximizing the class-specific decision boundaries. As shown in Fig. 2 (DANN), the features near the decision boundary may be ambiguous and even tangled together under traditional domain discriminator training, which hinders adaptation performance.
To tackle the limitations identified above, we use graph convolutional networks (GCNs). GCNs have a multilayer architecture in which each layer aggregates the information of nodes in the graph structure using the features of their immediate neighbors. However, sequential free text is unstructured data. Therefore, GCN-based text learning must first learn a graph representation from the free text before graph convolution. Unlike sequence learning models, GCNs can directly represent complex structured data, so a GCN has the potential to capture domain-specific semantic information with its layers. Recently, GCN models have gained widespread attention and have been successfully deployed on text-word relationships 9-11 and explicit sentiment analysis 12,13. However, these graph-based models only consider intrasentence hierarchical dependency relationships and ignore intersentence semantic associations.
Therefore, in this paper, we propose a novel context-aware semantic adaptation network (CASA) for cross-domain implicit sentiment classification via GCNs. To obtain inter- and intrasentence semantic associations, we build a context-aware heterogeneous graph (CAHG). A CAHG is built for each document by regarding tokens and sentences as nodes (hence a heterogeneous graph). The intrasentence propagation is constrained by the syntactic dependency tree, and the intersentence propagation is constrained by the sentence order and by term frequency-inverse document frequency (TF-IDF) local token co-occurrence information. The information propagates between and within sentences via the GCN layers, followed by an attention mechanism that keeps high-level domain-specific features. We also construct a multigrain discriminator (MGD), which is imposed during domain adaptation to minimize the domain distribution discrepancy and maximize class discrimination. The domain adaptation layer makes the source and target domains inseparable through adversarial training, reducing the representation distribution gap between source and target domain data in a coarse-grained manner. The class adaptation layer uses a classifier to judge whether samples from different domains are class consistent, thereby distinguishing samples of different classes. Figure 2 illustrates the difference between the traditional domain-adversarial training of DANN 5 and MGD.
In short, the main contributions of this paper are as follows:
1. We propose a new transfer learning model, CASA, for the implicit sentiment classification task via GCNs, which is the first attempt to transfer explicit sentiment information to implicit sentiment.
2. Our model provides a context-aware heterogeneous graph, which can effectively extract inter- and intrasentence semantic information. Moreover, CASA improves the model's generalization ability for implicit classification tasks and increases the discrimination of each sentiment polarity in the domain.
3. We evaluate CASA on the Chinese implicit sentiment analysis dataset (SMP-ECISA 2019). CASA outperforms existing models in four different source domains. We also provide a visualization to demonstrate that CAHG can capture domain-specific information and that MGD can make features near decision boundaries more distinguishable.

Related works
Sentiment analysis. The existing sentiment calculation and sentiment analysis methods can be divided into three categories: knowledge-based methods, statistical methods and hybrid methods 14 .
Knowledge-based methods are popular because of their simplicity and ease of use, but their effectiveness is limited mainly by the depth and breadth of the established knowledge base. Statistical methods based on machine learning and deep learning have been widely used in Chinese sentiment classification, such as a fine-grained sentiment analysis framework 15, a multi-label sentiment analysis model 16, and aspect-level sentiment analysis based on reinforcement learning 17. In addition, more and more scholars have recognized the particularity of Chinese characters and tried to model Chinese radicals 18-22. Hybrid methods aim to better describe the rules of emotional expression and to realize the machine's perception of semantics 23. The CASA model in this paper considers contextual semantic perception and introduces cross-domain explicit emotional knowledge.
Attention mechanisms show good performance in sentiment analysis tasks. They improve the interpretability of neural networks by revealing where the model focuses, as exemplified by recent research on attention-based sentiment analysis 24-27.
Various sentiment analysis tasks usually focus on binary classification (positive and negative), which cannot fully describe emotions. In contrast, Wang proposed a multi-level emotion perception method with contradiction processing 28. However, these sentence-level sentiment analyses cannot be directly applied to implicit sentiment tasks because, in implicit sentiment tasks, the emotion of the target sentence holds different polarities in different contexts.

Implicit sentiment analysis and GCNs. Models based on RNNs 29,30 and CNNs 31 in deep neural networks are widely used in sentiment classification tasks. In RNN-based models, an attention mechanism is usually introduced because each word in the text contributes differently to the classification task 32-34. CNN-based models 35,36 use character-level CNNs to extract semantic information from text. However, these models lack an effective mechanism to capture the information in the dependency tree structure. In 37-39, structural and semantic information extracted by LSTM or BiLSTM from the tree structure of sentences, such as a dependency tree or grammar tree, was used for the sentiment classification task. Although Tree-LSTM can extract more accurate semantic information from text, it is difficult to parallelize and requires a longer time to train. Subsequently, CNNs were introduced into the tree structure information encoding process by 40,41. In their work, a phrase structure tree and a syntactic dependency tree were used to encode the semantic information of the target sentence and the context, respectively. However, in the above tree-based convolutional neural network models, sentences are considered to be independent of each other, so information regarding the relationship between sentences is lost. To solve this problem, our model extracts semantic information via GCNs 9.
GCN models have attracted widespread attention and have been successfully deployed in NLP tasks. Yao et al. 10 build a corpus-level text graph from word-word co-occurrence and document-word relations for text classification. Zhang et al. 42 introduce a TreeGCN, where the GCN is used to encode the dependency syntactic structure. Zhang et al. 12 presented an aspect-based GCN to demonstrate that GCNs can capture long-range word dependencies. Zhang et al. 43 44 discusses the application of transfer learning in deep neural networks for single domains. It draws an important conclusion: adding fine-tuning improves the performance of the deep transfer network.
In the cross-domain scenario, the difference between the probability distributions of the source domain data and the target domain data is significant, and the use of fine-tuning alone may lead to negative transfer. The purpose of domain adaptation, also known as cross-domain learning, is to reduce the difference between the distribution representations of the source and target domain data. A direct way to do this is to measure the distance between the distributions with a distribution distance metric and to reduce that distance during model training. However, such distance measurements are difficult to calculate, and many methods based on domain adversarial models have been proposed instead 5,6. Existing domain adversarial models for sentiment analysis focus on explicit sentiment classification or aspect-level sentiment classification 45,46 without considering implicit situations. Due to data scarcity and the task's value, transfer learning is more urgent for implicit sentiment analysis. To the best of our knowledge, CASA is the first explicit-to-implicit transfer learning model. MGD improves the model's generalization ability for implicit emotion classification and the discrimination of each sentiment polarity.

Problem definition and approach overview. Domain definition. A domain $D$ consists of a marginal probability distribution $P(X)$ over a feature space $X$.
Task definition. Given the domain $D$, a task is composed of a classifier $f(x)$ and a class label set $Y$.
Purpose. Given training data from the source domain $D_s = \{(x_i, y_i)\}_{i=1}^{n_s}$ and the target domain $D_t = \{(x_{n_s+i}, y_{n_s+i})\}_{i=1}^{n_t}$, we assume that the probability distributions $P(X_s)$ and $P(X_t)$ differ. Under this setting, the purpose is to learn a prediction function $f(x): X \to Y$ that classifies the target examples correctly at test time.
An overview of the CASA framework is shown in Fig. 3. The proposed approach is divided into two steps: context modeling and transfer learning. Specifically, in the context modeling phase, we first build the CAHG to extract the semantic relationships unrelated to explicit and implicit sentiment domains. In the transfer learning step, MGD realizes fine-grained adaptation through a domain discriminator and a label consistency discriminator so that the model has stronger generalization ability. We present the details of the different components as well as the training process in the following sections.
Bi-GRU encoder. First, we use the bidirectional gated recurrent unit (Bi-GRU) 47
GCNs over the context-aware heterogeneous graph. To capture the more valuable information in the text, we construct a document-level heterogeneous graph, which retains intrasentence dependency information and also represents intersentence relationships. We define $G = (V, E)$ as a text graph, where $|V| = n$ is the number of nodes, and $V$ and $E$ represent the node set and the edge set of graph $G$, respectively. We construct $G$ based on the token-token dependency relationship, TF-IDF, and sentence order. The token's TF-IDF determines the edge weight between a sentence node and a token node in that sentence. We define TF as the frequency with which a token appears in the sentence, and IDF as the logarithm of the ratio of the total number of sentences in the text to the number of sentences that contain the token. The formal definitions are given in formulas (1)-(3).
Here, $S_w$ and $S$ denote the number of occurrences of the token in the input sentence and the total number of tokens in that sentence, respectively. $Doc_w$ and $Doc$ denote the number of sentences in the input text that contain the token and the total number of sentences, respectively.
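As a minimal sketch of the sentence-token edge weighting described above, the following Python computes TF as $S_w / S$ and IDF as $\log(Doc / Doc_w)$, read directly from the verbal definitions. The function name `sentence_token_tfidf`, the pre-tokenized input format, and the absence of any smoothing term are assumptions made for illustration; the original formulas (1)-(3) may differ in detail.

```python
import math
from collections import Counter

def sentence_token_tfidf(sentences):
    """Return one dict per sentence mapping token -> TF-IDF edge weight.

    sentences: list of token lists, one list per sentence of a document.
    Assumption: TF = S_w / S and IDF = log(Doc / Doc_w), with no smoothing.
    """
    n_sentences = len(sentences)  # Doc: total number of sentences
    # Doc_w: number of sentences in the document that contain each token.
    sentence_freq = Counter()
    for tokens in sentences:
        sentence_freq.update(set(tokens))

    weights = []
    for tokens in sentences:
        counts = Counter(tokens)  # S_w for every token in this sentence
        total = len(tokens)       # S: number of tokens in this sentence
        weights.append({
            tok: (cnt / total) * math.log(n_sentences / sentence_freq[tok])
            for tok, cnt in counts.items()
        })
    return weights

# Hypothetical three-sentence document, already tokenized.
doc = [["service", "slow"], ["food", "slow", "cold"], ["never", "again"]]
edge_weights = sentence_token_tfidf(doc)
```

Under this unsmoothed form, a token that occurs in every sentence of the document receives weight 0, so its sentence-token edges carry no weight.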
Meanwhile, we introduce sentence order features to represent the sentence-sentence relationship. The matrix $H \in \mathbb{R}^{m \times n}$ consists of the feature vectors $h_i \in \mathbb{R}^m$ of the $n$ nodes, and the adjacency matrix $A \in \mathbb{R}^{n \times n}$ represents the edge weights between nodes in graph $G$. The relationships of the edges between nodes $p$ and $q$ are formally defined in formulas (4) and (5). $Order(p, p+1)$ denotes the natural reading order of sentences $p \to p+1$ in the text. $Tree(p, q)$ is the relationship between token nodes $p$ and $q$ in the dependency tree. $N_{syntactic}$ is the number of times that tokens $p$ and $q$ are related in the current text (that is, $p$ and $q$ share at least some of their Chinese characters), and $N_{total}$ is the number of times that $p$ and $q$ appear throughout the dataset. To facilitate the description of information transfer in the graph, we divide the CAHG into two subgraphs in Fig. 4. One subgraph describes the transfer of information within sentences, and the other describes the transfer of information across sentences.
The information propagates between and within sentences via the GCN layers, followed by a hierarchical attention layer that keeps high-level domain-specific features. The final text representation is defined in formulas (6)-(8), where $r$ is the final representation of the text, composed of the target sentence representation $O_t$ and the relevant context representation $O_c$, and $\oplus$ denotes the concatenation operation. $H_t^G$ denotes the output of the two-layer GCNs 5. $H_t^G$ is then passed through a one-layer multilayer perceptron (MLP), and $O_t$ is obtained as a weighted sum of the MLP output. The sentence-level attention mechanism that produces $O_c$ mirrors the word-level attention mechanism.
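To make the propagation step concrete, the sketch below applies two graph-convolution layers to a toy graph with one sentence node and two token nodes. It assumes the widely used propagation rule $\mathrm{ReLU}(\hat{D}^{-1/2}(A + I)\hat{D}^{-1/2} H W)$ with a symmetric, non-negative adjacency matrix; the exact layer equations of CASA are not reproduced here. For convenience, nodes are stored as rows (the text stores them as columns), and all function names and numeric values are hypothetical.

```python
import math

def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 of a square weight matrix."""
    n = len(adj)
    # Add self-loops so every node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj_norm, features, weight):
    """One graph convolution: ReLU(A_norm @ H @ W); features has one row per node."""
    hidden = matmul(matmul(adj_norm, features), weight)
    return [[max(0.0, v) for v in row] for row in hidden]

# Toy graph: node 0 is a sentence node, nodes 1 and 2 are its token nodes.
adjacency = [
    [0.0, 0.5, 0.5],  # sentence-token edges (e.g. TF-IDF weights)
    [0.5, 0.0, 1.0],  # token-token edge (e.g. a dependency relation)
    [0.5, 1.0, 0.0],
]
node_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w1 = [[0.5, -0.2], [0.1, 0.3]]
w2 = [[1.0, 0.0], [0.0, 1.0]]

a_norm = normalize_adjacency(adjacency)
two_layer_output = gcn_layer(a_norm, gcn_layer(a_norm, node_features, w1), w2)
```

Stacking two layers lets each node aggregate information from neighbors up to two hops away, which is how a token can be influenced by tokens in an adjacent sentence through the sentence nodes.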
Multi-grain discriminator. Ben-David and Ganin proved that, to reduce the target domain's prediction error through domain adaptation, it is necessary to maximize the discriminator error between the source and target domains 5,48,49. Therefore, adversarial-based transfer learning is widely used to solve the domain adaptation problem. Although this approach has considerable advantages, we observed that the domain discriminator can only reduce the marginal distribution distance, whereas the relationship between marginal and conditional distributions is uncertain. As indicated in 50, minimizing the difference between conditional distributions is critical to the robustness of distribution adaptation. We therefore propose MGD, which consists of a domain discriminator T and an emotional polarity discriminator D. T makes the source and target domains inseparable through domain confrontation. D maximizes the difference between different labels through label consistency identification. When the class discriminator D can accurately identify whether the labels of samples from the source domain and the target domain are the same, the model can learn class-invariant features from the two domains. The domain discriminator loss $L_T$ is defined in terms of the following quantities: $G_T$ denotes the domain discriminator, $G_T(\cdot)$ is its output prediction label, $L_T^i$ is the classification error, $Y_D = [s, t]$ is the set of domain labels, and $s$ and $t$ represent the source and target domains, respectively.
To maximize the extraction of domain-invariant features, we maximize the discrimination error of $G_T$. As suggested in 5, this min-max game is implemented by a gradient reversal layer. Specifically, during gradient back-propagation, $\nabla L_T$ is replaced with $-\eta \nabla L_T$, where $\eta > 0$ is a controllable hyperparameter.
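The gradient reversal step can be illustrated with a framework-free sketch. The class `GradientReversal` below is hypothetical and only mimics the forward and backward behavior on plain Python lists; in practice the same idea is implemented as a custom operation inside an automatic differentiation framework. The value 0 is allowed for `eta` here only so that a warm-up schedule starting at zero can be plugged in.

```python
class GradientReversal:
    """Identity in the forward pass; scales gradients by -eta in the backward pass."""

    def __init__(self, eta):
        if eta < 0:
            raise ValueError("eta must be non-negative")
        self.eta = eta

    def forward(self, features):
        # Features pass through unchanged to the domain discriminator.
        return list(features)

    def backward(self, grad_from_discriminator):
        # The feature extractor receives the reversed, scaled gradient, so a
        # gradient-descent step on it increases the domain discriminator's loss.
        return [-self.eta * g for g in grad_from_discriminator]

grl = GradientReversal(eta=0.5)
passed_features = grl.forward([0.2, -1.0])
reversed_grad = grl.backward([1.0, -2.0])
```

With this layer between the feature extractor and the domain discriminator, a single backward pass trains the discriminator to separate the domains while pushing the extractor toward features that the discriminator cannot separate.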
For the class-consistency loss $L_D$, $I$ denotes an indicator function, $G_D$ denotes the class-consistency discriminator, $G_D(\cdot)$ is its output prediction label, and $L_D^i$ is the classification error. Note that as $L_D$ drops, the distribution difference within the same sentiment polarity across domains also decreases. Consequently, $G_D$ enhances the identifiability of each sentiment polarity.
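One way to picture the indicator function is as the supervision signal for the class-consistency discriminator: each cross-domain pair is labeled 1 if the two samples share a sentiment label and 0 otherwise. The sketch below enumerates all source-target pairs; this exhaustive pairing and the function name `consistency_pairs` are assumptions for illustration, since the text does not specify how pairs are sampled.

```python
from itertools import product

def consistency_pairs(source_labels, target_labels):
    """Return (source_index, target_index, same_class) for every cross-domain pair.

    same_class plays the role of the indicator: 1 if the labels match, else 0.
    """
    return [
        (i, j, int(ys == yt))
        for (i, ys), (j, yt) in product(enumerate(source_labels), enumerate(target_labels))
    ]

# Hypothetical sentiment labels: 0 = neutral, 1 = positive, 2 = negative.
pairs = consistency_pairs([1, 2, 0], [2, 1])
```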
Sentiment classification and training. The vector $r$ obtained by the feature extractor is fed to a fully connected layer followed by a softmax layer to generate class distributions, $P = \mathrm{softmax}(W_p r + b_p)$, where $P \in \mathbb{R}^C$ is the predicted soft distribution, $C$ is the number of classes, and $W_p \in \mathbb{R}^{C \times m}$ and $b_p$ are the trainable weights and bias. The cross-entropy loss is then computed, where $D$ is the index set of labeled documents and $Y$ is the real label matrix. In this way, we obtain the classification loss of the source domain $L_{C_s}$ and the classification loss of the target domain $L_{C_t}$.
Before the training stage, we were motivated by 42, who proposed mutual learning in supervised single-domain tasks. The Kullback-Leibler (KL) divergence from the predicted source class distribution to the predicted target class distribution is calculated, and vice versa. The two KL divergences measure the similarity of the two distributions. Finally, the overall loss function of CASA consists of both the source and target losses. Here, $\eta_T$, $\eta_D$ and a third coefficient are hyperparameters.
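The building blocks named in this paragraph (softmax output, cross-entropy, and the two KL divergences used for mutual learning) can be sketched in plain Python as follows. How these terms are weighted and combined into the overall CASA loss is not reproduced here; the function names and the simple sum of the two KL directions are assumptions for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class index."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_learning_loss(p_source, p_target):
    """Sum of both KL directions between the two predicted class distributions."""
    return kl_divergence(p_source, p_target) + kl_divergence(p_target, p_source)

# Hypothetical logits from the source-side and target-side predictions.
p_s = softmax([2.0, 0.5, -1.0])
p_t = softmax([1.5, 0.7, -0.5])
loss_cls = cross_entropy(p_s, label=0)
loss_kl = mutual_learning_loss(p_s, p_t)
```

Both KL terms are zero only when the two predicted distributions coincide, so minimizing their sum pulls the source-side and target-side predictions toward each other.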

Experiments
Datasets and evaluation indicator. Our target domain dataset is from the Evaluation of Chinese Implicit Sentiment Analysis task at SMP2019 (one of the top academic conferences on social media processing in China). The dataset contains two types of content in each document: the context and the target sentence. We chose four benchmark datasets of explicit sentiment as the source domains: the Weibo-60000 dataset, the hotel review dataset, the SMP-2020 virus Weibo dataset, and the SMP-2020 general Weibo dataset. The data are available at: http://biendata.com/competition/smpecisa2019/, http://www.pudn.com/Download/item/id/3993718.html, http://www.searchforum.org.cn/tansongbo/corpus1.php, https://smp2020.aconf.cn/.
To clean the dataset, we performed some preprocessing. The construction of a heterogeneous graph requires contextual sentence nodes and a dependency tree structure. We performed the following preprocessing, and Table 1 summarizes the statistics: (1) To keep the dependency syntax structure intact, we filter out sentences that have no subject-predicate structure. (2) As suggested by 41,51, sentiment polarity consistency exists between the context semantic background and the target sentence. To match the target domain data granularity, we randomly select a sentence in each source domain document as the target sentence and use the rest as the context.
We compute the classification accuracy and F1 score of every model on the test dataset as evaluation indicators. The F1 score of the $i$-th sentiment polarity is $F1_i = \frac{2 P_i R_i}{P_i + R_i}$, where $P_i$ and $R_i$ are the precision and recall of the $i$-th sentiment polarity. We calculate this metric for each label and take the unweighted mean to obtain the macro-F1 score, i.e., $\text{macro-F1} = \frac{1}{N} \sum_{i \in N} F1_i$. The accuracy is the proportion of samples $x$ whose predicted label $P(x)$ equals the actual label $Y(x)$.
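The two evaluation indicators can be computed as in the following sketch. The function name `macro_f1_and_accuracy` is hypothetical, and treating precision, recall, or F1 as 0 when their denominator is 0 is an assumed convention that the text does not state.

```python
def macro_f1_and_accuracy(y_true, y_pred, labels):
    """Return (macro-F1, accuracy) for parallel lists of true and predicted labels."""
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        # Assumed convention: a metric with a zero denominator counts as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    macro_f1 = sum(f1_scores) / len(labels)  # unweighted mean over labels
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return macro_f1, accuracy

# Hypothetical predictions for a two-class toy example.
macro_f1, accuracy = macro_f1_and_accuracy([0, 0, 1, 1], [0, 1, 1, 1], labels=[0, 1])
```

Because macro-F1 averages per-label scores without weighting by label frequency, a rare sentiment polarity influences the score as much as a frequent one.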

Models for comparison.
To fully verify and understand CASA, we divide the baseline models into two groups for comparison:
Non-transfer. To demonstrate the benefits from heterogeneous graphs, we compare with the following methods without transfer:
• TextCNN 31, TextRNN 52 and BiLSTM+Att 53: These are the basic deep neural networks in sentiment classification.
• TreeLSTM 54: An LSTM network based on a tree structure, which solves the problem of the emotional classification of nonlinear structures such as dependency trees.
• TreeGCN 39: BiLSTM is used to encode the input word vector to obtain the hidden state with context information. It then uses a GCN convolution to obtain the neighboring node information, which enhances the GCN's robustness.
• CASA-T-D: The CASA feature extractor part for examining the ability of CAHG to express text information.
Transfer. To investigate the effectiveness of each part in the CASA, we also compare the following frameworks for experiments. For a fair comparison, we use CASA-T-D as a feature extractor in other methods.

Main result analysis. Comparison with non-transfer.
We compared the model in the context modeling stage with current nontransfer models to explore the representation ability of the heterogeneous graph in CASA. The results are shown in Table 2. We note that (1) the CASA-T-D results are far better than those of the other models. Despite using the same tree structure, the TreeGCN result is 7.4% higher than that of TreeLSTM, which shows that GCN captures deep features more effectively. (2) CASA-T-D and TreeGCN both use GCN convolutions but differ in representation learning. The result of CASA-T-D is 2.02% higher than that of TreeGCN. (3) BiLSTM + Att performs better than TextRNN. One possible reason is that the attention mechanism plays an important role in text classification. Taken together, these results suggest that the superior performance of the CASA-T-D model is mainly due to the CAHG, which gathers rich semantic information.
Comparison with transfer. It can be observed from the experimental results in Table 3 that (1) for the popular technique "fine-tuning", after adding Virus, Usual and Weibo source domain data, the target data accuracy dropped by 0.68%, 1.4%, and 2.94%, respectively. This matches our expectation because the feature distribution gaps between the source and target domains are too large. Fixed source domain parameters cannot be corrected
Table 2. Model comparison results. The state-of-the-art result of each evaluation indicator is in bold. The marker ♦ refers to p-value < 0.05 when compared with TreeGCN in the paired t-test. All models are run over five times with random initializations, and the average precision, recall, macro-F1, and accuracy are reported.
(4) SAFN + achieves the best result in the Usual → Target transfer, which shows that many commonalities between visual domain adaptation and NLP domain adaptation could be mined. (5) In addition, different source domains contribute differently to implicit sentiment recognition. From Table 1, we know that the Virus and Hotel datasets are characterized by a small amount of data and a single topic. In contrast, the Usual and Weibo datasets have a large amount of data on various topics. For CASA, the effect of single-topic transfer (Virus→Target, Hotel→Target) is better than that of multitopic transfer (Usual→Target, Weibo→Target). Virus→Target, with a more similar structure, has a better transfer effect than Hotel→Target.
Ablation study. To further compare the contribution of each CASA component, we collected the ablation data from the main experiment and plotted it as a line chart, as shown in Table 4. From Table 4, we can observe the following.
(1) When the CAHG, class discriminator, and domain discriminator are removed from CASA separately, the experimental results drop by 1.44%, 1.65%, and 2.06%, respectively, but remain higher than that of the nontransfer model CASA-T-D (82.59%). This shows that the fine-grained adjustment contributes to this transfer task and that the coarse-grained adjustment has a greater effect on performance than the heterogeneous structure. (2) When CASA-T-D removes CAHG and CASA* removes MGD, the accuracy drops by 0.78% and 2.16%, respectively, indicating that the proposed CAHG and MGD both have a substantial impact on the experimental results.
Hyperparameter study. In this section, we present how to choose the values of the hyperparameters $\eta_T$, $\eta_D$ and a third coefficient.
Hyperparameter $\eta_T$. Inspired by 5, $\eta_T$ is not a constant but changes from 0 to 1, namely $\eta_T = \frac{2}{1 + \exp(-\alpha \cdot p)} - 1$, where the hyperparameter $\alpha$ is set to 10 as in 5, and $p$ is the relative progress of the iterative process, that is, the current number of training steps divided by the total number of training steps, which changes from 0 to 1 as training proceeds. The above formula means that at the beginning, $\eta_T = 0$, the domain classification loss will
Table 3. Model comparison results with domain adaptation. The state-of-the-art result of each dataset is in bold. The marker † refers to p-value < 0.05 when compared with ML in the paired t-test, while the marker ‡ refers to p-value < 0.05 when compared with SAFN + in the paired t-test. All models are run over five times with random initializations, and the mean results are reported.
Hyperparameters $\eta_D$ and the third coefficient. These are selected on the validation set. First, we check whether $\eta_D$ and the third coefficient are of the same order of magnitude; after verification, the order of magnitude is −1. Therefore, we set their sum to 1 and check the test accuracy when the validation loss is minimal. Figure 5 shows the test set accuracy under different $\eta_D$ values. Panel (a) shows that the optimal range of $\eta_D$ is 0 to 0.2. We therefore ran further experiments, shown in (b), from which we can see that the optimal $\eta_D$ value is 0.10, so the third coefficient is 0.90.
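The warm-up schedule for $\eta_T$ given by the formula above can be written as a short function. The name `eta_t` and the step-based arguments are hypothetical; the default `alpha=10.0` follows the value stated in the text.

```python
import math

def eta_t(step, total_steps, alpha=10.0):
    """Schedule eta_T = 2 / (1 + exp(-alpha * p)) - 1 with p = step / total_steps."""
    p = step / total_steps  # relative training progress in [0, 1]
    return 2.0 / (1.0 + math.exp(-alpha * p)) - 1.0

# Values of eta_T at the start, the midpoint, and the end of training.
schedule = [eta_t(s, 1000) for s in (0, 500, 1000)]
```

The schedule starts at exactly 0 and rises smoothly toward 1, so the reversed domain gradient has little influence on the feature extractor early in training and gains weight as training proceeds.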

Effectiveness verification
Case study. We want to explore what domain-specific information, which has been learned from the source domain, would enhance implicit emotion classification. In Table 5, we visualized the hierarchical attention layer in the nontransfer model CASA-T-D and the full model CASA. Using the CASA-T-D model as a benchmark, we compare the differences in attention weight distribution of different models under the same text.
The first two samples contain only one context sentence and the target sentence. Although the two models' attention scores differ, both attend to the critical points of emotional judgment, such as '差 (poor)' and '伤害 (harm)'. These tokens are domain-independent words that are used in many domains. In the third example, the CASA model learned the virus domain-specific information '抵抗力 (resistance)', but CASA-T-D could not.
Then, we examine the impact of long contextual text on the two models. As shown in the last example, CASA-T-D focuses most of its attention on the token '不到 (not)' in the target sentence and does not notice the token '不到 (good)' in the context. In contrast, the CASA attention scores are relatively scattered, which is conducive to judging emotional polarity and may benefit from source domain knowledge.

Feature visualization.
To better illustrate how the CASA works, we used t-SNE 61 to reduce the dimensionality of the feature to two and visualize the data distributions after domain adaptation Virus → Target. Figure 6 shows that the baseline models (a), (b) and (c) all virtually guarantee the source and target domain data fusion. On the other hand, the distance between the classes in the domain is still very close.
In contrast, benefitting from our proposed MGD, CASA's class boundaries are significantly more distinct, which is conducive to class identification.

Conclusions and future work
This paper proposes a CASA network via graph convolution for the cross-domain implicit sentiment classification problem and is the first to build a relation between explicit and implicit sentiment. Existing studies either rarely consider using domain-specific semantic information or ignore maximizing class-specific decision boundaries. We aim to address these two drawbacks. First, CASA provides a CAHG, which effectively extracts domain-specific semantic information for both the source and target domains. Hence, CASA improves the model's generalization ability for implicit classification tasks. The case study shows that our model can effectively capture high-level domain-specific features. Second, CASA constructs an MGD to adapt the domain distribution, enhancing class distinction at each sentiment polarity decision boundary during domain adaptation. The feature visualization results show that CASA clarifies the boundaries between samples from different classes while adapting to the domain. There remain several challenges worth pursuing in cross-domain implicit sentiment tasks, such as transferring between different single-domain topics, fine-grained sentiment transfer, ambivalence handling, and explicit-to-implicit transfer where the target domain labels are not given. We believe that addressing these challenges can help us comprehend the link between explicit and implicit sentiment and that implicit sentiment analysis will be solved more effectively in the future.
Table 5. Case study. Visualization of attention weight distribution from CASA-T-D and CASA on testing examples. For CASA, the source domain is Virus. Marker signifies correct prediction, while marker × signifies incorrect prediction. Blue denotes the sentence weight, and red denotes the word weight.