Introduction

Sentiment analysis is one of the fundamental problems in natural language processing (NLP), and with the rapid development of social media, it is widely applied in real scenarios such as comment analysis, food safety monitoring, and public opinion mining. Such tasks are usually defined as identifying the emotional polarity (e.g., positive, negative, or neutral) of a given text, sentence, or aspect.

The expression of emotion can be explicit or implicit. Implicit expression of emotion is defined as "a language fragment (sentence, clause or phrase) that expresses subjective sentiment but contains no explicit sentiment word"1,2. We use the following examples to illustrate the difference between the two types of expression:

Explicit:你做的蛋炒饭太好吃了,我很喜欢! (English translation: “The rice fried with eggs is so delicious; I like it very much!” label=positive)

Implicit:这家蛋炒饭有种妈妈的味道! (English translation: “The rice fried with eggs in this restaurant reminds me of my mother!” label=positive)

Example 1 uses the word ‘喜欢 (like)’ to show a clear positive tendency. People also express their views implicitly, as in Example 2, through devices such as metaphor and sarcasm. In this sentence, no explicit emotional words are used, and the individual's emotional tendency is embedded in the semantic meaning of the text. This phenomenon poses an exceptional challenge for implicit sentiment classification. Moreover, given the absence of a large-scale labeled corpus, even an advanced deep learning model cannot achieve ideal classification accuracy on implicit sentiment classification tasks3.

One solution to this problem is cross-domain sentiment classification, which aims to exploit the rich labeled data in a source domain, e.g., an explicit sentiment corpus, to help the sentiment analysis task in another domain that lacks or even has no labeled data, e.g., an implicit sentiment corpus. Recently, relevant models for cross-domain sentiment classification have mainly focused on learning domain-invariant features whose distribution is similar in the source and target domains4. These methods attempt to reduce the discrepancy between domain-specific latent feature representations. Inspired by this idea, most existing adversarial-based methods, e.g., the domain adversarial neural network (DANN)5, reduce feature differences by fooling a domain discriminator and have achieved promising results6,7. However, to achieve explicit-to-implicit positive transfer, these methods still have two major inherent drawbacks that need to be addressed:

  • Existing studies mostly focus on learning domain-invariant information (e.g., ‘喜欢 (like)’, ‘坏 (bad)’, ‘差 (weak)’) and rarely consider the use of domain-specific semantic information (e.g., ‘蛋炒饭 (the rice fried with eggs)’, ‘抵抗力 (resistance)’), which is also helpful for cross-domain implicit sentiment classification7,8. Figure 1 shows that when source domain-specific words appear in the target domain, the semantic knowledge learned from the source domain helps the target domain classification.

  • Traditional adversarial-based models merely minimize the marginal distribution discrepancy between the two domains and ignore maximizing the margins between class-specific decision boundaries. As shown in Fig. 2 (DANN), with traditional domain discriminator training, the features near the decision boundary may be ambiguous and even tangled together, thus hindering adaptation performance.

To tackle the limitations identified above, we turn to graph convolutional networks (GCNs). GCNs have a multilayer architecture in which each layer aggregates node information over the graph structure using the features of immediate neighbors. Unlike sequence learning models, GCNs can directly represent complex structured data; however, sequential free texts are unstructured, so GCN-based text learning must first construct a graph representation from the free text before graph convolution. A GCN thus has the potential to capture domain-specific semantic information with its GCN layers. Recently, GCN models have gained widespread attention and have been successfully deployed on text-word relationships9,10,11 and explicit sentiment analysis12,13. However, these graph-based models only consider intrasentence hierarchical dependency relationships and ignore intersentence semantic associations.

Figure 1

Example of domain-invariant and domain-specific expressions. The sentiment expressions marked by red lines are virus domain-specific, while those marked by broken blue lines are domain-invariant.

Figure 2

A comparison between the traditional domain discriminator (DANN) and the proposed MGD, where minimization means that the distribution difference between the two domains is minimized and maximization means that the distribution difference between classes from different domains is maximized.

Therefore, in this paper, we propose a novel context-aware semantic adaptation network (CASA) for cross-domain implicit sentiment classification via GCNs. To obtain inter- and intrasentence semantic associations, we build a context-aware heterogeneous graph (CAHG). A CAHG is built for each document by regarding tokens and sentences as nodes (hence a heterogeneous graph). Intrasentence propagation is constrained by the syntactic dependency tree, and intersentence propagation is constrained by the sentence sequence order and term frequency-inverse document frequency (TF-IDF) local token cooccurrence information. Information propagates within and across sentences via the GCN layers, followed by an attention mechanism that keeps high-level domain-specific features. We also construct a multigrain discriminator (MGD), which is imposed during domain adaptation to minimize the domain distribution discrepancy and maximize class discrimination. The domain adaptation layer makes the source and target domains inseparable through adversarial training, reducing the representation distribution gap between source and target domain data in a coarse-grained manner. The class adaptation layer uses a classifier to judge whether samples from different domains share the same class, thus distinguishing samples of different classes. Figure 2 illustrates the difference between the traditional domain-adversarial training of DANN5 and MGD.

In short, the main contributions of this paper are as follows:

  1. We propose a new transfer learning model, CASA, for the implicit sentiment classification task via GCNs, which is the first attempt to transfer explicit sentiment knowledge to implicit sentiment classification.

  2. Our model provides a context-aware heterogeneous graph, which can effectively extract inter- and intrasentence semantic information. Moreover, CASA improves the model's generalization ability for implicit classification tasks and increases the discrimination of each sentiment polarity within the domain.

  3. We evaluate CASA on the Chinese implicit sentiment analysis dataset (SMP-ECISA 2019). CASA outperforms existing models with four different source domains. We also provide visualizations demonstrating that the CAHG can capture domain-specific information and that the MGD makes features near decision boundaries more distinguishable.

Related works

Sentiment analysis

The existing sentiment calculation and sentiment analysis methods can be divided into three categories: knowledge-based methods, statistical methods and hybrid methods14.

Knowledge-based methods are popular because of their simplicity and ease of use, but their effectiveness is limited mainly by the depth and breadth of the established knowledge base. Statistical methods based on machine learning and deep learning have been widely used in Chinese sentiment classification, such as a fine-grained sentiment analysis framework15, a multi-label sentiment analysis model16, and aspect-level sentiment analysis based on reinforcement learning17. On the other hand, an increasing number of scholars have recognized the particularity of Chinese characters and have tried to model Chinese radicals18,19,20,21,22. Hybrid methods aim to better describe the rules of emotional expression and realize the machine's perception of semantics23. The CASA model in this paper considers contextual semantic perception and introduces cross-domain explicit emotional knowledge.

The attention mechanism shows good performance in sentiment analysis tasks and improves the interpretability of neural networks by revealing where the model focuses24,25,26,27. These works exemplify recent research on attention-based sentiment analysis.

Various sentiment analysis tasks usually focus on binary classification (positive vs. negative), which cannot fully describe emotions. In contrast, Wang proposed a multi-level emotion perception method with contradiction processing28. However, these sentence-level sentiment analyses cannot be directly applied to implicit sentiment tasks because the emotion of the target sentence can hold different polarities in different contexts.

Implicit sentiment analysis and GCNs

Models based on RNNs29,30 and CNNs31 are widely used in sentiment classification tasks. In RNN-based models, an attention mechanism is usually introduced because each word in the text contributes differently to the classification task32,33,34. CNN-based models35,36 use character-level CNNs to extract semantic information from text. However, these models lack an effective mechanism to capture the information in the dependency tree structure. In the works of37,38,39, structural and semantic information extracted from the tree structure of sentences, such as a dependency tree or grammar tree, by an LSTM or BiLSTM was used for the sentiment classification task. Although Tree-LSTM can extract more accurate semantic information from text, it is difficult to parallelize and takes longer to train. Later, CNNs were introduced into the tree-structure encoding process by40,41. In their work, a phrase structure tree and a syntactic dependency tree were used to encode the semantic information of the target sentence and the context, respectively. However, in the above tree-based convolutional neural network models, sentences are considered to be independent of each other, so information regarding the relationships between sentences is lost. To solve this problem, our model extracts semantic information via GCNs9.

GCN models have attracted widespread attention and have been successfully deployed in NLP tasks. Yao et al.10 build a corpus-level text graph from word-word co-occurrence and document-word relations for text classification. Zhang et al.42 introduce a TreeGCN, where the GCN is used to encode the dependency syntactic structure. Zhang et al.12 present an aspect-based GCN to demonstrate that GCNs can capture long-range word dependencies. Zhang et al.43 employ gated graph neural networks to model document-level word interactions over a graph. In contrast to their works, we regard the tokens and sentences in each document as graph nodes. The graph maintains inter- and intrasentence constraints to capture semantic information, obtaining more accurate text semantics while increasing the interpretability of the model.

Transfer learning in sentiment analysis

Even with a strong deep learning model, the classification accuracy on implicit sentiment classification problems is not ideal in the absence of sufficient labeled data34,41. Yosinski et al.44 discuss the application of transfer learning in deep neural networks for single domains and draw an important conclusion: adding fine-tuning improves the performance of a deep transfer network. In the cross-domain scenario, the difference between the probability distributions of the source domain's data and the target domain's data is significant, and the use of fine-tuning alone may lead to negative transfer. The purpose of domain adaptation, also known as cross-domain learning, is to reduce the difference in the distribution representations of the source and target domain data. In domain adaptation, a direct way to solve the above problem is to use a distribution distance measure, reducing the measured distance between distributions during model training. However, such distance measures are difficult to compute, and many methods based on domain adversarial models have been proposed instead5,6.

However, existing domain adversarial models for sentiment analysis focus on explicit sentiment classification or aspect-level sentiment classification45,46 without considering implicit situations. Due to data scarcity and the task's value, transfer learning is more urgent for implicit sentiment analysis. To the best of our knowledge, CASA is the first explicit-to-implicit transfer learning model. The MGD improves the model's generalization ability for implicit sentiment classification and the discrimination of each sentiment polarity.

Figure 3

Overview of the context-aware semantic adaptation network. The CAHG represents the context background structure in both the source and target domains. The attention layers are hierarchical: token-level attention over the target sentence and sentence-level attention over the context background. There are two GCN layers. MGD denotes the proposed multigrain discriminator. Details on the CAHG and MGD are discussed in later sections.

Methods

Problem definition and approach overview

Domain definition

A domain D consists of a marginal distribution P(x) and an m-dimensional feature space X, namely \(D= \lbrace P(x), X\rbrace\).

Task definition

Given the domain D, a task is composed of a classifier f(x) and a class label set Y, namely \(T=\lbrace f(x), Y \rbrace\), where \(f(x)=Q(y|x)\) represents the conditional probability distribution and \(y \in Y\).

Purpose

Given training data from the source domain \(D_s = \lbrace (x_i,y_i)\rbrace ^{n_s}_{i=1}\) and the target domain \(D_t = \lbrace (x_{n_s+i},y_{n_s+i})\rbrace ^{n_t}_{i=1}\), we assume that there is a difference between the probability distributions \(P(X^s)\) and \(P(X^t)\). Under this setting, the purpose is to learn a prediction function \(f(x):X \rightarrow Y\) that classifies the target examples correctly at test time.

The overall CASA framework is shown in Fig. 3. The proposed approach is divided into two steps: context modeling and transfer learning. Specifically, in the context modeling phase we first build the CAHG to extract semantic relationships regardless of whether the domain is explicit or implicit sentiment. In the transfer learning step, the MGD realizes fine-grained adaptation through a domain discriminator and a label-consistency discriminator, giving the model stronger generalization ability. We present the details of the different components as well as the training process in the following sections.

Bi-GRU encoder

First, we use a bidirectional gated recurrent unit (Bi-GRU)47 to encode the text input from the source or target domain and obtain the contextualized word-level representation \({\mathbf {H}} = [{\mathbf {h}}_1, {\mathbf {h}}_2,\ldots , {\mathbf {h}}_i,\ldots , {\mathbf {h}}_n]\), where m denotes the vector dimension and \({\mathbf {h}}_i \in R^m\) is the hidden state vector at time i. The reason for employing this layer is to correct the information of the syntactic dependency tree, which is built with HANLP (https://www.hanlp.com).
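For concreteness, a minimal PyTorch sketch of such a Bi-GRU encoder is given below; the class name and toy inputs are illustrative (the embedding and hidden sizes follow the values listed later under Implementation details), not the exact implementation used in the paper.

```python
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    """Encode a token-id sequence into contextualized hidden states H = [h_1, ..., h_n]."""
    def __init__(self, vocab_size, emb_dim=200, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # bidirectional GRU: each h_i concatenates forward and backward states (2 * hidden_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids):             # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)  # (batch, seq_len, emb_dim)
        H, _ = self.gru(embedded)             # (batch, seq_len, 2 * hidden_dim)
        return H

# toy usage: a batch of two 5-token sentences with random ids
encoder = BiGRUEncoder(vocab_size=10000)
H = encoder(torch.randint(0, 10000, (2, 5)))
print(H.shape)  # torch.Size([2, 5, 128])
```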

GCNs over the context-aware heterogeneous graph

To capture the text's more valuable information, we construct a document-level heterogeneous graph, which retains intrasentence dependency information and represents intersentence relationships. We define \(G = (V, E)\) as a text graph, where \(|V|=n\) is the number of nodes and V and E represent the node set and the edge set of graph G, respectively. We construct G based on the token-token dependency relationship, TF-IDF, and sentence order. A token's TF-IDF determines the edge weight between the sentence node and the token node in the sentence. We define TF as the frequency of the token within the sentence and IDF as the logarithm of the ratio of the total number of sentences in the text to the number of sentences containing the token. The formal definitions are given in formulas (1)-(3).

$$\begin{aligned} TF\text {-}IDF_{w}&= TF_{w} \times IDF_{w} \end{aligned}$$
(1)
$$\begin{aligned} TF_{w}&= \frac{S_{w}}{S} \end{aligned}$$
(2)
$$\begin{aligned} IDF_w&= \log \left( \frac{Doc}{Doc_{w}+1}\right) \end{aligned}$$
(3)

Here, \(S_w\) and S represent the number of occurrences of token w in the input sentence and the total number of tokens in that sentence, respectively. \(Doc_w\) and Doc represent the number of sentences in the input text that contain the token and the total number of sentences, respectively.
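As an illustration of formulas (1)-(3), the following sketch computes the sentence-token edge weights for an already-tokenized toy document; the function name and the pre-segmented token lists are our own simplification (the paper segments Chinese text with HANLP).

```python
import math
from collections import Counter

def tfidf_edge_weights(sentences):
    """Return {(sentence_index, token): TF-IDF} edge weights between sentence and token nodes."""
    n_sents = len(sentences)
    # Doc_w: number of sentences containing each token
    doc_freq = Counter(tok for sent in sentences for tok in set(sent))
    weights = {}
    for i, sent in enumerate(sentences):
        counts = Counter(sent)                             # S_w per token
        total = len(sent)                                  # S
        for tok, s_w in counts.items():
            tf = s_w / total                               # formula (2)
            idf = math.log(n_sents / (doc_freq[tok] + 1))  # formula (3)
            weights[(i, tok)] = tf * idf                   # formula (1)
    return weights

# toy document of three already-tokenized sentences
doc = [["蛋炒饭", "好吃"], ["妈妈", "的", "味道"], ["蛋炒饭", "像", "妈妈", "的", "味道"]]
print(tfidf_edge_weights(doc))
```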

Figure 4

A toy example of the CAHG. \(\hbox {S}_i\) represents the i-th sentence and \(t_{ij}\) represents the j-th token in the i-th sentence; the intrasentence edges between tokens are dependency relations, and the green color indicates tokens that share at least part of their Chinese characters.

Meanwhile, we introduce sentence sequential features to represent the sentence-sentence relationship. The matrix \({\mathbf {H}} \in R^{m \times n}\) consists of the feature vectors \({\mathbf {h}}_i \in R^m\) of the n nodes, and the adjacency matrix \(A \in R^{n \times n}\) is used to represent the weights between nodes in graph G. The edge weight between nodes p and q is formally defined in formulas (4) and (5):

$$\begin{aligned} A_{p,q}&= \left\{ \begin{array}{ll} D_{p,q}&{}\quad \text {p, q are both tokens} \\ TF\text {-}IDF_{p,q}&{}\quad \text {p is a sentence, q is a token} \\ Order(p,p+1)&{}\quad \text {p, q are both sentences} \\ 0 &{}\quad \text {otherwise} \\ \end{array} \right. \end{aligned}$$
(4)
$$\begin{aligned} D_{p,q}&= \left\{ \begin{array}{ll} Tree(p,q) &{}\quad \text {p, q in the same sentence} \\ \frac{N_{syntactic}}{N_{total}}&{}\quad \text {p, q in different sentences} \\ \end{array} \right. \end{aligned}$$
(5)

\(Order(p,p+1)\) denotes the natural reading order \(p \rightarrow p + 1\) of sentences in the text. Tree(p, q) is the relationship between token nodes p and q in the dependency tree. \(N_{syntactic}\) represents the number of times that tokens p and q are related in the current text (related meaning that p and q share at least part of their Chinese characters), and \(N_{total}\) is the number of times that p and q appear throughout the dataset. To facilitate the description of information transfer in the graph, we divide the CAHG into two subgraphs in Fig. 4: one describes the transfer of information within sentences, and the other describes the transfer of information across sentences.
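The sketch below shows one way the adjacency matrix of formulas (4)-(5) could be assembled. The node ordering convention and the precomputed edge dictionaries are hypothetical placeholders introduced here for illustration, not part of the paper's implementation.

```python
import numpy as np

def build_cahg_adjacency(n_tokens, n_sents, dep_edges, tfidf_edges, cross_sent_edges):
    """
    Node layout (our convention): token nodes 0..n_tokens-1, then sentence nodes.
    dep_edges:        {(p, q): weight}  token-token dependency edges within a sentence
    tfidf_edges:      {(s, q): weight}  sentence-token TF-IDF edges
    cross_sent_edges: {(p, q): weight}  token-token edges across sentences (N_syntactic / N_total)
    """
    n = n_tokens + n_sents
    A = np.zeros((n, n))
    for (p, q), w in {**dep_edges, **cross_sent_edges}.items():   # both branches of formula (5)
        A[p, q] = A[q, p] = w
    for (s, q), w in tfidf_edges.items():                         # sentence-token branch of formula (4)
        s_node = n_tokens + s
        A[s_node, q] = A[q, s_node] = w
    for s in range(n_sents - 1):                                  # Order(p, p+1): reading-order edges
        A[n_tokens + s, n_tokens + s + 1] = 1.0
    return A

# toy usage: 4 token nodes and 2 sentence nodes
A = build_cahg_adjacency(4, 2, {(0, 1): 1.0}, {(0, 0): 0.3, (1, 2): 0.2}, {(1, 3): 0.5})
print(A.shape)  # (6, 6)
```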

The information propagates within and across sentences via the GCN layers, followed by a hierarchical attention layer that keeps high-level domain-specific features. The final text representation is defined as follows:

$$\begin{aligned} \alpha _{t,i}&= \frac{\exp \left( \omega _{t}^{\top } \tanh \left( {\mathbf {W}}_{t} {\mathbf {H}}^G_{t,i}+{\mathbf {b}}_{t}\right) \right) }{\sum _{i^{\prime }} \exp \left( \omega _{t}^{\top } \tanh \left( {\mathbf {W}}_{t} {\mathbf {H}}^G_{t,i^{\prime }}+{\mathbf {b}}_{t}\right) \right) }\end{aligned}$$
(6)
$$\begin{aligned} {\mathbf {O}}_{t}&= \sum _{i} \alpha _{t, i} {\mathbf {H}}^G_{t,i}\end{aligned}$$
(7)
$$\begin{aligned} {\mathbf {r}}&= {\mathbf {O}}_{t} \oplus {\mathbf {O}}_{c} \end{aligned}$$
(8)

In formulas (6)-(8), \({\mathbf {r}}\) is the final representation of the text, composed of the target sentence representation \({\mathbf {O}}_t\) and the relevant context representation \({\mathbf {O}}_c\); \(\oplus\) denotes concatenation. \({\mathbf {H}}_{t}^{G}\) represents the output from the two-layer GCNs5 and is passed through a one-layer multilayer perceptron (MLP); \({\mathbf {O}}_t\) is then obtained by a weighted sum of the MLP outputs. The sentence-level attention that produces \({\mathbf {O}}_c\) mirrors the token-level attention.
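A compact PyTorch sketch of a two-layer GCN followed by the token-level attention pooling of formulas (6)-(7) is shown below. The normalized adjacency \(\hat{A}\) (with self-loops), the ReLU nonlinearity, and the layer dimensions are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNWithAttention(nn.Module):
    def __init__(self, in_dim=128, hid_dim=128):
        super().__init__()
        self.W1 = nn.Linear(in_dim, hid_dim)   # first graph convolution weight
        self.W2 = nn.Linear(hid_dim, hid_dim)  # second graph convolution weight
        # attention parameters of formula (6)
        self.W_t = nn.Linear(hid_dim, hid_dim)
        self.omega = nn.Linear(hid_dim, 1, bias=False)

    def forward(self, H, A_hat):
        # two graph convolution layers: H^G = ReLU(A_hat ReLU(A_hat H W1) W2)
        H = F.relu(A_hat @ self.W1(H))
        H_g = F.relu(A_hat @ self.W2(H))
        # attention pooling: alpha_i proportional to exp(omega^T tanh(W_t h_i + b_t))
        scores = self.omega(torch.tanh(self.W_t(H_g)))   # (n_nodes, 1)
        alpha = torch.softmax(scores, dim=0)
        O = (alpha * H_g).sum(dim=0)                      # weighted sum, formula (7)
        return O

# toy graph with 6 nodes and an identity placeholder for the normalized adjacency
H = torch.randn(6, 128)
A_hat = torch.eye(6)
print(GCNWithAttention()(H, A_hat).shape)  # torch.Size([128])
```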

Multi-grain discriminator

Ben-David and Ganin proved that, to reduce the target domain's prediction error through domain adaptation, it is necessary to maximize the error of a discriminator between the source and target domains5,48,49. Therefore, adversarial-based transfer learning is widely used to solve the domain adaptation problem. Although it has considerable advantages, we observe that the domain discriminator can only reduce the marginal distribution distance.

However, the relationship between marginal distributions and conditional distributions is uncertain. As indicated in50, minimizing the difference between conditional distributions is critical to the robustness of distribution adaptation. We therefore propose the MGD, which consists of a domain discriminator T and an emotional polarity discriminator D. T makes the source and target domains inseparable through domain adversarial training, while D maximizes the difference between different labels through label-consistency identification. When the class discriminator D can accurately identify whether the sample labels from the source domain and the target domain are the same, the model can learn class-invariant features from the two domains. The formulas are as follows:

$$\begin{aligned} G_{T}&= softmax({\mathbf {W}}_{T}{\mathbf {r}}_{\cdot }+{\mathbf {b}}_{T}),\quad \cdot \in \lbrace s,t\rbrace \end{aligned}$$
(9)
$$\begin{aligned} L_{T_{i}}&= -Y^{D}_{i}\log (G_{T}({\mathbf {r}}_{i}))-(1-Y^{D}_{i})\log (1-G_{T}({\mathbf {r}}_{i})) \end{aligned}$$
(10)

where \(G_{T}\) denotes the domain discriminator, \(G_{T}(\cdot )\) is the output prediction labels, \(L_{T_{i}}\) is the classification error, \(Y^{D}=[s,t]\) is a set of domain labels, and s and t represent the source and target domains, respectively.

To maximize the extraction of domain-invariant features, we hope to maximize the discrimination error of \(G_{T}\). As suggested in5, this min-max game is implemented by a gradient reversal layer. Specifically, during gradient back-propagation, we change \(\nabla L_{T}\) into \(-\eta \nabla L_{T}\), where \(\eta > 0\) is a controllable hyperparameter.
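A minimal PyTorch sketch of such a gradient reversal layer is shown below, following the standard DANN recipe; the class and variable names are ours.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -eta in the backward pass."""
    @staticmethod
    def forward(ctx, x, eta):
        ctx.eta = eta
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # turn grad(L_T) into -eta * grad(L_T) before it reaches the feature extractor
        return -ctx.eta * grad_output, None

def grad_reverse(x, eta=1.0):
    return GradReverse.apply(x, eta)

# usage: features flow unchanged forward, but gradients from the domain loss are reversed
r = torch.randn(4, 256, requires_grad=True)
reversed_features = grad_reverse(r, eta=0.5)   # feed these to the domain discriminator
```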

$$\begin{aligned} G_{D}&= softmax({\mathbf {W}}_{D}({\mathbf {r}}_{s} \oplus {\mathbf {r}}_{t})+ {\mathbf {b}}_{D}) \end{aligned}$$
(11)
$$\begin{aligned} L_{D_{i}}&= -I \log (G_{D}({\mathbf {r}}_{i}))-(1-I) \log (1-G_{D}({\mathbf {r}}_{i})) \end{aligned}$$
(12)

where I denotes an indicator function:

$$\begin{aligned} I = \left\{ \begin{array}{ll} 1 &{} label({\mathbf {r}}_s)=label({\mathbf {r}}_t) \\ 0 &{} label({\mathbf {r}}_s)\ne label({\mathbf {r}}_t) \\ \end{array} \right. \end{aligned}$$
(13)

Here, \(G_{D}\) denotes the class-consistency discriminator, \(G_{D}(\cdot )\) is its output prediction, and \(L_{D_{i}}\) is the classification error. Note that as \(L_{D}\) drops, the distribution difference between domains for the same sentiment polarity also decreases. Finally, \(G_{D}\) enhances the distinguishability of each sentiment polarity.
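To make the class adaptation layer concrete, the sketch below pairs source and target representations, concatenates them as in formula (11), and trains against the indicator of formula (13). The one-to-one pairing of a source batch with a target batch and the feature dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassConsistencyDiscriminator(nn.Module):
    """G_D of formulas (11)-(13): predicts whether a (source, target) pair shares the same label."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.fc = nn.Linear(2 * feat_dim, 2)   # two outputs: different-class vs. same-class

    def forward(self, r_s, r_t):
        return self.fc(torch.cat([r_s, r_t], dim=-1))

def class_consistency_loss(disc, r_s, y_s, r_t, y_t):
    logits = disc(r_s, r_t)
    I = (y_s == y_t).long()            # indicator of formula (13)
    return F.cross_entropy(logits, I)  # cross-entropy form of formula (12)

# toy batch of paired source/target samples with three sentiment classes
disc = ClassConsistencyDiscriminator()
r_s, r_t = torch.randn(8, 256), torch.randn(8, 256)
y_s, y_t = torch.randint(0, 3, (8,)), torch.randint(0, 3, (8,))
print(class_consistency_loss(disc, r_s, y_s, r_t, y_t).item())
```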

Sentiment classification and training

The vector \({\mathbf {r}}\) obtained by the feature extractor is sent to the cascade of the fully connected layer and softmax layer to generate class distributions. The formula is as follows:

$$\begin{aligned} P= softmax({\mathbf {W}}_{p}{\mathbf {r}}+ {\mathbf {b}}_{p}) \end{aligned}$$
(14)

where \(P \in R^C\) represents the predicted soft distribution, C is the number of classes, and \({\mathbf {W}}_{p} \in R^{C \times m}\) and \({\mathbf {b}}_{p}\) represent the trainable weights and biases. The cross-entropy loss function is given as:

$$\begin{aligned} L_C=-\sum _{i \in D}\sum _{j=1}^{C}Y_{ij}\log P_{ij} \end{aligned}$$
(15)

Here, D is the index set of labeled documents and Y is the true label matrix. We thus obtain the classification loss of the source domain \(L_{C_{s}}\) and the classification loss of the target domain \(L_{C_{t}}\).
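A minimal sketch of this classification head and its cross-entropy loss, corresponding to formulas (14)-(15); the feature dimension and class count here are illustrative.

```python
import torch
import torch.nn as nn

class SentimentClassifier(nn.Module):
    """Fully connected layer plus softmax, as in formula (14)."""
    def __init__(self, feat_dim=256, n_classes=3):
        super().__init__()
        self.fc = nn.Linear(feat_dim, n_classes)

    def forward(self, r):
        return torch.softmax(self.fc(r), dim=-1)   # predicted soft distribution P

clf = SentimentClassifier()
r = torch.randn(4, 256)                            # text representations from the feature extractor
y = torch.randint(0, 3, (4,))                      # gold labels
P = clf(r)
loss = -torch.log(P[torch.arange(4), y]).mean()    # cross-entropy of formula (15)
print(loss.item())
```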

$$\begin{aligned} D_{KL}(P_s\parallel P_t)=\sum _{i=1}^{N^s}\sum _{c=1}^{C}p_{s}^{c} ({\mathbf {r}}_i)\log \left( \frac{p_{s}^{c}({\mathbf {r}}_i)}{p_{t}^{c}({\mathbf {r}}_i)}\right) \end{aligned}$$
(16)

Our training procedure is motivated by42, who proposed mutual learning for supervised single-domain tasks. The Kullback-Leibler (KL) divergence between the predicted source class distribution and the predicted target class distribution is calculated, and vice versa; the two KL divergences measure the similarity of the two distributions. Finally, the overall loss function in CASA consists of both the source and target losses: \(L_{s}=L_{C_{s}}-\eta _{T}L_{T_{s}} + \eta _{D}L_{D_{s}}+\lambda D_{KL}(P_s\parallel P_t)\) and \(L_{t}=L_{C_{t}}-\eta _{T}L_{T_{t}} + \eta _{D}L_{D_{t}}+\lambda D_{KL}(P_t\parallel P_s)\), where \(\eta _{T}\), \(\eta _{D}\) and \(\lambda\) are hyperparameters.
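The sketch below combines the per-domain losses as stated above. How source and target predictions are paired within a batch for the KL terms is our assumption, since the paper does not spell out the batching.

```python
import torch
import torch.nn.functional as F

def overall_losses(P_s, P_t, L_Cs, L_Ct, L_Ts, L_Tt, L_Ds, L_Dt, eta_T, eta_D, lam):
    """Combine classification, domain, class-consistency, and mutual-learning KL losses."""
    kl_s_t = F.kl_div(P_t.log(), P_s, reduction="batchmean")   # D_KL(P_s || P_t)
    kl_t_s = F.kl_div(P_s.log(), P_t, reduction="batchmean")   # D_KL(P_t || P_s)
    L_s = L_Cs - eta_T * L_Ts + eta_D * L_Ds + lam * kl_s_t
    L_t = L_Ct - eta_T * L_Tt + eta_D * L_Dt + lam * kl_t_s
    return L_s, L_t

# toy usage with placeholder scalar losses and random class distributions
P_s = torch.softmax(torch.randn(4, 3), dim=1)
P_t = torch.softmax(torch.randn(4, 3), dim=1)
print(overall_losses(P_s, P_t, *(torch.tensor(1.0) for _ in range(6)), 0.5, 0.1, 0.9))
```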

Experiments

Datasets and evaluation indicator

Our target domain dataset is from the Evaluation of Chinese Implicit Sentiment Analysis task at SMP 2019 (one of the top academic conferences on social media processing in China). Each document in the dataset contains two types of content: the context and the target sentence. We chose four benchmark explicit sentiment datasets as source domains: the Weibo-60000 dataset, the hotel review dataset, the SMP-2020 virus Weibo dataset, and the SMP-2020 general Weibo dataset. The data are available at: http://biendata.com/competition/smpecisa2019/, http://www.pudn.com/Download/item/id/3993718.html, http://www.searchforum.org.cn/tansongbo/corpus1.php, https://smp2020.aconf.cn/.

The construction of the heterogeneous graph requires contextual sentence nodes and a dependency tree structure, so we cleaned the data with the following preprocessing steps; Table 1 summarizes the resulting statistics:

  (1) To keep the dependency syntax structure intact, we filter out sentences that have no subject-predicate structure.

  (2) As suggested by41,51, sentiment polarity consistency exists between the contextual semantic background and the target sentence. To match the target domain data granularity, we randomly select one sentence in each source domain document as the target sentence and treat the rest as the context.

Table 1 Statistics of the target domain and source domain datasets.

We compute each model's classification accuracy and F1 score on the test set as evaluation indicators. The F1 score and accuracy are calculated as follows:

$$\begin{aligned} F1_{i}&= \frac{2 \times P_i \times R_i }{P_i +R_i } \end{aligned}$$
(17)
$$\begin{aligned} \text {Accuracy }&= \frac{\left| \lbrace x: x \in D_{t}, P(x)=Y(x)\rbrace \right| }{\left| \lbrace x: x \in D_{t}\rbrace \right| } \end{aligned}$$
(18)

where \(P_i\) and \(R_i\) denote the precision and recall of the i-th sentiment polarity. From the above equations, we calculate the metric for each label and take the unweighted mean to obtain the macro-F1 score, i.e., \(\text {macro-F1}=\frac{1}{N}\sum _{i=1}^{N}F1_{i}\). P(x) is the predicted label and Y(x) the true label of sample x.
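For reference, a small self-contained computation of accuracy and macro-F1 following formulas (17)-(18); the three-class label set mirrors the sentiment polarities, and the function name is ours.

```python
def accuracy_and_macro_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Accuracy and macro-F1 as in formulas (17)-(18); labels are the sentiment polarities."""
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        n_pred = sum(1 for p in y_pred if p == c)
        n_true = sum(1 for t in y_true if t == c)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true if n_true else 0.0
        f1_scores.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)
    return acc, sum(f1_scores) / len(f1_scores)

print(accuracy_and_macro_f1([0, 1, 2, 1], [0, 1, 1, 1]))
```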

Models for comparison

To fully verify and understand CASA, we divide the baseline models into two groups for comparison:

Non-transfer

To demonstrate the benefits from heterogeneous graphs, we compare with the following methods without transfer:

  • TextCNN31, TextRNN52 and BiLSTM\(+\)Att53: These are the basic deep neural networks in sentiment classification.

  • TreeLSTM54: An LSTM network based on a tree structure, which solves the problem of the emotional classification of nonlinear systems such as dependent trees.

  • TreeGCN39: BiLSTM is used to encode the input word vector to obtain the hidden state with context information. It then uses a GCN convolution to obtain the neighboring node information, which enhances the GCN’s robustness.

  • CASA-T-D: The feature extractor part of CASA alone, used to examine the ability of the CAHG to represent text.

Transfer

To investigate the effectiveness of each part of CASA, we also compare against the following frameworks. For a fair comparison, we use CASA-T-D as the feature extractor in the other methods.

  • Fine-tuning: Initialize the CASA-T-D parameters randomly, then train on the source domain dataset, and finally fix the parameters and fine-tune the model in the target domain dataset.

  • DANN\(^+\)5: This model adopts the idea of domain-adversarial training, with a feature extractor and a domain discriminator.

  • CCSA55: A unified framework for supervised domain adaptation is created.

  • d-SNE56: A distance-metric-based method that has achieved strong transfer results on image benchmark datasets.

  • DAS\(^+\)45: This model employs two regularizations, entropy minimization and self-ensemble bootstrapping, to refine its classifier while minimizing the domain divergence.

  • DAAN\(^+\)57: DAAN is a dynamic adversarial adaptation network.

  • SAFN\(^+\)58: SAFN achieves state-of-the-art results across many visual domain adaptation benchmarks.

  • ML42: We apply standard mutual learning to our task directly. ML allows the source and target domain models to collaborate and teach each other throughout the training process.

The original DANN, DAS, DAAN and SAFN are unsupervised domain adaptation models. As suggested in59, we use their source code and extend them to DANN\(^+\), \(\hbox {DAS}^+\), \(\hbox {DAAN}^+\) and \(\hbox {SAFN}^+\), which utilize target-domain supervision; all of them show improved performance.

Table 2 Model comparison results. The state-of-the-art result of each evaluation indicator is bolded.

Implementation details

The word embeddings are 200-dimensional. The GRU hidden layer size is 64, the batch size is 64, the GCN hidden layer size is 128, and the initial learning rate is 0.001. \(\eta _{T}\) is not constant, and \(\eta _{D}\) and \(\lambda\) are set to 0.1 and 0.9, respectively; details on the hyperparameters are discussed in later sections. For a fair comparison of experimental results, the feature extractors of all transfer learning models are set to CASA-T-D. The Adam optimizer60 is used to train for up to 30 epochs, and the loss value is output every 100 batches. Training is stopped early when the validation loss does not decrease for ten consecutive evaluations.

Main result analysis

Comparison with non-transfer

We compared the context modeling stage of our model with current nontransfer models to explore the representation ability of the heterogeneous graph in CASA. The results are shown in Table 2.

We note the following. (1) The CASA-T-D results are far better than those of the other models. Despite using the same tree structure, TreeGCN's results are 7.4% higher than those of TreeLSTM, which shows that the GCN captures deep features more effectively. (2) CASA-T-D and TreeGCN both use GCN convolutions but differ in representation learning; CASA-T-D's result is 2.02% higher than that of TreeGCN. (3) BiLSTM \(+\) Att performs better than TextRNN. One possible reason is that the attention mechanism plays an important role in text classification. These results suggest that the superior performance of CASA-T-D is mainly due to the CAHG, which gathers rich semantic information.

Table 3 Model comparison results with domain adaptation.
Table 4 Ablation study results.

Comparison with transfer

The following can be observed from the experimental results in Table 3. (1) For the popular fine-tuning technique, after adding the Virus, Usual and Weibo source domain data, the target accuracy drops by 0.68%, 1.4%, and 2.94%, respectively. This matches our expectation, because the feature distribution gap between the source and target domains is too large: the fixed source-domain parameters cannot be corrected sufficiently during fine-tuning, so negative transfer occurs. (2) We tried to apply advanced visual domain adaptation models to this task, but most of the results are not outstanding. CCSA, d-SNE and \(\hbox {DAAN}^+\) even show negative transfer on individual datasets; for example, the accuracy of CCSA and d-SNE on Virus \(\rightarrow\) Target drops by 0.27 and 0.68, respectively. This may be caused by the gap between the image and natural language processing fields. (3) The target domain performance improves with the domain adaptation models CASA, \(\hbox {DANN}^+\), \(\hbox {DAS}^+\), \(\hbox {SAFN}^+\) and ML for all source domains, which shows that the knowledge learned from explicit sentiment helps the recognition of implicit sentiment.

(4) \(\hbox {SAFN}^+\) achieves the best result in the transfer of Usual \(\rightarrow\) Target, which shows that many commonalities between visual domain adaptation and NLP domain adaptation remain to be mined. (5) In addition, different source domains contribute differently to implicit sentiment recognition. From Table 1, we know that the Virus and Hotel datasets have a small amount of data but a single topic, whereas the Usual and Weibo datasets have a large amount of data but various topics. For CASA, the effect of single-topic transfer (Virus\(\rightarrow\)Target, Hotel\(\rightarrow\)Target) is better than that of multitopic transfer (Usual\(\rightarrow\)Target, Weibo\(\rightarrow\)Target), and Virus\(\rightarrow\)Target, whose data structure is more similar to the target, transfers better than Hotel\(\rightarrow\)Target.

Ablation study

To further compare the contribution of each CASA component, we summarize the ablation results of the main experiment in Table 4, from which we can observe the following.

(1) When the CAHG, class discriminator, and domain discriminator of CASA are removed separately, the experimental results drop by 1.44%, 1.65%, and 2.06%, respectively, yet remain higher than those of the nontransfer model CASA-T-D (82.59%). This shows that the fine-grained adaptation contributes to this transfer task and that the coarse-grained adaptation contributes more than the heterogeneous graph structure. (2) When CASA-T-D removes the CAHG and CASA* removes the MGD, the accuracy drops by 0.78% and 2.16%, respectively, indicating that the proposed CAHG and MGD both have a substantial impact on the results.

Figure 5

Hyperparameter study. For all \(\eta _D\) values, the source domain is Virus. Accuracy is reported as the average over 3 runs with random initialization.

Table 5 Case study. Visualization of attention weight distribution from CASA-T-D and CASA on testing examples. For CASA, the source domain is Virus. Marker \(\checkmark\) signifies correct prediction, while marker × signifies incorrect prediction. Blue denotes the sentence weight, and red denotes the word weight.

Hyperparameter study

In this section, we will present how to choose the value of hyperparameters \(\eta _T\), \(\eta _D\) and \(\lambda\).

Hyperparameter \(\eta _T\)

Inspired by5, \(\eta _T\) is not a constant but changes from 0 to 1, namely \(\eta _{T}=\frac{2}{1+\exp (-\alpha \cdot p)}-1\). The hyperparameter \(\alpha\) is set to 10 as in5, and p is the relative progress of training, i.e., the current number of training steps divided by the total number of training steps, which grows from 0 to 1 as training proceeds. This schedule means that at the beginning, \(\eta _{T}= 0\), the domain classification loss is not passed back to the feature extractor network and only the domain classifier is trained; as training progresses, \(\eta _{T}\) gradually increases, and the feature extractor begins to generate features that confuse the domain classifier.
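A short sketch of this schedule, with \(\alpha = 10\) as stated above; the function name is ours.

```python
import math

def eta_T_schedule(step, total_steps, alpha=10.0):
    """eta_T = 2 / (1 + exp(-alpha * p)) - 1, where p = step / total_steps grows from 0 to 1."""
    p = step / total_steps
    return 2.0 / (1.0 + math.exp(-alpha * p)) - 1.0

print([round(eta_T_schedule(s, 100), 3) for s in (0, 25, 50, 100)])  # 0.0 rising towards ~1.0
```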

Hyperparameters \(\eta _D\) and \(\lambda\)

These are selected on the validation set. We first judge whether \(\eta _D\) and \(\lambda\) are of the same order of magnitude; validation shows that the order of magnitude is \(-1\). We therefore set \(\eta _D + \lambda = 1\) and check the test accuracy at the point of minimum validation loss.

Figure 5 shows the test set accuracy under different \(\eta _D\) values. Panel (a) shows that the optimal range of \(\eta _D\) is 0 to 0.2, so we conducted further experiments, shown in panel (b). From panel (b), the optimal \(\eta _D\) value is 0.10, so \(\lambda\) is 0.90.

Figure 6

t-SNE visualization of the distribution of the features after different domain adaptation methods. Different colors indicate different sentiment polarities. (source domain: positive = dodger blue, negative = light coral; target domain: positive = light green, negative = gold).

Effectiveness verification

Case study

We want to explore what domain-specific information learned from the source domain enhances implicit sentiment classification. In Table 5, we visualize the hierarchical attention layer of the nontransfer model CASA-T-D and of the full model CASA. Using CASA-T-D as a benchmark, we compare the attention weight distributions of the two models on the same text.

The first two samples contain only one context sentence and the target sentence. Although the two models' attention scores differ, both note the critical cues for emotional judgment, such as ‘差 (poor)’ and ‘伤害 (harm)’. These are domain-independent words that are used in many domains. In the third example, the CASA model learns the virus domain-specific token ‘抵抗力 (resistance)’, while CASA-T-D does not.

Then, we examine the impact of long contextual text on the two models. As shown in the last example, CASA-T-D focuses most of its attention on the token ‘不到 (not)’ in the target sentence and does not notice the informative tokens in the context. In contrast, CASA's attention scores are more dispersed, which is conducive to judging the emotional polarity and may benefit from source-domain knowledge.

Feature visualization

To better illustrate how CASA works, we use t-SNE61 to reduce the feature dimensionality to two and visualize the data distributions after Virus \(\rightarrow\) Target domain adaptation.

Figure 6 shows that the baseline models in (a), (b) and (c) all roughly fuse the source and target domain data; however, the distances between classes within each domain remain very small.

In contrast, benefitting from the proposed MGD, the boundaries between classes in CASA's feature space are significantly clearer, which is conducive to class identification.

Conclusions and future work

This paper proposes CASA, a network based on graph convolution for the cross-domain implicit sentiment classification problem, which is the first to build a bridge between explicit and implicit sentiment. Existing studies rarely consider using domain-specific semantic information and ignore maximizing class-specific decision boundaries; we aim to address these two drawbacks. First, CASA provides a CAHG that effectively extracts domain-specific semantic information for both the source and target domains, improving the model's generalization ability for implicit classification tasks. The case study shows that our model can effectively capture high-level domain-specific features. Second, CASA employs an MGD to adapt the domain distribution, enhancing the class distinction at each sentiment polarity decision boundary during domain adaptation. The feature visualization results show that CASA clarifies the boundaries between samples of different classes while adapting the domains.

Moreover, several challenges in cross-domain implicit sentiment analysis remain worth exploring, such as transfer between different single-domain topics, fine-grained sentiment transfer, ambivalence handling, and explicit-to-implicit transfer where the target domain labels are not given. We believe that studying these factors can deepen our understanding of the link between explicit and implicit sentiment and that implicit sentiment analysis will be solved more effectively in the future.