The rapid growth of electronic documents in the digital environment offers a new source for understanding and serving online users better. More attention has been paid to extracting users’ opinions towards various events from the content of texts. The process of computationally identifying and categorizing opinions expressed in a piece of text has been highlighted as the first step of data mining. Scholars have tried to extract the positive, negative, or neutral attitude toward a particular topic or product from the text (Feng et al., 2021) and have recently started to label texts with multi-dimensional emotional tendencies, such as joy, fear, and rage (Hu and Flaxman, 2018; Tasmin, 2018). Various methods have been applied to classify the emotions of texts, but methods based on machine learning have attracted the most attention (Chen and Zhang, 2018). Although earlier methods based on dictionaries of emotions allow segmentation and classification of words for the analysis of complicated emotions, preparing a dictionary of emotions is labor-intensive and time-consuming and can hardly keep up with the fast emergence of new words (Ai et al., 2018). Machine-learning algorithms, by contrast, recognize emotional words in texts automatically and thus achieve classification more quickly. However, the sequential process that machine-learning algorithms follow makes it difficult to label multiple emotions and lets errors in earlier steps weigh heavily on later steps (Ullah et al., 2022). Problems such as the decline of classifier performance as emotion categories are refined, the neglect of the relationship between sentences and the whole text, and the difficulty of recognizing complex human emotions are also prompting scholars to keep adjusting the algorithms to enhance their performance.

One of the key directions for improving machine-learning algorithms for emotion classification is multi-label classification. Multi-label classification means that an instance can be classified into multiple categories at the same time; that is, it can be marked by multiple labels. In practical applications, the semantics of real objects or real texts are often not unique, which leads to the need for multi-label learning. Mainly by proposing new emotional dictionaries, some pioneering research has made remarkable attempts in the field of multi-label classification (Yang et al., 2014; Liu and Chen, 2015). However, classifying emotions with multiple labels remains very difficult for algorithms. Compared with SVM and Bayesian algorithms, K-Nearest Neighbors (KNN) algorithms perform best as multi-label classifiers and are easier to construct (Keshtkar and Inkpen, 2012). The problem remains that iterative corrections cannot be achieved for emotion classification, even with KNN algorithms.

In response to the above knowledge gaps, this study adjusts the Multi-label K-Nearest Neighbors (MLkNN) classifier and considers not only individual in-sentence features but also the features in the adjacent sentences and the full text of the tweet. Furthermore, this study considers the interaction between labels and iteratively updates the overall classification results. Such adjustments allow iterative corrections of the multi-label emotion classification and could improve the accuracy of emotion classification for short texts. Tweets are chosen in this study as a representative source of short texts. Among all text classification tasks, short text classification is a special subdomain of increasing importance. Since people now use short sentences more frequently to express opinions or share ideas with others, short text classification has become essential in author recognition, spam filtering, sentiment analysis, Twitter personalization, customer review, and other applications related to social networks (Liang et al., 2020). Therefore, there is an expanding need for sentiment analysis of short texts on the internet.

The rest of the study is organized as follows: existing emotion classification methods in the literature are summarized in section “Related work”. The adjustments to the MLkNN emotion analysis method are presented in section “An improved MLkNN for emotion classification of short texts”, followed by the experiments and results in section “Experimental study”. Lastly, the discussions and conclusions are presented in sections “Discussion” and “Conclusions”, respectively.

Related work

Human emotion has been a research hotspot for scholars since ancient times. In the era of information, emotional signals and sentiment tendencies also attract more attention when scholars extract textual features from the content of online texts to understand and serve users better. The specific task of sentiment classification is to identify the subjective views expressed in the specified text and judge the emotional tendencies of the text (Rajabi et al., 2020; Li et al., 2016; Fei et al., 2020). As the understanding of texts has accumulated, two types of emotion classification have been highlighted. One is the classification of emotions according to emotional polarities, that is, positive, negative, or neutral attitudes. The other is the classification of emotions according to emotional tendencies, which generally follows the emotion wheel proposed by the American psychologist Plutchik (Hu and Flaxman, 2018; Tasmin, 2018). The introduction of emotional tendencies increases the granularity of emotion classification and leads to a review of the emotion classification methods.

Among the emotion classification methods, the most used and better-performing methods are mostly based on dictionaries of emotions and based on machine learning (Chen and Zhang, 2018). Emotion classification by a dictionary of emotions is a classical method with both theoretical and practical achievements (Ai et al., 2018). The main implementation processes of emotion classification include segmenting words in the text to be classified and carrying out keyword matching and other operations on these words to realize emotion classification. Ma et al. (2005) first applied the dictionary-based method to the instant messaging system. On this basis, Aman and Szpakowicz (2007) proposed a classification method by adding the emotion intensity knowledge base to the original dictionary and achieved an accuracy of more than 66% in the emotion classification task of the blog corpus. Paltoglou and Thelwall (2012) used the dictionary of emotions method to calculate the negative words, capital letters, emotional polarity, and their strength changes in the linguistic field. The accuracy rate of this method could reach 86.5% when applied to short texts on platforms such as Twitter and MySpace. Taboada et al. (2011) further expanded the dictionary of emotional features and topic-related features of the text. The accuracy of the improved method in the experiment of the Twitter corpus could reach 85.6%.

Due to the long training time of many networks, some researchers have defined several dictionaries related to emotional words, such as attitude dictionary, negative dictionary, degree dictionary and connective dictionary. In addition, there are more complex methods in recent years, such as the emotion classification method based on rules (Yan et al., 2018). These works include the classification method based on mutual information (Liu et al., 2021), the emotion classification method based on physiological signals (Shu et al., 2018), as well as the upgrading of neural networks (Tang et al., 2021). The classification method based on a dictionary of emotions could reflect the unstructured features of the text and have a high utilization rate of emotional words, but its problems are also obvious: the scarcity of corpus resources, the low update frequency of emotional words, and the inability to capture new words or deformed words. Ideal classification requires higher rates of coverage and labeling accuracy of emotional words in the dictionary. Moreover, dictionaries are highly dependent on the domain, time, language and other conditions, thus difficult to expand.

In recent years, the rapid development of machine-learning methods has offered a new way of classifying emotion in text. Two types of schemes have been applied: supervised and semi-supervised. The common features used for emotion classification in the existing supervised learning schemes mainly include word-level, sentence-level and chapter-level features (Dogan and Uysal, 2020). Keshtkar and Inkpen (2012) adopted multi-level analysis ideas to analyze the mood of bloggers and achieved stratified emotion analysis for more than 100 mood labels. Semi-supervised learning schemes differ from supervised learning schemes, which require a large number of labeled samples. Semi-supervised learning can utilize a large number of unlabeled samples, which can improve the classifier performance and reduce the dependence on labeled sample sets. Presently, the existing semi-supervised learning methods in the field of emotion classification include the semi-supervised emotion classification schemes based on multinomial Bayes, discrete binary semi-supervised learning, the Emoji space model, and dual view label propagation (Sintsova et al., 2014). The main advantages of these methods are that they do not depend on a large number of labeled samples and can easily obtain a large amount of newly labeled data as training samples through learning. They perform well when labeled datasets are scarce. However, their disadvantages are also obvious. They are very sensitive to the results of the first round of classification: the samples that are not correctly classified in the first classification process will greatly affect the accuracy of the second classification.

New neural networks have also been applied as feature extractors. Liao et al. (2021) proposed a novel two-stage fine-grained text-level sentiment analysis model based on syntactic rule matching and deep semantics. Combining the multi-head attention mechanism in Transformer, Lou et al. (2020) proposed a fusion model of a convolutional neural network and a hierarchical attention coding network to avoid the sequential processing of Recurrent Neural Networks (RNN), which were widely used as the feature extractor for fine-grained sentiment analysis. The self-attention-based Bidirectional Long Short-Term Memory (BiLSTM) model with aspect item information for fine-grained sentiment classification of short texts introduced by Xie et al. (2019) allowed effective use of contextual information and semantic features. A recent study of a bidirectional convolutional RNN adopted bidirectional feature extraction to group features and enhanced the important features in each group while weakening the less important features to improve the classification accuracy (Onan, 2022). Jiang et al. (2022) mixed Bidirectional Encoder Representation from Transformers (BERT), BiLSTM and Text Convolutional Neural Network (TextCNN) into a new model, achieving not only the capture of local correlations in contexts, but also high accuracy and stability. However, machine-learning methods still face the following defects. First, they often rely excessively on manually labeled corpora and cannot achieve good results when the sample set is small. Second, unsupervised learning is still rarely applied in the field of sentiment analysis.

Recently, scholars have started to realize that an individual’s real emotions often cannot be accurately restored and analyzed without considering the multiple emotions contained in the text (Siriwardhana et al., 2020; Sadr et al., 2019; Liang et al., 2019). Multi-label learning originated from the investigation of text classification problems, where each document may belong to several predefined topics simultaneously. In multi-label learning, the training set is composed of instances each associated with a set of labels, and the task is to predict the label sets of unseen instances by analyzing training instances with known label sets (Zhang and Zhou, 2007). Yang et al. (2014) proposed a small dictionary that considers not only text, but also graphic emoticons and punctuation marks, and used it to classify the multi-label sentiment of a Weibo corpus. This classification achieved a relatively high accuracy rate and played an active role in the analysis of public opinion on the Malaysia Airlines crash. Liu and Chen (2015) combined three emotional dictionaries to extract the emotional features and the original segmentation word features in a microblog corpus and built a multi-label emotion classification method, whose best experimental results had an average accuracy of 65.5%. In response to the imbalance of emotion category distribution in the corpus, Li et al. (2016) adopted a multi-label maximum entropy model to analyze the relationship between words and emotions. The problem remains that iterative corrections cannot be achieved for the emotion classification. These methods thus can hardly keep up with the fast changes of emotional words in the real world.

An improved MLkNN for emotion classification of short texts

Introduction of multi-label classification and the workflow

In machine learning, multi-label classification or multi-output classification is a variant of the classification problem where multiple non-exclusive labels may be assigned to each instance. In the multi-label problem, the labels are non-exclusive and there is no constraint on how many of the classes the instance could be assigned to. The definition of multi-label problems could be specifically explained by the following mathematical notation:

Let X = {x1, x2, ···, xn} represent the example space, L = {L1, L2, ···, Lq} represent the set of all labels and Y = {y1, y2, ···, yn} represent the label space. To complete the multi-label classification, the first task is to obtain the function f: X → Y by learning from the training set {(xi, yi)|1 ≤ i ≤ m}, where xi ∈ X is an example, yi ∈ Y is the category label to which the example xi belongs, and yi is a subset of the label set L. In practical applications, the semantics of real objects or real texts are often not unique, which leads to a multi-label learning framework. In the multi-label framework, each object is represented by an instance with multiple category labels, and the goal of learning is to assign all appropriate labels to that instance.
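As a small illustration of this notation, the sketch below encodes a label subset yi ⊆ L as a 0/1 indicator vector of length q, which is a common way to store multi-label targets. The label names and the helper names `to_indicator` and `from_indicator` are hypothetical and chosen only for this example.

```python
from typing import List, Set

# Hypothetical label set L with q = 4 emotions (illustrative only).
LABELS: List[str] = ["joy", "fear", "anger", "disgust"]


def to_indicator(label_set: Set[str], labels: List[str] = LABELS) -> List[int]:
    """Encode a subset of L as a 0/1 vector of length q."""
    unknown = label_set - set(labels)
    if unknown:
        raise ValueError(f"labels not in L: {sorted(unknown)}")
    return [1 if name in label_set else 0 for name in labels]


def from_indicator(vector: List[int], labels: List[str] = LABELS) -> Set[str]:
    """Decode a 0/1 vector back into the subset of L it represents."""
    return {name for name, bit in zip(labels, vector) if bit == 1}
```

An instance carrying both "joy" and "anger" is then stored as one vector with two entries set to 1, rather than being forced into a single class.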

To solve the multi-label classification problem, the mainstream solutions include three different ways: problem transformation methods, adapted algorithms, and integration methods. Adapted algorithms have attracted the most attention from scholars as more choices are available: K-nearest neighbor (KNN) (Zhang and Zhou, 2007), multi-valued and multi-labeled decision tree (Chou and Hsu, 2005), kernel methods for vector output and neural networks such as BP-MLL (Zhang and Zhou, 2006) could all be applied for a better solution of the multi-label classification problem.

This paper focuses on modifying the MLkNN algorithm through the following steps. First, each short text is divided into sentences, and two kinds of emotion transfer characteristics are learned from the training set: those between adjacent sentences and those between each sentence and the full text of the tweet. Second, the MLkNN multi-label classifier is applied as the base classifier to obtain the initial classification results of the sentences and to calculate the overall emotion classification results of the tweets in the test set. Third, the overall emotion classification results of each tweet are modified by the emotion transfer rates between sentences and within the full text of the tweet. Finally, Average Precision (AVP) is evaluated to decide whether the process proceeds to further adjustment based on label correlation or returns to the adjustments based on emotion transfer. The workflow of the study is shown in Fig. 1.

Fig. 1 Workflow of the study.

This method is more applicable to tweets, which are generally short and colloquial in expression. The in-sentence features alone can hardly support the judgment of a sentence’s emotional category, and deviation occurs for sentences that contain only colloquial vocabulary.

The data supporting the results of this study were drawn from the Sentiment140 corpus, which dates from around 2009 and was therefore not affected by the increase in the length of tweets in 2017. In this corpus, tweets mostly take the same form as the following example. The example contains three short sentences, all of which express “joy”. However, the first sentence is often used on the Internet to express strong emotions, and in many cases negative emotions such as anger or disgust. If only the features of the words in the sentence are considered, the emotion of the first sentence is likely to be classified as “disgust” or “anger”. If the emotional transfer caused by the context of the sentence and the overall mood of the tweet are considered at the same time, the emotion category of the first sentence can be corrected, because the adjacent sentences express joy.

<tweet ID="1">

<sentence ID="1">WTF! </sentence>

<sentence ID="2">This song is groovy!! </sentence>

<sentence ID="3">I love the song soooo much! </sentence>

</tweet>


The suitable way for the emotional classification of a tweet is to consider the emotional transfer relationship of adjacent sentences and the full text of the tweet to reduce the error caused by text ambiguity and to improve the overall classification accuracy.

Base classifier based on MLkNN

The MLkNN algorithm is derived from the traditional K-nearest neighbor (KNN) algorithm. This method finds the label information contained in the K-nearest neighbor instances of the target and, following statistical methods, infers the label set of the target by maximizing the posterior probability (Zhang and Zhou, 2007). When using the MLkNN algorithm for text sentiment classification, the specific calculation method is as follows. Let L be the set of emotional labels. For an emotional label l ∈ L, the events are defined as:

\(Y_l^0\): Text s does not contain the emotion label l.

\(Y_l^1\): Text s contains the emotion label l.

\(H_l^t\), (t ∈ {0, 1, ···, k}): There are exactly t texts containing label l among the K-nearest neighbors N(s) of text s.

According to the maximum posterior criterion, the category vector \(\overrightarrow y _s\left( l \right)\) is defined as follows:

$$\vec y_s\left( l \right) = \arg \mathop {{\max }}\limits_{e \in \left\{ {0,1} \right\}} P\left( {\left. {Y_l^e} \right|H_l^{\vec C_S\left( l \right)}} \right)$$

In Formula 1, \(\vec C_s\left( l \right) = \mathop {\sum}\nolimits_{a \in N\left( s \right)} {{{\overrightarrow {y}}_{a}}\left( l \right)}\) is the number of nearest neighbors containing label l in the K-nearest neighbors of text s. According to the Bayesian theory, Formula 1 could be further rewritten as:

$$\vec y_s\left( l \right) = \arg \mathop {{\max }}\limits_{e \in \left\{ {0,1} \right\}} \frac{{P\left( {Y_l^e} \right)P\left( {H_l^{\vec C_S\left( l \right)}\left| {Y_l^e} \right.} \right)}}{{P\left( {H_l^{\vec C_S\left( l \right)}} \right)}} = \arg \mathop {{\max }}\limits_{e \in \left\{ {0,1} \right\}} P\left( {Y_l^e} \right)P\left( {H_l^{\vec C_S\left( l \right)}\left| {Y_l^e} \right.} \right)$$

In Formula 2, the prior probabilities \(P\left( {Y_l^e} \right)\) and the conditional probabilities \(P\left( {H_l^t\left| {Y_l^e} \right.} \right)\) could be calculated from the training set, as shown in Formulas 3, 4 and 5:

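$$P\left( {Y_l^1} \right) = \left( {s + \mathop {\sum}\nolimits_{i = 1}^m {\vec y_{x_i}\left( l \right)} } \right)/\left( {s \times 2 + m} \right);\,P\left( {Y_l^0} \right) = 1 - P\left( {Y_l^1} \right)$$
$$P\left( {H_l^t\left| {Y_l^1} \right.} \right) = \left( {s + c\left[ t \right]} \right)/\left( {s \times \left( {k + 1} \right) + \mathop {\sum}\nolimits_{p = 0}^k {c\left[ p \right]} } \right)$$
$$P\left( {H_l^t\left| {Y_l^0} \right.} \right) = \left( {s + c^\prime \left[ t \right]} \right)/\left( {s \times \left( {k + 1} \right) + \mathop {\sum}\nolimits_{p = 0}^k {c^\prime \left[ p \right]} } \right)$$

In Formulas 3, 4 and 5, s denotes the smoothing parameter of Zhang and Zhou (2007) rather than the text s, m is the number of training texts, \(\vec y_{x_i}\left( l \right)\) equals 1 if the training text xi contains label l and 0 otherwise, and c[t] (respectively c′[t]) counts the training texts with (respectively without) label l whose K-nearest neighbors contain exactly t texts with label l.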

The algorithm presented in this section is an improvement to the MLkNN classifier. For the initial emotion classification of sentences, the improved algorithm uses the in-sentence features as the initial features and combines the MLkNN algorithm to construct the sentence base classifier. The initial result of the emotional classification of the sentence in a tweet is obtained with Formula 6. Both the prior and conditional probabilities of each emotion could be obtained through the training set, as shown in Formulas 3, 4 and 5.

$$\vec y_a\left( l \right) = \arg \mathop {{\max }}\limits_{e \in \left\{ {0,1} \right\}} P\left( {Y_l^e} \right)P\left( {H_l^{\vec C_S\left( l \right)}\left| {Y_l^e} \right.} \right)$$
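The base classifier described in this section can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: it assumes dense numeric feature vectors, Euclidean distance, and the Laplace smoothing of Zhang and Zhou (2007) with normalizers s × 2 + m and s × (k + 1). The function names `k_nearest`, `fit_mlknn` and `predict_mlknn` and the parameter `smooth` (the smoothing parameter s) are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Model = Tuple[List[float], List[List[int]], List[List[int]]]


def k_nearest(x: Sequence[float], pool: Sequence[Sequence[float]], k: int) -> List[int]:
    """Indices of the k points in `pool` closest to x (Euclidean distance)."""
    order = sorted(range(len(pool)), key=lambda i: math.dist(x, pool[i]))
    return order[:k]


def fit_mlknn(X: Sequence[Sequence[float]], Y: Sequence[Sequence[int]],
              k: int, smooth: float = 1.0) -> Model:
    """Estimate the priors P(Y_l^1) and the neighbor counts c[t], c'[t]."""
    m, q = len(X), len(Y[0])
    prior1 = [(smooth + sum(y[lab] for y in Y)) / (2 * smooth + m) for lab in range(q)]
    c1 = [[0] * (k + 1) for _ in range(q)]  # c[t]: training texts that have the label
    c0 = [[0] * (k + 1) for _ in range(q)]  # c'[t]: training texts that lack the label
    for i in range(m):
        others = [j for j in range(m) if j != i]  # a text is not its own neighbor
        neigh = [others[n] for n in k_nearest(X[i], [X[j] for j in others], k)]
        for lab in range(q):
            t = sum(Y[j][lab] for j in neigh)
            (c1 if Y[i][lab] == 1 else c0)[lab][t] += 1
    return prior1, c1, c0


def predict_mlknn(x: Sequence[float], X: Sequence[Sequence[float]],
                  Y: Sequence[Sequence[int]], k: int, model: Model,
                  smooth: float = 1.0) -> List[int]:
    """Maximum a posteriori decision for every label of an unseen text x."""
    prior1, c1, c0 = model
    neigh = k_nearest(x, X, k)
    result = []
    for lab in range(len(prior1)):
        t = sum(Y[j][lab] for j in neigh)  # neighbors of x that carry the label
        like1 = (smooth + c1[lab][t]) / (smooth * (k + 1) + sum(c1[lab]))
        like0 = (smooth + c0[lab][t]) / (smooth * (k + 1) + sum(c0[lab]))
        result.append(1 if prior1[lab] * like1 > (1 - prior1[lab]) * like0 else 0)
    return result
```

Each label is decided independently here, which is exactly the limitation that the later adjustments with emotion transfer and label correlation are meant to address.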

Adjustment with the probability of emotional transfer between sentences

After the initial classification results of sentences are obtained by the MLkNN classifier, the classification accuracy of sentences is not yet ideal because only the in-sentence features are considered. This study holds that the emotion category of a sentence can be further corrected by using the emotion transfer probability of adjacent sentences. We denote the sentence preceding sentence s as SP and the sentence following sentence s as SN. Possible events are defined as follows:

\(P_\varepsilon ^1\): In the previous sentence of the sentence s, SP, the emotion is ε.

\(P_\varepsilon ^0\): In the previous sentence of the sentence s, SP, the emotion is not ε.

\(N_\varepsilon ^1\): In the next sentence of the sentence s, SN, the emotion is ε.

\(N_\varepsilon ^0\): In the next sentence of the sentence s, SN, the emotion is not ε.

Assuming that the emotional transfers between any two adjacent sentences in the tweet are independent of each other, \(\overrightarrow y _s\left( l \right)\) is defined in Formula 7:

$$\begin{array}{ccccc}\\ \vec y_s\left( l \right) = \arg \mathop {{\max }}\limits_{e \in \left\{ {0,1} \right\}} P\left( {\left. {Y_l^e} \right|H_l^{\vec C_S\left( l \right)},\,P_\varepsilon ^{\vec y\left( \varepsilon \right)} \cdots ,\,N_\varepsilon ^{\vec y\left( \varepsilon \right)} \cdots } \right) \\ = \arg \mathop {{\max }}\limits_{e \in \left\{ {0,1} \right\}} P\left( {Y_l^e} \right)P\left( {H_l^{\vec C_S\left( l \right)}\left| {Y_l^e} \right.} \right) \cdot \mathop {\prod}\nolimits_{\varepsilon \in L} {P\left( {P_\varepsilon ^{\vec y\left( \varepsilon \right)}\left| {Y_l^e} \right.} \right) \cdot P\left( {N_\varepsilon ^{\vec y\left( \varepsilon \right)}\left| {Y_l^e} \right.} \right)} \\ \end{array}$$

The calculation with Formula 7 requires the emotion transfer probability between adjacent sentences. To obtain it, the transition probability from emotion ε to emotion l is calculated from the sample set according to Formula 8, in which the emotional transition probabilities with respect to the previous and the next sentence are calculated separately.

$$\begin{array}{l}P\left( {P_\varepsilon ^1\left| {Y_l^1} \right.} \right) = p\left( {\varepsilon \to l} \right) = \frac{{count\left( {\vec y_{sp}\left( \varepsilon \right) = 1,\,\vec y_s\left( l \right) = 1} \right)}}{{count\left( {\vec y_s\left( l \right) = 1} \right)}}\\ P\left( {N_\varepsilon ^1|Y_l^1} \right) = p\left( {\varepsilon \to l} \right) = \frac{{count\left( {\vec y_{sn}\left( \varepsilon \right) = 1,\,\vec y_s\left( l \right) = 1} \right)}}{{count\left( {\vec y_s\left( l \right) = 1} \right)}}\end{array}$$
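The counts in Formula 8 can be illustrated with a short sketch. It assumes, as in the annotated corpus described later, that each sentence carries a single emotion label, and it represents a tweet as a list of sentence labels in reading order. The denominator counts every sentence labeled l, as Formula 8 is written. The function name `transfer_probabilities` is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (emotion of the neighboring sentence, emotion of s)


def transfer_probabilities(
    tweets: List[List[str]],
) -> Tuple[Dict[Pair, float], Dict[Pair, float]]:
    """Estimate P(P_eps^1 | Y_l^1) and P(N_eps^1 | Y_l^1) by counting."""
    prev_pairs: Counter = Counter()
    next_pairs: Counter = Counter()
    totals: Counter = Counter()  # count(y_s(l) = 1) for each emotion l
    for sentences in tweets:
        for i, label in enumerate(sentences):
            totals[label] += 1
            if i > 0:  # the sentence has a previous neighbor SP
                prev_pairs[(sentences[i - 1], label)] += 1
            if i + 1 < len(sentences):  # the sentence has a next neighbor SN
                next_pairs[(sentences[i + 1], label)] += 1
    p_prev = {pair: n / totals[pair[1]] for pair, n in prev_pairs.items()}
    p_next = {pair: n / totals[pair[1]] for pair, n in next_pairs.items()}
    return p_prev, p_next
```

Pairs that never occur in the training set are simply absent from the returned dictionaries; how such zero counts should be smoothed is not specified in the text and is left open here.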

Adjustment with the probability of emotional transfer in a full text of the tweet

Similar to the calculation method of the emotion transfer probability between adjacent sentences, the emotion transition probability can be calculated for the overall emotion of the tweet and the emotion of the sentence. Assuming that w is the tweet where the sentence s is located, the possible event is defined as follows:

\(W_\varepsilon ^1\): The emotion of the tweet w, in which the sentence s is located, contains ε.

\(W_\varepsilon ^0\): The emotion of the tweet w, in which the sentence s is located, does not contain ε.

Assuming that the emotional transformations between the full text of the tweet and each sentence are independent events, \(\overrightarrow y _s\left( l \right)\) could be defined with Formula 9.

$$\begin{array}{ll}\\ \vec y_s\left( l \right) = \arg \mathop {{\max }}\limits_{e \in \left\{ {0,1} \right\}} P\left( {\left. {Y_l^e} \right|H_l^{\vec C_S\left( l \right)},\,W_\varepsilon ^{\vec y\left( \varepsilon \right)} \cdots } \right) \\ \qquad\,\,= \arg \mathop {{\max }}\limits_{e \in \left\{ {0,1} \right\}} P\left( {Y_l^e} \right)P\left( {H_l^{\vec C_S\left( l \right)}\left| {Y_l^e} \right.} \right) \cdot \mathop {\prod}\nolimits_{\varepsilon \in L} {P\left( {W_\varepsilon ^{\vec y\left( \varepsilon \right)}\left| {Y_l^e} \right.} \right)} \end{array}$$

The emotion transfer probability from ε to l could be calculated by Formula 10:

$$p\left( {\varepsilon \to l} \right) = \frac{{{\mathrm{count}}\left[ {\vec y_w\left( \varepsilon \right) = 1,\,\vec y_s\left( l \right) = 1} \right]}}{{{\mathrm{count}}\left[ {\vec y_s\left( l \right) = 1} \right]}}$$
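To make the effect of Formula 9 concrete, the sketch below multiplies the base score P(Y_l^e)·P(H | Y_l^e) by the tweet-level evidence and picks the larger of the two products. All inputs are assumed to be estimated beforehand; the function name `adjusted_label` and the numbers in the usage note are hypothetical, and P(W_ε^0 | Y_l^e) is taken as 1 − P(W_ε^1 | Y_l^e).

```python
from typing import Dict, Set


def adjusted_label(
    base_score: Dict[int, float],
    p_w_given_y: Dict[int, Dict[str, float]],
    tweet_emotions: Set[str],
) -> int:
    """Decide whether sentence s carries label l after tweet-level adjustment.

    base_score[e]       = P(Y_l^e) * P(H | Y_l^e) from the base classifier
    p_w_given_y[e][eps] = P(W_eps^1 | Y_l^e), estimated from the training set
    tweet_emotions      = emotions currently assigned to the whole tweet w
    """
    scores = {}
    for e in (0, 1):
        score = base_score[e]
        for eps, p_present in p_w_given_y[e].items():
            # Use P(W^1 | Y) if the tweet contains eps, otherwise its complement.
            score *= p_present if eps in tweet_emotions else 1.0 - p_present
        scores[e] = score
    return 1 if scores[1] > scores[0] else 0
```

For instance, with `base_score = {0: 0.30, 1: 0.20}` the base classifier alone would reject "joy" for a sentence such as "WTF!", but if the tweet as a whole is labeled "joy" and joy sentences mostly occur in joy tweets, the adjusted decision can flip to 1.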

Adjustment to MLkNN

The classification model based on MLkNN assumes that there is no relationship between the multiple labels of the same text and does not consider how the relationship between labels affects each instance. To improve on that, the label correlation of the instance itself is taken as a related factor of multi-label emotion classification to further enhance the accuracy. First, we investigate the correlation of multiple labels of each instance in the training set. The existing solution strategies to investigate the relevance of labels could be roughly divided into three categories according to the complexity of calculation (Zhang and Zhang, 2010; Yang et al., 2019):

(1) The first-order strategy examines each label in turn and decomposes the multi-label learning problem into an independent binary classification problem. This method is easy to implement, but difficult to generalize.

(2) The second-order strategy examines each case of label pairwise combination. This method takes into account the correlation between labels but does not include all inclusion cases of labels.

(3) High-order strategy, which investigates the high-order correlation between labels, is more comprehensive than the above two. However, it also brings high computational complexity, which is difficult to be applied to large-scale learning problems.

Considering that the multiple sentiment labels of a text are often correlated, this study adopts the second-order strategy. For the label set L = {L1, L2, ···, Lq}, the total number of possible pairwise combinations is q(q−1)/2. The expanded label set Lnew therefore has size |Lnew| = q + q(q−1)/2 = q(q+1)/2. For any instance in the training set, the corresponding 0/1 labels for L1 to Lq remain the same as before expansion. The labels from \(L_{q+1}\) to \(L_{q(q+1)/2}\) are assigned as follows: a pair label is set to 1 for an instance that carries both labels Li and Lj (1 ≤ i ≤ q, 1 ≤ j ≤ q, i ≠ j), and to 0 otherwise.
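The label expansion of the second-order strategy can be sketched in a few lines. The ordering of the pair labels (lexicographic over index pairs) and the function name `expand_pairwise` are assumptions made for this illustration; the text does not fix an ordering.

```python
from itertools import combinations
from typing import List


def expand_pairwise(y: List[int]) -> List[int]:
    """Extend a 0/1 vector of q labels to q(q+1)/2 labels.

    The first q entries are unchanged. Each further entry corresponds to one
    unordered pair (Li, Lj) and is 1 only if the instance carries both labels.
    """
    q = len(y)
    pair_labels = [
        1 if y[i] == 1 and y[j] == 1 else 0
        for i, j in combinations(range(q), 2)
    ]
    return list(y) + pair_labels
```

With q = 3 and an instance labeled L1 and L3, the call `expand_pairwise([1, 0, 1])` keeps the three original entries and appends one entry per pair (L1, L2), (L1, L3), (L2, L3).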

Word co-occurrence patterns proposed by Baeza-Yates and Ribeiro-Neto (1999) are applied as a measure of emotion correlation between labels for automatic local analysis. The co-occurrence matrix M composed by the emotion labels and instances is shown in Table 1.

Table 1 Co-occurrence matrix M.

In the table, xi(1 ≤ i ≤ n) represents the i-th sample in the training set and ωj(1 ≤ j ≤ q) represents all data from row j in the matrix M. By normalizing them, the common co-occurrence frequency of the labels Lu and Lv is shown in Formulas 11 and 12:

$$C_{\omega _u,\omega _v} = \mathop {\sum}\limits_{i = 1}^n {M_{ui} \times M_{vi}}$$
$$S_{\omega _u,\omega _v} = \frac{{C_{\omega _u,\omega _v}}}{{C_{\omega _u,\omega _u} + C_{\omega _v,\omega _v} - C_{\omega _u,\omega _v}}}$$

Here, Sωu,ωv indicates the co-occurrence frequency of ωu and ωv and forms a symmetric matrix Q1ij whose diagonal elements are all 1. YY1(i) and YY0(i) are obtained with Formulas 13 and 14, in which PHY1 and PHY0 are obtained with Formulas 4 and 5.

$$YY1\left( i \right){{{\mathrm{ = }}}}{\sum} {Q1_{ij} \times PHY1\left( {ij,d} \right)}$$
$$YY0\left( i \right){{{\mathrm{ = }}}}{\sum} {Q0_{ij} \times PHY0\left( {ij,d} \right)}$$

Assuming that a parameter α satisfies 0 ≤ α ≤ 1, the previously obtained YY1(i) and YY0(i) are combined with Y1(i) and Y0(i) calculated in MLkNN. y1(i) and y0(i) are obtained with Formulas 15 and 16 to judge whether the test instance contains the label Li.

$$y1\left( i \right){{{\mathrm{ = }}}}\alpha \times Y1\left( i \right) + \left( {1 - \alpha } \right) \times YY1\left( i \right)$$
$$y0\left( i \right){{{\mathrm{ = }}}}\alpha \times Y0\left( i \right) + \left( {1 - \alpha } \right) \times YY0\left( i \right)$$
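Formulas 11, 12, 15 and 16 can be illustrated together. The sketch below computes the normalized co-occurrence matrix S from a 0/1 label-by-sample matrix M and then blends a base score with a correlation-based score using α. The function names `cooccurrence_similarity` and `blend` are hypothetical, and the handling of a label that never occurs (zero denominator) is an assumption, since the text does not cover that case.

```python
from typing import List


def cooccurrence_similarity(M: List[List[int]]) -> List[List[float]]:
    """S[u][v] = C[u][v] / (C[u][u] + C[v][v] - C[u][v]), with C = M M^T.

    M[u][i] is 1 if training sample i carries label u and 0 otherwise.
    """
    q = len(M)
    C = [[sum(a * b for a, b in zip(M[u], M[v])) for v in range(q)] for u in range(q)]
    S = [[0.0] * q for _ in range(q)]
    for u in range(q):
        for v in range(q):
            denom = C[u][u] + C[v][v] - C[u][v]
            if denom == 0:
                # Assumption: a label that never occurs is only similar to itself.
                S[u][v] = 1.0 if u == v else 0.0
            else:
                S[u][v] = C[u][v] / denom
    return S


def blend(alpha: float, base: List[float], corr: List[float]) -> List[float]:
    """alpha * Y(i) + (1 - alpha) * YY(i), applied element-wise over labels."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * b + (1.0 - alpha) * c for b, c in zip(base, corr)]
```

Setting α = 1 recovers the plain MLkNN scores, while smaller values give more weight to the label correlation.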

The sentence-level emotion classification results obtained in section “Adjustment with the probability of emotional transfer in a full text of the tweet” are aggregated according to Formula 17 to obtain the tweet-level emotion classification results, which are then used as the basic classification results in the improved algorithm for further improvement.

$$\vec y_w\left( l \right) = \frac{1}{p}\mathop {\sum}\limits_{i = 1}^p {\vec y_{s_i}\left( l \right)}$$

Here, p is the total number of sentences in the tweet w and \(s_i\) is its i-th sentence.
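The averaging in Formula 17 is simple enough to show directly. The sketch assumes that each sentence result is a 0/1 vector over the same q labels; the function name `tweet_level_scores` is hypothetical, and how the averaged scores are turned into final tweet labels is not specified here.

```python
from typing import List


def tweet_level_scores(sentence_vectors: List[List[int]]) -> List[float]:
    """Average the sentence-level label vectors of one tweet, label by label."""
    p = len(sentence_vectors)
    if p == 0:
        raise ValueError("a tweet must contain at least one sentence")
    q = len(sentence_vectors[0])
    return [sum(vec[lab] for vec in sentence_vectors) / p for lab in range(q)]
```

For a tweet with three sentences of which two are labeled with the first emotion and one with the second, the first label receives a score of 2/3 and the second a score of 1/3.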

Experimental study

Experimental settings

The Sentiment140 Twitter corpus contains 1,600,000 tweets extracted using the Twitter API (Go et al., 2009). In this study, 8000 Twitter texts were randomly selected from Sentiment140 for annotation. After filtering the meaningless text, 6500 Twitter texts were finally retained, containing a total of 11,338 sentences, which cover a wide range of content and have the common text characteristics of short Internet texts. Each complete tweet instance contains up to two emotion labels, and each sentence in the instance contains up to one emotion label. According to the experimental needs, the dataset is divided into two parts: training set and test set, with a corpus ratio of 7:3. The training set contains 4500 tweet data and 7779 sentences, and the test set contains 2000 tweet data and 3559 sentences.

The multi-label classification evaluation indicators are mainly divided into two categories: sample-based and label-based indicators. The sample-based indicators evaluate the result on each sample and then average over the samples. The label-based indicators evaluate the performance of a single label on all samples and then average over the labels. The experiments in this section mainly evaluate the performance of multi-label classification with the sample-based indicators, which include Subset Accuracy (SA), Hamming Loss (HL), One-Error (OE), Ranking Loss (RL), Average Precision (AVP), Accuracy (AC), Precision (PR), Recall (RE) and F-score. Specifically, SA measures the proportion of samples whose predicted label set exactly matches the real label set. HL measures the proportion of misclassified labels. OE is the proportion of samples whose top-ranked predicted label is not among the real labels. RL measures how often an irrelevant label is ranked above a relevant one. AVP measures, for each relevant label, the proportion of labels ranked at or above it that are also relevant. Accuracy, Precision, Recall and F-score extend accuracy, precision, recall and F-value from the single-label classification task. The calculation methods are introduced in turn:

$$SA = \frac{1}{p}\mathop {\sum}\limits_{i = 1}^p {1\left\{ {h\left( {x^i} \right) = y^i} \right\}}$$

p is the number of samples in the test set, and 1{π} returns 1 when π is true and 0 otherwise.

$$HL = \frac{1}{p}\mathop {\sum}\limits_{i = 1}^p {\frac{1}{q}\left| {h\left( {x^i} \right)\Delta y^i} \right|}$$

q is the total number of labels, Δ denotes the symmetric difference (exclusive OR) of two sets, y^i is the set of actual labels of sample i, and h(x^i) is the set of predicted labels of sample i.
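As a small worked illustration of SA and HL, the following sketch applies the two definitions to four made-up samples. The function names, the emotion labels and the value q = 5 are assumptions chosen for the example and are not taken from the annotated corpus.

```python
def subset_accuracy(predicted, actual):
    """Fraction of samples whose predicted label set equals the actual set."""
    return sum(h == y for h, y in zip(predicted, actual)) / len(actual)

def hamming_loss(predicted, actual, q):
    """Mean size of the symmetric difference, normalised by the label count q."""
    return sum(len(h ^ y) for h, y in zip(predicted, actual)) / (q * len(actual))

# Hypothetical label sets for four samples over five possible labels.
actual = [{"joy"}, {"fear", "anger"}, {"sadness"}, {"joy", "surprise"}]
predicted = [{"joy"}, {"fear"}, {"joy"}, {"joy", "surprise"}]

print(subset_accuracy(predicted, actual))    # 0.5  (samples 1 and 4 match exactly)
print(hamming_loss(predicted, actual, q=5))  # 0.15 (3 wrong labels out of 20)
```

The second sample shows why SA is the stricter indicator: a partially correct prediction counts as a complete miss for SA, but contributes only one wrong label to HL.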

$$OE = \frac{1}{p}\mathop {\sum}\limits_{i = 1}^p {1\left\{ {\left[ {\arg \mathop {{\max }}\limits_{y_j \in Y} f\left( {x^i,y_j} \right)} \right] \notin y^i} \right\}}$$
$$RL = \frac{1}{p}\mathop {\sum}\limits_{i = 1}^p {\frac{1}{{\left| {y^i} \right|\left| {\overline {y^i} } \right|}}} \left| {\left\{ \begin{array}{l}\left. {\left( {y_{j1},\,y_{j2}} \right)} \right|f\left( {x^i,y_{j1}} \right) \le f\left( {x^i,\,y_{j2}} \right),\\ \left( {y_{j1},y_{j2}} \right) \in \left( {y^i \times \overline {y^i} } \right)\end{array} \right\}} \right|$$
$$AVP = \frac{1}{p}\mathop {\sum}\limits_{i = 1}^p {\frac{1}{{\left| {y^i} \right|}}} \mathop {\sum}\limits_{y_{j1} \in y^i} {\frac{{\left| {\left\{ {\left. {y_{j2}} \right|rank_f\left( {x^i,\,y_{j2}} \right) \le rank_f\left( {x^i,\,y_{j1}} \right),y_{j2} \in y^i} \right\}} \right|}}{{rank_f\left( {x^i,\,y_{j1}} \right)}}}$$
$$AC = \frac{1}{p}\mathop {\sum}\limits_{i = 1}^p {\frac{{\left| {h\left( {x^i} \right) \cap y^i} \right|}}{{\left| {h\left( {x^i} \right) \cup y^i} \right|}}}$$
$$PR = \frac{1}{p}\mathop {\sum}\limits_{i = 1}^p {\frac{{\left| {h\left( {x^i} \right) \cap y^i} \right|}}{{\left| {h\left( {x^i} \right)} \right|}}}$$
$$RE = \frac{1}{p}\mathop {\sum}\limits_{i = 1}^p {\frac{{\left| {h\left( {x^i} \right) \cap y^i} \right|}}{{\left| {y^i} \right|}}}$$
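$$F{\mathrm{ - }}score = \frac{{\left( {1 + \beta ^2} \right) \cdot PR \cdot RE}}{{\beta ^2 \cdot PR + RE}}$$

Here f(x^i, y_j) is the real-valued confidence that label y_j belongs to sample x^i, rank_f(x^i, y_j) is the rank of y_j when the labels are sorted by descending f, \overline{y^i} is the complement of y^i in the label set Y, and β weights recall relative to precision (β = 1 gives the F1 value).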
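The ranking-based and set-based indicators can be illustrated with a short sketch. It assumes that every sample has at least one relevant and one irrelevant label and a non-empty prediction, it does not treat tied scores specially, and it computes the F-score in the standard F_β form (for β = 1, the harmonic mean of PR and RE). All names and data are hypothetical.

```python
def rank_of(scores, label):
    """Rank 1 is the highest-scoring label."""
    return 1 + sum(s > scores[label] for s in scores.values())

def one_error(score_rows, actual):
    """Share of samples whose top-scored label is not an actual label."""
    return sum(max(f, key=f.get) not in y for f, y in zip(score_rows, actual)) / len(actual)

def ranking_loss(score_rows, actual):
    """Share of (relevant, irrelevant) pairs that are ordered wrongly."""
    total = 0.0
    for f, y in zip(score_rows, actual):
        irrelevant = set(f) - y
        wrong = sum(f[a] <= f[b] for a in y for b in irrelevant)
        total += wrong / (len(y) * len(irrelevant))
    return total / len(actual)

def average_precision(score_rows, actual):
    """For each relevant label, the share of relevant labels at or above its rank."""
    total = 0.0
    for f, y in zip(score_rows, actual):
        per_sample = 0.0
        for a in y:
            r = rank_of(f, a)
            per_sample += sum(rank_of(f, b) <= r for b in y) / r
        total += per_sample / len(y)
    return total / len(actual)

def set_metrics(predicted, actual, beta=1.0):
    """Sample-averaged AC, PR, RE and the F-score derived from PR and RE."""
    p = len(actual)
    ac = sum(len(h & y) / len(h | y) for h, y in zip(predicted, actual)) / p
    prec = sum(len(h & y) / len(h) for h, y in zip(predicted, actual)) / p
    rec = sum(len(h & y) / len(y) for h, y in zip(predicted, actual)) / p
    f_score = (1 + beta ** 2) * prec * rec / (beta ** 2 * prec + rec)
    return ac, prec, rec, f_score

# Two hypothetical samples over three labels.
actual = [{"joy"}, {"fear", "anger"}]
score_rows = [
    {"joy": 0.8, "fear": 0.3, "anger": 0.1},
    {"joy": 0.7, "fear": 0.6, "anger": 0.2},
]
predicted = [{"joy"}, {"joy", "fear"}]

print(one_error(score_rows, actual))                    # 0.5
print(ranking_loss(score_rows, actual))                 # 0.5
print(round(average_precision(score_rows, actual), 3))  # 0.792
print([round(v, 3) for v in set_metrics(predicted, actual)])  # [0.667, 0.75, 0.75, 0.75]
```

In the second sample the irrelevant label "joy" outranks both relevant labels, so this sample alone contributes an error of 1 to both OE and RL.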

To test the effectiveness of the algorithm, three sets of experiments were designed in this section to evaluate the performance of the sentiment analysis algorithm.

In the first group of experiments, the unary grammar (unigram) features of words were used as the main features. Experiments were conducted with three combinations: unary grammar features plus adjacent-sentence features, unary grammar features plus the features of the full text of the tweet, and unary grammar features plus both the adjacent-sentence features and the full-text features. Based on the data characteristics and empirical K-value selection (Bansal et al., 2022), K = 5 was set as the benchmark, and values on both sides of it were tested. According to the test results, K = 5 and K = 8 were selected as the parameter values for the further experiments.

In the second group of experiments, the MLkNN classifier uses the combined unary grammar (unigram) and binary grammar (bigram) features, and three settings are evaluated: “unary + binary + adjacent sentences”, “unary + binary + full text” and “unary + binary + adjacent sentences + full text”, the last of which combines the adjacent-sentence features with the features of the full text of the tweet. The number of nearest neighbors K is set to 5 and 8, respectively. The results given in the table are those obtained when the average accuracy has converged over the iterations. This study refers to the method following the above procedure as S-MLkNN.
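For readers unfamiliar with the terminology, the combined unary and binary grammar features correspond to word unigram and bigram counts. The following minimal sketch assumes whitespace tokenisation and lower-casing, neither of which is specified in this study. The encoding of the adjacent-sentence and full-text features is not described here and is therefore left out.

```python
from collections import Counter

def ngram_features(sentence, n_values=(1, 2)):
    """Count word n-grams for each n in n_values (unigrams and bigrams by default)."""
    tokens = sentence.lower().split()
    features = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features

features = ngram_features("I love this phone")
print(sorted(features))
# ['i', 'i love', 'love', 'love this', 'phone', 'this', 'this phone']
```

Calling `ngram_features(sentence, n_values=(1,))` gives the unigram-only setting of the first group of experiments.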

In the third group of experiments, the MLkNN classifier combining unary grammar features and binary grammar features is selected as the initial classifier of sentences, and the overall emotion of a tweet is classified by combining the adjacent sentences and text features. On this basis, the overall emotion label of a tweet is further modified through label correlation. Different nearest neighbor numbers K and different α parameters of the training set are selected to obtain the corresponding emotion transfer matrix. This study labels the method with the above procedure as L-MLkNN.
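The exact correction rule is not spelled out in this section, so the following is only one plausible way to picture a label-correlation correction weighted by α. It assumes that the correlation is estimated as a conditional co-occurrence probability on the training labels and that the corrected score is a convex combination of the base score and the support from correlated labels. The function names and the data are hypothetical and do not reproduce the emotion transfer matrix of this study.

```python
def cooccurrence_matrix(label_sets, labels):
    """Estimate P(b | a): the share of training tweets labelled a that also carry b."""
    matrix = {a: {b: 0.0 for b in labels} for a in labels}
    for a in labels:
        with_a = [y for y in label_sets if a in y]
        for b in labels:
            if a != b and with_a:
                matrix[a][b] = sum(b in y for y in with_a) / len(with_a)
    return matrix

def correct_scores(base, matrix, alpha):
    """Blend each base score with the support it receives from correlated labels."""
    return {
        b: alpha * base[b] + (1 - alpha) * sum(base[a] * matrix[a][b] for a in base)
        for b in base
    }

labels = ["joy", "surprise", "fear"]
train_labels = [{"joy", "surprise"}, {"joy"}, {"fear"}, {"joy", "surprise"}]
matrix = cooccurrence_matrix(train_labels, labels)

base = {"joy": 0.9, "surprise": 0.4, "fear": 0.1}
corrected = correct_scores(base, matrix, alpha=0.7)
print({k: round(v, 2) for k, v in corrected.items()})
# {'joy': 0.75, 'surprise': 0.46, 'fear': 0.07}
```

In this toy case the weak "surprise" score is raised because "surprise" often co-occurs with the confidently predicted "joy". When few training tweets carry two labels, most entries of the matrix stay at zero and the correction has little to work with.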


Emotion classification with basic MLkNN

As shown in Table 2, the average accuracy reaches only about 42% if just unary grammar features are used in the MLkNN classifier. Using the emotion transfer features of the adjacent sentences or of the whole tweet helps to improve the accuracy. When the value of K is 5, the accuracy rate increases by about 13%, and it is further improved by 15% when the value of K is 8. Regrettably, the effect is still not pronounced compared with the classification results of the baseline system. This is mainly due to the inaccurate initial classification, which has a relatively strong impact on the final classification result. It also indicates that when the classification accuracy of sentences is high, the accuracy of emotion classification can be improved further by considering the emotional transfer of the text through the features of the adjacent sentences and the whole text.

Table 2 Experimental results of emotion classification with basic MLkNN.

Emotion classification with S-MLkNN

The second experiment shows that adding binary grammar features yields better initial classification results than unary grammar features alone, as shown in Table 3. In this group of experiments, however, the choice of K has little effect on the results.

Table 3 Emotion analysis results of S-MLkNN.

Emotion classification with L-MLkNN

Table 4 shows the indicators after the emotion labels are corrected according to the relevance of the labels. The effect of α on the final emotion analysis result depends on the combination of base classifiers and the number of neighbors. When the value of α is 0.7, a higher accuracy is obtained, and HL also reaches its lowest level across the groups of experiments. In addition, the value of K has a strong impact on the results. When K is 8, better HL and OE values are achieved since more neighboring label sets are considered. However, the corresponding training time is longer and the training cost is higher.

Table 4 Emotion analysis results of L-MLkNN.


This study compares the performance of the different algorithms on the Twitter dataset. If the classifier predicts more than two labels for a text, only the two top-ranked emotion labels are kept as the emotion labels of the text; that is, every text is assigned at most two emotion labels. Table 5 shows the performance when the value of K in the MLkNN base classifier is 8 and the value of α is 0.7. The results in Table 5 show that the improved L-MLkNN algorithm outperforms the other methods in overall performance, with the advantage in RE being relatively more pronounced.

Table 5 Comparison of the performance of different algorithms.
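The rule of keeping at most two emotion labels per text can be sketched as follows. The decision threshold of 0.5 and the function name `cap_labels` are assumptions made for the example. Only the cap of two labels comes from the text.

```python
def cap_labels(scores, threshold=0.5, max_labels=2):
    """Keep labels scoring above the threshold, at most max_labels of them, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    chosen = [label for label in ranked if scores[label] > threshold]
    return chosen[:max_labels]

# Three labels pass the assumed threshold, but only the top two are kept.
scores = {"joy": 0.9, "surprise": 0.7, "fear": 0.6, "anger": 0.2}
print(cap_labels(scores))  # ['joy', 'surprise']
```

A text with a single label above the threshold keeps that single label, so the cap affects only predictions with three or more labels.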

The experimental results show that the improved MLkNN algorithm can effectively boost the accuracy of emotion classification of short texts. After correction according to the relevance of labels, the classification of the multi-label samples in the sample set is significantly improved. However, most of the Twitter samples in the corpus are still single-labeled. The limited proportion of multi-label samples in the training set leads to a small co-occurrence probability matrix, so the performance improvement of the classifier is not pronounced for the corpus as a whole.

It is worth noting how the three methods perform on texts of different lengths. To show this, the test set is grouped according to the length of the texts (for each group, K is 8 and α is 0.7), and the corresponding F1 values are shown in Fig. 2. MLkNN is the classification result of the base classifier, S-MLkNN corrects it using the adjacent-sentence and full-text features, and L-MLkNN corrects the classification result based on the label correlation. Figure 2 shows that the performance of the three methods differs little on short texts. On long texts, the improved algorithms perform significantly better than the base classifier.

Fig. 2 Performance of classification for texts with different numbers of words (F1 value).
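A per-length comparison of this kind can be produced by binning the test samples by word count and computing the sample-based F1 value within each bin. The sketch below is illustrative only: the bin edges, the function name and the samples are hypothetical, and it assumes non-empty predicted and actual label sets.

```python
from collections import defaultdict

def f1_by_length(samples, bin_edges):
    """samples: (word_count, predicted_set, actual_set) triples; returns F1 per length bin."""
    groups = defaultdict(list)
    for n_words, h, y in samples:
        # The bin index is the number of bin edges that the word count exceeds.
        bin_index = sum(n_words > edge for edge in bin_edges)
        groups[bin_index].append((h, y))
    result = {}
    for bin_index, pairs in sorted(groups.items()):
        prec = sum(len(h & y) / len(h) for h, y in pairs) / len(pairs)
        rec = sum(len(h & y) / len(y) for h, y in pairs) / len(pairs)
        result[bin_index] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return result

# Hypothetical samples; bins are 10 words or fewer, 11 to 20 words, more than 20 words.
samples = [
    (6, {"joy"}, {"joy"}),
    (8, {"fear"}, {"joy"}),
    (25, {"joy", "fear"}, {"joy"}),
    (30, {"sadness"}, {"sadness"}),
]
per_bin = f1_by_length(samples, bin_edges=(10, 20))
print({k: round(v, 3) for k, v in per_bin.items()})  # {0: 0.5, 2: 0.857}
```

Running the same function on the predictions of each method gives one F1 curve per method over the length bins.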

The experimental verification on the Twitter corpus shows that, on the same dataset, the method proposed in this study achieves better classification results than the traditional multi-label classifier. However, the method still relies heavily on the labeling of the training set; from a machine-learning perspective, it is supervised learning. When the accuracy of the initial sentence classification is not high, the classification results might be disappointing. How to obtain efficient classifiers with smaller training samples and lower training costs, and how such methods perform under semi-supervised or even unsupervised learning, are worth investigating further.

It has to be acknowledged that the small sample used in this study may not be representative of the entire dataset. The model’s ability to generalize to new data may be limited, as a small dataset may not contain enough variation, and potential sampling bias may exist. If the results of this research are applied rashly to large datasets, the model may not be robust enough when encountering outliers or unexpected data. Future studies should pay more attention to balancing the efficiency of a model trained on smaller samples against its completeness in covering various scenarios.


The process of computationally identifying and categorizing opinions expressed in a piece of text is important for providing better understanding and services to online users. By considering not only individual in-sentence features but also the features of the adjacent sentences and the full text of the tweet, this study adjusts the MLkNN classifier to allow iterative corrections of the multi-label emotion classification and applies the new method to improve both the accuracy and speed of the emotion classification for short texts on Twitter. In addition to the adjustments based on the emotion transfers, this study takes the correlation between multiple emotion labels into consideration and iteratively updates the overall classification results. By carrying out three groups of experiments on the Twitter corpus, this study compares the performance of the base MLkNN classifier, the sample-based MLkNN (S-MLkNN) and the label-based MLkNN (L-MLkNN). The experiments show the best performance when the value of K in the MLkNN base classifier is 8 and the value of α is 0.7.

This study is an attempt to obtain an efficient classifier faster and more accurately. The method also works, and even performs better, for long texts. However, further work is still needed to improve semi-supervised or unsupervised learning algorithms and to achieve better performance with smaller training samples and lower training costs.