A hybrid attention and dilated convolution framework for entity and relation extraction and mining

Mining entities and relations from unstructured text is important for knowledge graph construction and expansion. Recent approaches have achieved promising performance but still suffer from inherent limitations, such as low computational efficiency and redundant relation prediction. In this paper, we propose a novel hybrid attention and dilated convolution network (HADNet), an end-to-end solution for entity and relation extraction and mining. HADNet designs a novel encoder architecture that integrates an attention mechanism, dilated convolutions, and a gated unit to improve computational efficiency, achieving an effective global receptive field while still considering local context. For the decoder, we decompose the task into three phases: relation prediction, entity recognition, and relation determination. We evaluate the proposed model on two public real-world datasets, and the experimental results demonstrate its effectiveness.

• Existing improvements in relation extraction performance mainly rely on pre-trained language models such as bidirectional encoder representations from transformers (BERT) 11, which come at the price of considerable time cost and GPU memory consumption. Designing a computationally efficient solution for entity and relation extraction is therefore meaningful.
• Some triplets contain overlapping entity pairs. Models that focus on the case in which no triplets share entities cannot obtain satisfactory results on such sentences.
To address the above problems, we propose a Hybrid Attention and Dilated convolution Framework (HADNet) in this paper. HADNet designs a novel encoder architecture that combines self-attention and a multi-scale extraction module with dilated convolution and a gated unit. HADNet has the advantage of computational efficiency. It provides a global receptive field while also utilizing the local context, and consequently extracts entities and relations accurately. The contributions of this work are summarized as follows.
The rest of the paper is organized as follows: Section "Related work" gives the related work; the proposed model is introduced in Section "Model training", and experimental evaluations are presented in Section "Experimental studies"; Section "Conclusion" concludes the paper.

Related work
Entity and relation extraction is a fundamental problem in knowledge construction and has attracted extensive research attention during the past decades. Earlier work is usually based on the pipeline approach 1-3. For example, Zelenko et al. 1 propose to use specially devised kernels in conjunction with Support Vector Machine and Voted Perceptron learning algorithms to extract person-affiliation and organization-location relations from text. Chan et al. 3 propose an algorithm that first identifies structures in the text and then identifies the semantic type of the relation with the extracted structures. However, pipeline methods do not consider the correlations between entity recognition and relation extraction.
Recently, deep learning has proven very effective in feature extraction and representation learning 4,12,13. Many deep learning based approaches for entity and relation extraction have been proposed. Existing deep learning approaches usually adopt a CNN to encode sentence semantics, or an RNN and its variant LSTM to model the temporal correlation of words in the sentence. In particular, Zeng et al. 6 use a CNN for relation classification. Xu et al. 7 employ an RNN for relation classification. Lin et al. 14 propose a sentence-level attention method that makes full use of the related information in all sentences and calculates the weighted sum of all sentences. Guo et al. 15 introduce an entity recognition function to obtain entity background knowledge and improve relation extraction performance. Xiao et al. 16 propose a hybrid deep neural network model to jointly extract entities and relations; moreover, the model is capable of filtering noisy data caused by distant supervision. Attention mechanisms and graph neural networks have been popular in recent years 17-19. Xiao et al. 20 propose an attention-based transformer block model for distantly supervised relation extraction. The model achieves richer vector representations for each sentence and better addresses the wrong labeling problem. Zheng et al. 21 propose a weighted relative position transformer encoder to flexibly capture the semantic relationship between entities.
Advances in relation extraction enable its application in many domains, such as recommendation systems and finding potential therapeutic targets in diseases 19,22,23. However, due to the influence of their structure, these approaches may not obtain satisfactory results. For example, a CNN cannot encode temporal information between words in each pair of sentences, while an RNN greatly prolongs the training time of the model because words must be processed sequentially, which makes it very difficult to encode long and complex sentences 24. Moreover, RNN-based methods lack fine-grained feature extraction.
Most of the existing works focus on relational triple extraction from sentences containing a single triple, while few methods consider the problem of overlapping triples in the same sentence. Due to the complexity of languages, there may be more than one entity pair and relational triple in a single text, which means that a sentence can contain multiple triples. Zeng et al. 25 propose the concept of overlapping triples and design a sequence-to-sequence model with a copy mechanism. Zheng et al. 26 propose to directly model triples as a whole and solve the entity and relation extraction problem. However, relational triples are still regarded as discrete labels, which results in excessive negative cases in model training and degrades the extraction performance.
However, Transformer-based pre-trained language models such as BERT consume substantial computation resources and GPU memory, which reduces training efficiency. Furthermore, existing two-stage relation extraction usually applies the extraction operation to all relations, which results in much redundancy.

Methodology
The overall architecture of the proposed model is shown in Fig. 1; it adopts an attention-based encoder-decoder architecture. It first encodes a sentence into a fixed-length vector representation with an attention mechanism. After that, entity and relation extraction are conducted in the decoder.

HADNet encoder
Self-attention is efficient at processing natural language data. To further improve the performance of entity and relation extraction, the encoder integrates self-attention with a multi-scale extraction module (MSE) to achieve a global receptive field while utilizing local context. Figure 1 presents the overall framework of the encoder, which is composed of L context-aware self-attention blocks. In each block, the multi-head attention focuses on global feature extraction, while the MSE adopts dilated convolutions with a gated unit to capture local features.
Given the sentence, the output of the encoder is the hidden states H. The details of the components of the encoder are described in the remaining subsections.

The attention function is the scaled dot-product attention 27:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of keys and values. Finally, the outputs are concatenated and further projected to obtain the final output:

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V), \qquad \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_m)W^O$$

where m is the number of attention heads. $W_i^Q$, $W_i^K$ and $W_i^V$ are the projection matrices applied to Q, K, and V, and $W^O$ is the final output projection matrix. Multi-head attention captures global features efficiently because it models the correlations of elements in a sentence regardless of their distance. However, it ignores the local trend information inherent in the sentence. To address this problem, we further add a multi-scale extraction module, which considers the local contextual information.
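The attention computation described above can be sketched in a few lines of plain Python. This is a minimal illustration of standard scaled dot-product and multi-head attention, not the authors' implementation: the helper names (`matmul`, `attention`, `multi_head`), the toy input, and the hand-picked projection matrices are all assumptions made for readability.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def multi_head(x, heads, w_o):
    """Run every head on x, concatenate per token, then project with W^O.

    heads is a list of (W_Q, W_K, W_V) projection triples, one per head.
    """
    outputs = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))[0]
               for w_q, w_k, w_v in heads]
    concat = [sum((out[t] for out in outputs), []) for t in range(len(x))]
    return matmul(concat, w_o)

# Toy input (hypothetical): 3 tokens, model dimension 2, m = 2 heads of size 1.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
heads = [([[1.0], [0.0]], [[1.0], [0.0]], [[1.0], [0.0]]),
         ([[0.0], [1.0]], [[0.0], [1.0]], [[0.0], [1.0]])]
w_o = [[1.0, 0.0], [0.0, 1.0]]  # identity output projection, chosen for readability
out = multi_head(x, heads, w_o)
_, attn_weights = attention(x, x, x)
```

Every row of `attn_weights` sums to one, and every token attends to all other tokens regardless of distance, which is the global receptive field the text refers to.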

Multi-scale extraction module
To enhance the fine-grained encoding ability of the model and capture more accurate correlations, we design the multi-scale extraction module (MSE). The MSE explicitly captures multi-scale local information in sentences, as shown in Fig. 2. It applies dilated convolutions with gated units to exploit features at different scales of receptive fields 28. In particular, we utilize two dilated convolutions at each layer to transform the input feature. After that, the features learned from different scales are fused by means of residual connections and a gated unit to achieve a multi-scale representation 28, which is denoted as:
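As an illustration of how dilated convolutions, a gated unit, and a residual connection can be combined, the sketch below processes a scalar sequence in plain Python. The specific gate form (tanh of one convolution multiplied by the sigmoid of the other), the kernel values, and the dilation rates 1 and 2 are assumptions for this example; the paper's exact formulation is not reproduced here.

```python
import math

def dilated_conv1d(seq, kernel, dilation):
    """Same-length 1-D dilated convolution with zero padding.

    The kernel is centered, so position t reads t + (j - center) * dilation.
    """
    center = len(kernel) // 2
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + (j - center) * dilation
            if 0 <= idx < len(seq):  # positions outside the sentence count as zero
                acc += w * seq[idx]
        out.append(acc)
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_dilated_layer(seq, filter_kernel, gate_kernel, dilation):
    """One layer: two dilated convolutions, a gate, and a residual connection."""
    feat = dilated_conv1d(seq, filter_kernel, dilation)
    gate = dilated_conv1d(seq, gate_kernel, dilation)
    # Assumed gate form: tanh(filter) * sigmoid(gate), added back to the input.
    return [x + math.tanh(f) * sigmoid(g) for x, f, g in zip(seq, feat, gate)]

def multi_scale_extract(seq, layers):
    """Stack layers with growing dilation to widen the receptive field."""
    for filter_kernel, gate_kernel, dilation in layers:
        seq = gated_dilated_layer(seq, filter_kernel, gate_kernel, dilation)
    return seq

signal = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # a single active token
layers = [([0.5, 0.5, 0.5], [1.0, 1.0, 1.0], 1),   # hypothetical kernels
          ([0.5, 0.5, 0.5], [1.0, 1.0, 1.0], 2)]
mse_out = multi_scale_extract(signal, layers)
```

With dilation 1 alone, the active token only influences its immediate neighbours; after the second layer with dilation 2, its influence reaches both ends of the seven-token sequence, which shows how stacking dilation rates yields features at several scales.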

HADNet decoder
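In this section, we describe the design of the HADNet decoder, which consists of three components.

Relation prediction
The relation prediction component generates potential relations from the encoder output H, with the probability of each relation computed as $\mathrm{sigmoid}(W_r H + b_r)$ (1), where $W_r$ is the trainable weight and $b_r$ is the bias vector. Inspired by the cascade tagging method 10, we further tag the relation with a threshold $T_r$. As shown in Fig. 1, the probability tagger is set to 1 if its value is higher than $T_r$ or set to 0 if its value is lower than $T_r$.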
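The thresholded tagging step can be sketched as follows. The sketch assumes that the encoder output H is mean-pooled over tokens before the sigmoid layer, which the text does not state; the function name `predict_relations`, the weights, and the threshold value 0.5 are likewise hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def predict_relations(hidden_states, w_r, b_r, threshold):
    """Tag each relation with 1 or 0 by thresholding sigmoid(W_r h + b_r).

    hidden_states: one vector per token (the encoder output H).
    w_r: one weight row per relation; b_r: one bias per relation.
    Mean pooling over tokens is an assumption made for this sketch.
    """
    n = len(hidden_states)
    pooled = [sum(col) / n for col in zip(*hidden_states)]
    probs = [sigmoid(sum(w * h for w, h in zip(row, pooled)) + b)
             for row, b in zip(w_r, b_r)]
    # A probability exactly equal to the threshold is tagged 0 here (a choice
    # of this sketch; the text only covers the strictly higher and lower cases).
    tags = [1 if p > threshold else 0 for p in probs]
    return probs, tags

H = [[1.0, 0.0], [0.0, 1.0]]                    # two tokens, hidden size 2
W_r = [[4.0, 4.0], [-4.0, -4.0], [0.0, 0.0]]    # three hypothetical relations
b_r = [0.0, 0.0, 0.0]
probs, tags = predict_relations(H, W_r, b_r, threshold=0.5)
```

Only relations tagged 1 are passed on to the later components, so the threshold directly controls how many candidate relations the decoder has to examine.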

Entity recognition
The entity recognition component extracts subjects and objects as a sequence tagging task. Let $h_i$ denote the representation of the i-th token, where $r_j$ is the j-th relation representation, and $W_s$ and $W_o$ are trainable weights.

Relation determination
In the previous subsection, we obtained possible subjects and objects according to their potential relations in the sentence. Next, we capture the inter-dependencies between the subject and object pairs. Let $h_i^s$ and $h_j^o$ respectively denote the i-th token and the j-th token in the sentence, which form a potential subject and object pair. The cosine similarity between the two entities is used as the aggregated weight 29, where $w_{ij}$ is the weight matrix. Next, we determine the relation by comparing it with a threshold $T_d$. As shown in Fig. 1, if $P_{ij}$ is higher than $T_d$, the corresponding subject and object pair is retained; it is removed if the value is lower than $T_d$.
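The filtering step can be illustrated with a small sketch. As a simplification, it uses the raw cosine similarity between the subject and object representations as the score compared with the threshold, whereas the model derives its score from cosine-similarity-based weights; the names `cosine_similarity` and `determine_pairs` and the example vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def determine_pairs(subjects, objects, threshold):
    """Keep the (i, j) subject-object pairs whose score exceeds the threshold."""
    kept = []
    for i, h_s in enumerate(subjects):
        for j, h_o in enumerate(objects):
            if cosine_similarity(h_s, h_o) > threshold:
                kept.append((i, j))
    return kept

subjects = [[1.0, 0.0], [0.0, 1.0]]     # candidate subject representations
objects = [[1.0, 0.1], [-1.0, 0.0]]     # candidate object representations
kept_pairs = determine_pairs(subjects, objects, threshold=0.5)
```

In this toy case only the first subject and the first object point in nearly the same direction, so a single pair survives the threshold.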

Model training
The loss function is composed of three parts, $L_{rp}$, $L_e$, and $L_{rd}$, which are the losses of the relation prediction component, the entity recognition component, and the relation determination component, respectively. Each is obtained by taking the log of the corresponding probabilities.
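A three-part log loss of this kind can be sketched as below. Two points are assumptions rather than details taken from the paper: the three terms are combined by an unweighted sum, and each component is treated as a set of binary decisions scored with the negative log-likelihood.

```python
import math

def binary_log_loss(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

def total_loss(rp, ent, rd):
    """Unweighted sum (assumed) of the three component losses.

    Each argument is a (probabilities, labels) pair for one component:
    relation prediction, entity recognition, relation determination.
    """
    return sum(binary_log_loss(p, y) for p, y in (rp, ent, rd))

# Hypothetical predictions and gold labels for one sentence.
loss = total_loss(([0.9, 0.2], [1, 0]),
                  ([0.8, 0.1, 0.7], [1, 0, 1]),
                  ([0.6], [1]))
```

Training the three components under one objective lets the gradients from all decoder stages update the shared encoder.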

Experimental studies
In this section, we report the experimental results of the proposed HADNet. We first introduce the datasets, the experimental setting, and the baselines. After that, we present the experimental results and evaluation discussions.

Datasets
To test the performance of our proposed model, we use two public real-world datasets, WebNLG 30 and New York Times (NYT) 31.
• NYT: The dataset consists of 1.18M sentences from New York Times news articles published between 1987 and 2007 and contains 24 predefined relation types. It is produced by the distant supervision method. We follow the preprocessing steps of existing work 10 to split the dataset; the training, validation, and test sets contain 56195, 5000, and 5000 sentences, respectively.
• WebNLG: It is created for natural language processing tasks, and Zeng et al. 25

Baselines and evaluation metrics
We compare HADNet with the following widely used baselines. All the experimental results of the baseline methods are directly obtained from 10 unless specified.
• CASREL random 10: Cascade binary tagging framework in which all parameters of BERT are randomly initialized.
Following Zheng et al. 26 , the performance of different models is evaluated with the following metrics: precision, recall, and F1 scores.
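For reference, the three metrics can be computed over predicted and gold triples as in the sketch below. It assumes that a predicted triple counts as correct only when it matches a gold triple exactly, which the text does not spell out; the function name and the example triples are hypothetical.

```python
def triple_prf(predicted, gold):
    """Precision, recall, and F1 over sets of (subject, relation, object) triples."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # exact-match criterion (assumed)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [("Paris", "capital_of", "France"),
        ("France", "contains", "Paris"),
        ("Seine", "flows_through", "Paris")]
predicted = [("Paris", "capital_of", "France"),
             ("Paris", "located_in", "Europe")]
precision, recall, f1 = triple_prf(predicted, gold)
```

Here one of the two predictions is correct and one of the three gold triples is recovered, so precision is 1/2, recall is 1/3, and F1 is their harmonic mean.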

Experimental results
Tables 1 and 2 show the precision, recall, and F1 scores of our proposed model compared with the baselines on the WebNLG and NYT datasets, respectively. From the tables, we can draw the following conclusions: (1) Our HADNet model outperforms the state-of-the-art models not based on BERT. Only on the WebNLG dataset is the recall score of HADNet slightly lower than that of the CopyR RL model, while both its precision score and F1 score are higher than those of the CopyR RL model. Besides, there are 12% and 3% improvements in F1 values over the two datasets, respectively. (2) Compared with the model based on BERT (CASREL random), on the WebNLG dataset our model obtains the best precision score, which is 88.8. There is a 3% improvement compared with the CASREL random model, which is 84.7. On the NYT dataset, the precision score of our model is 81.2, which is competitive with the precision score of the CASREL random model (81.5). These facts imply the effectiveness of our model.
Figures 3 and 4 show the performance of different models under different evaluation metrics over the two datasets. We can see that our model performs better than the other models that do not use BERT as the pre-trained model. It is also observed that the performance of HADNet on NYT is not as good as that of the model based on BERT. Following previous works 10,33,34, we further conduct experiments over the NYT dataset to explore the performance of HADNet on overlapping problems, and the results are shown in Table 3. We can see that our model obtains satisfactory performance under different overlapping patterns. Moreover, the performance even improves under the EPO and SEO patterns. This implies that our model is competitive in solving the extraction task of overlapping triples.
Table 4 shows the F1-score of sentences with different numbers of triples, where N is the number of triples in a sentence. Compared with the baseline models, HADNet achieves excellent results when N varies from 1 to 4. From Fig. 5, we can see that the performance of all models is best when N = 1, while it declines considerably as N increases. Although the performance of our model also declines as N increases, it remains consistently better than the other baselines for every N.
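The overlapping patterns can be made concrete with a small classifier. It uses the commonly adopted definitions, stated here as an assumption since the text does not define them: a sentence is EPO (entity pair overlap) when two triples share the same subject-object pair, SEO (single entity overlap) when distinct pairs share an entity, and Normal otherwise. The function name and the example triples are hypothetical, and the triples in a sentence are assumed to be distinct.

```python
def overlap_patterns(triples):
    """Return the overlapping patterns found in one sentence's triples.

    triples: list of distinct (subject, relation, object) tuples.
    """
    pairs = [(s, o) for s, _, o in triples]
    unique_pairs = set(pairs)
    patterns = set()
    if len(unique_pairs) < len(pairs):
        patterns.add("EPO")  # the same entity pair carries several relations
    entities = [e for pair in unique_pairs for e in pair]
    if len(set(entities)) < len(entities):
        patterns.add("SEO")  # different pairs share at least one entity
    return patterns or {"Normal"}

normal = overlap_patterns([("A", "r1", "B"), ("C", "r2", "D")])
epo = overlap_patterns([("A", "r1", "B"), ("A", "r2", "B")])
seo = overlap_patterns([("A", "r1", "B"), ("A", "r2", "C")])
```

A sentence can carry both labels at once, so the function returns a set rather than a single pattern.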

Conclusion
Entity and relation extraction has attracted continuous attention in recent years. However, the overlapping triples problem and training efficiency pose challenges for it. To tackle these problems, in this paper we proposed a novel Hybrid Attention and Dilated convolution Network (HADNet), which addresses computational efficiency and overlapping triples while maintaining competitive performance. In particular, we first designed a novel encoder that combines self-attention with dilated convolution and a gated unit for efficient relation extraction. Then, we employed cosine similarity schemes to determine relations. Finally, when evaluated on two real-world datasets, the proposed model achieved better results than state-of-the-art baselines that do not use BERT as a pre-trained model. In future work, we intend to explore jointly learning named entities and relations based on graph convolutional networks. Moreover, we plan to evaluate the proposed model on more datasets to verify the universality and effectiveness of our method.
https://doi.org/10.1038/s41598-023-40474-1
www.nature.com/scientificreports/

Multi-head attention
Multi-head attention is widely used in many applications based on the self-attention mechanism 27. It aggregates the information of the previous layer by first mapping the queries, keys, and values into three representation subspaces, namely Q, K, and V, through m different linear transformations. Then, the attention function is performed in parallel 27.

Figure 1. The architecture of HADNet. HADNet follows an encoder-decoder structure. The encoder stacks multiple self-attention blocks (i.e., blue blocks) and multi-scale extraction blocks (i.e., pink blocks). Given a sentence, the output of the encoder is the embedding H, which is fed into the HADNet decoder. The decoder is decomposed into three components. The relation prediction component generates potential relations. Based on these, the entity recognition component tags subjects and objects. Finally, the relation determination component retains the correct subject and object pairs.
$= \mathrm{sigmoid}(W_r H + b_r)$

Figure 3. Results of different models (without BERT) over WebNLG and NYT datasets.

[Figure axis labels: N = 1, N = 2, N = 3, N = 4]

first utilize it for relation extraction tasks. The dataset contains 246 valid relations. We follow the preprocessing steps of Wei et al. 10 to split the dataset; the training, validation, and test sets contain 5019, 500, and 703 sentences, respectively.

• NovelTagging 26: Sequence annotation relational triple extraction based on entity relation joint decoding.
• CopyR OneDecoder 25: End-to-end relational triple extraction model based on a single decoder.
• CopyR MultiDecoder 25: End-to-end relational triple extraction model based on multiple decoders.
• GraphRel 1p 8: Relational triple extraction model based on a graph convolutional neural network.
• GraphRel 2p 8: Graph convolutional neural network model for relational triple extraction based on fusing relation-weighted vectors.
• CopyR RL 32: Relational triple extraction model based on reinforcement learning.

Table 1. Results of different models (without BERT) over WebNLG and NYT datasets. The bold values are the best results among the compared methods.

Table 2. Results of HADNet and CASREL (with BERT) over WebNLG and NYT datasets. The bold values are the best results among the compared methods.

The reason is that HADNet adopts a simple and efficient mechanism to approximate the pre-trained model BERT, which results in limited representation ability. Nonetheless, HADNet still outperforms models not based on BERT, such as CopyR RL and GraphRel, and is close to CASREL random. This implies that a model based on self-attention and dilated convolution is able to achieve stable and competitive expression ability.

Table 3. Results of different overlapping patterns over the NYT dataset.

Table 4. F1-score of sentences with different numbers of triples.