BioBBC: a multi-feature model that enhances the detection of biomedical entities

The rapid increase in biomedical publications necessitates efficient systems that automatically handle Biomedical Named Entity Recognition (BioNER) in unstructured text. However, accurately detecting biomedical entities is challenging because of the complexity of their names and the frequent use of abbreviations. In this paper, we propose BioBBC, a deep learning (DL) model that uses multi-feature embeddings and is built on the BERT-BiLSTM-CRF architecture to address the BioNER task. BioBBC consists of three main layers: an embedding layer, a Bidirectional Long Short-Term Memory (BiLSTM) layer, and a Conditional Random Fields (CRF) layer. BioBBC takes sentences from the biomedical domain as input and identifies the biomedical entities mentioned in the text. The embedding layer generates enriched contextual representation vectors of the input by combining four types of embeddings: part-of-speech (POS) tag embedding, character-level embedding, BERT embedding, and data-specific embedding. The BiLSTM layer produces additional syntactic and semantic feature representations. Finally, the CRF layer identifies the best possible tag sequence for the input sentence. Our model is constructed and optimized for detecting different types of biomedical entities. In experiments on six benchmark BioNER datasets, our model outperformed state-of-the-art (SOTA) models with significant improvements.
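To illustrate the final decoding step, the following minimal sketch shows how a best tag sequence can be selected from per-token scores and tag-transition scores using Viterbi decoding, the standard decoding procedure for a linear-chain CRF. It is not the BioBBC implementation: the simplified tag set (`B`, `I`, `O`), all scores, and the function name `viterbi_decode` are assumptions chosen for illustration.

```python
from typing import Dict, List


def viterbi_decode(emissions: List[Dict[str, float]],
                   transitions: Dict[str, Dict[str, float]],
                   tags: List[str]) -> List[str]:
    """Return the highest-scoring tag sequence under additive scores."""
    if not emissions:
        return []
    # best[t][tag]: best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t - 1][tag]: previous tag on that best path
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical scores for the three tokens "congenital", "DM", "is".
tags = ["O", "B", "I"]
emissions = [
    {"O": 1.0, "B": 0.9, "I": 0.0},
    {"O": 0.0, "B": 0.5, "I": 2.0},
    {"O": 2.0, "B": 0.0, "I": 0.0},
]
# Assumed transition scores: every move is free except O -> I,
# which is an invalid move in a BIO-style scheme and is penalized.
transitions = {p: {c: 0.0 for c in tags} for p in tags}
transitions["O"]["I"] = -100.0

greedy = [max(tags, key=lambda tag: e[tag]) for e in emissions]
decoded = viterbi_decode(emissions, transitions, tags)
print(greedy)   # per-token argmax, ignores transitions
print(decoded)  # sequence-level decoding
```

In this toy setting the per-token argmax yields an `I` tag directly after `O`, whereas sequence-level decoding prefers a path that opens the entity with `B`. This is the kind of boundary consistency a CRF layer is meant to provide.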

Table S1 displays the results. In case 1, the models should recognize entities of the disease type. PubTator could not recognize the entity's boundary because it missed the token "congenital", whereas BioBBC correctly identified the multi-token entity "congenital DM". In case 2, the example contains three multi-token disease entities. PubTator failed to recognize the third entity, "prion diseases", while BioBBC recognized all the entities with the correct boundaries. In case 3, PubTator recognized the chemical entities "Allopurinol" and "Thioctic Acid" but failed to recognize their abbreviations "ALO" and "THA", whereas BioBBC detected both the chemical names and their abbreviations. Case 4 contains six chemical entities. PubTator detected most of them but failed to detect the multi-token chemical entity "thiobarbituric-acid-reactive-substances", missing the entity and all of its parts. BioBBC, on the other hand, detected all the entities appearing in the sentence. In case 5, PubTator again could not recognize the entity's boundary because it did not pick up the token "prepro", whereas our system correctly recognized this token. Here our model learned the relation between "prepro" and "AVP-NPII" by considering the syntactic and semantic features of the sentence. In case 6, PubTator failed to recognize the gene entity "glucocorticoid receptors", whereas our model correctly recognized both the entity and its abbreviation, "GR". These examples show the robustness of BioBBC and its ability to better learn the syntactic and semantic information of the sentence, which extends to better recognition of entity abbreviations.
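The boundary errors described above can be made concrete with a small sketch that converts token-level tags into entity spans and reports predictions that overlap a gold entity without matching it exactly. The BIO tagging scheme, the tokenization, and the helper names `bio_to_spans` and `boundary_errors` are assumptions for illustration. Only the example sentence and the entity "congenital DM" come from case 1.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Collect entity spans from a BIO tag sequence (stray I- tags are ignored)."""
    spans: List[Span] = []
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def boundary_errors(gold: List[str], pred: List[str]) -> List[Tuple[Span, Span]]:
    """Pairs (gold, predicted) of same-type spans that overlap but differ."""
    return [(g, p)
            for g in bio_to_spans(gold)
            for p in bio_to_spans(pred)
            if g[2] == p[2] and g != p and g[0] < p[1] and p[0] < g[1]]


tokens = ["transmission", "of", "congenital", "DM", "is", "rare"]
gold = ["O", "O", "B-Disease", "I-Disease", "O", "O"]
# A prediction that misses "congenital", as described for case 1.
partial = ["O", "O", "O", "B-Disease", "O", "O"]

for g, p in boundary_errors(gold, partial):
    print("gold:", tokens[g[0]:g[1]], "predicted:", tokens[p[0]:p[1]])
```

A prediction identical to the gold tags produces no boundary errors, so the same helper can be used to separate exact matches from partial matches when inspecting system output.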

BioBBC: The study demonstrated a decreased level of glucocorticoid receptors (GR) in peripheral blood lymphocytes

• Example of paragraph
To further show the robustness of our system, we also conducted a case study using long paragraphs taken from two studies 1,2. Since we have a separate model for each entity type, we conducted this experiment by sequentially feeding the text into three models trained on NCBI-Disease, BC5CDR-Chem, and BC2GM. We then merged the labels produced by these models.
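The merging step can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: it assumes that each model emits one BIO tag per token over the same tokenization, and it marks a token as ambiguous when more than one model labels it. The function names and the tokenization of the example mention are hypothetical.

```python
from typing import Dict, List


def merge_predictions(per_model: Dict[str, List[str]]) -> List[List[str]]:
    """For each token, collect the non-O tags assigned by any of the models."""
    lengths = {len(tags) for tags in per_model.values()}
    if len(lengths) != 1:
        raise ValueError("all models must label the same token sequence")
    merged: List[List[str]] = [[] for _ in range(lengths.pop())]
    for tags in per_model.values():
        for i, tag in enumerate(tags):
            if tag != "O":
                merged[i].append(tag)
    return merged


def ambiguous_tokens(merged: List[List[str]]) -> List[int]:
    """Indices of tokens that received labels from more than one model."""
    return [i for i, labels in enumerate(merged) if len(labels) > 1]


# Hypothetical tokenization of the mention "hsa-mir-103a-3p".
tokens = ["hsa", "-", "mir", "-", "103a", "-", "3p"]
predictions = {
    "disease": ["O"] * 7,
    "chemical": ["O", "O", "O", "O", "B-Chemical", "O", "O"],
    "gene": ["B-Gene"] + ["I-Gene"] * 6,
}

merged = merge_predictions(predictions)
print([tokens[i] for i in ambiguous_tokens(merged)])
```

Keeping every label per token, rather than choosing a winner, leaves the conflict visible so that overlapping predictions can be reviewed or resolved by a separate rule.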

Figures S1 and S2 show the results of this case study. Although we have not yet built a GUI system, we colored the results manually to match the color scheme of PubTator. Specifically, we used orange, green, and purple highlights to annotate the predicted disease, chemical, and gene entities, respectively. Additionally, we used a blue highlight to point out the ambiguous entities in our system.
Figure S1 shows that BioBBC correctly detected more biomedical entities than PubTator.
Specifically, for disease entities, PubTator annotated the disease names "Type 2 Diabetes Mellitus" and "Alzheimer's disease" but ignored the mentions of their abbreviations "T2D" and "AD". In contrast, BioBBC detected all the mentions of the diseases and their abbreviations.
Additionally, for chemical entities, PubTator wrongly annotated "DEGs" and "hsa-mir-129-2-3p" as chemicals, although "DEGs" is a general term and "hsa-mir-129-2-3p" should be annotated as a gene. Finally, for gene entities, PubTator annotated some microRNAs as genes but missed many other entities. In contrast, BioBBC detected all the gene entities with correct boundaries. However, we note that while BioBBC correctly detected the entity "hsa-mir-103a-3p" as a gene, it wrongly detected the token "103a", highlighted in blue, as a chemical entity when we ran this text through our chemical model.
Example 2 (Figure S2) shows that both BioBBC and PubTator recognized the disease entities correctly. For genes, PubTator detected FGFR as a gene but missed many mentions of it, whereas BioBBC correctly detected all the gene entities. Moreover, BioBBC detected "Fibroblast growth factor receptors" and "tyrosine kinase" as gene entities. However, in this example, BioBBC also annotated "tyrosine" as a chemical entity. For chemicals, PubTator detected the first occurrence of the chemical entity "PD173074" and missed the second one. This example demonstrates a consistency problem in PubTator that is not apparent in BioBBC, which correctly detected all occurrences of the chemical entities.
The examples in the case study section show how BioBBC improves BioNER performance by correctly detecting and annotating biomedical entities. Compared with PubTator, BioBBC recognizes more biomedical entities and better captures the structure of the text by effectively learning syntactic and semantic features. Lastly, our model reliably recognizes entity boundaries and abbreviations of biomedical entities, and it avoids the inconsistency problems noted above. These advantages demonstrate the benefit of BioBBC in solving BioNER tasks.

• Example of error cases produced by BioBBC
Here, we present some error cases produced by BioBBC (see Table S2). In the first sentence, the model wrongly annotated the term "cocaine" as a disease. In the second example, while the term "amino acid" was detected as a chemical, the model missed its abbreviation "AA". In the third example, the word "reduced" was captured as the first word of the entity; this boundary expansion may be due to the syntactic information captured in our model, which marks the word "reduced" as part of the phrase. In the last example, the term "insulins" was wrongly detected as a gene. These error cases show that there is room for improvement in BioNER. Specifically, the information captured in our model could be improved, since the syntactic information of the complex biomedical sentences was extracted using a general-domain tool. Also, using one-hot encoding for the POS tags may have some limitations. However, we conducted a preliminary experiment comparing one-hot vectors with GloVe embeddings and observed a decrease in performance when using GloVe. This outcome suggests that further configuration might be necessary. The drop in performance could be attributed to our initial optimization for one-hot encoding, indicating that our parameters and settings may need further adjustment before other embedding methods can be used.
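One limitation of one-hot POS vectors can be shown with a short sketch: every pair of distinct tags is orthogonal, so the encoding cannot express that two tags (for example, singular and plural nouns) are related. The tag subset below is a hypothetical excerpt of Penn Treebank-style tags, and the function names are illustrative. Neither is taken from the BioBBC configuration.

```python
from typing import List

# Hypothetical subset of POS tags; the tag set used by the model is not specified here.
POS_TAGS = ["NN", "NNS", "VBD", "JJ", "IN"]


def one_hot(tag: str, vocab: List[str] = POS_TAGS) -> List[int]:
    """Encode a tag as a one-hot vector; raises ValueError for an unknown tag."""
    vec = [0] * len(vocab)
    vec[vocab.index(tag)] = 1
    return vec


def dot(a: List[int], b: List[int]) -> int:
    """Dot product, used here as a simple similarity measure."""
    return sum(x * y for x, y in zip(a, b))


print(one_hot("VBD"))
print(dot(one_hot("NN"), one_hot("NNS")))  # related tags
print(dot(one_hot("NN"), one_hot("VBD")))  # unrelated tags
```

Both dot products are zero, which means that a one-hot encoding treats all tags as equally dissimilar. Dense embeddings can in principle encode such relations, but, as noted above, they may require their own tuning.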
3. Chemical/Drug
PubTator: the binary mixture of Allopurinol ( ALO ) and Thioctic Acid ( THA ) . objective : a comprehensive stability-indicating HPLC-DAD procedure has been executed for concurrent analysis of ALO and THA
PubTator3: the binary mixture of Allopurinol ( ALO ) and Thioctic Acid ( THA ) . objective : a comprehensive stability-indicating HPLC-DAD procedure has been executed for concurrent analysis of ALO and THA
BioBBC: the binary mixture of Allopurinol ( ALO ) and Thioctic Acid ( THA ) . objective : a comprehensive stability-indicating HPLC-DAD procedure has been executed for concurrent analysis of ALO and THA

4. Chemical/Drug
PubTator: Then, the brain homogenates content of thiobarbituric-acid-reactive-substances (TBARS), 4-Hydroxy-2-nonenal (4-HNE) and acetylcholine (ACh)/acetylcholine…
PubTator3: Then, the brain homogenates content of thiobarbituric-acid-reactive-substances (TBARS), 4-Hydroxy-2-nonenal (4-HNE) and acetylcholine (ACh)/acetylcholine…
BioBBC: Then, the brain homogenates content of thiobarbituric-acid-reactive-substances (TBARS), 4-Hydroxy-2-nonenal (4-HNE) and acetylcholine (ACh)/acetylcholine…

5. Gene
PubTator: found in both the signal peptide of the prepro-AVP-NPII precursor and within NPII itself
PubTator3: found in both the signal peptide of the prepro-AVP-NPII precursor and within NPII itself
BioBBC: found in both the signal peptide of the prepro-AVP-NPII precursor and within NPII itself

6. Gene
PubTator: The study demonstrated a decreased level of glucocorticoid receptors (GR) in peripheral blood lymphocytes
PubTator3: The study demonstrated a decreased level of glucocorticoid receptors (GR) in peripheral blood lymphocytes

Table S1. Example of test sentences.

1. Disease
BioBBC: ..transmission of congenital DM is rare and preferentially occurs with onset of DM..

2. Disease
PubTator: such as Alzheimer's disease, amyotrophic lateral sclerosis, the prion diseases, and…
PubTator3: such as Alzheimer's disease, amyotrophic lateral sclerosis, the prion diseases, and…
BioBBC: such as Alzheimer's disease, amyotrophic lateral sclerosis, the prion diseases, and…

Table S2. Example of error cases.