Clinical standard terms and ontologies are necessary for computing systems to promote unhindered communication, improve workflows, and build applications for quality control, clinical decision support, and clinical trials1,2,3. Natural language allows for widely varied expressiveness but can also be ambiguous, and jargon and acronyms abound in medical settings4. Non-standardized reporting of imaging observations in patients is a source of diagnostic risk, leading to inconsistent communication and inaccurate risk estimation by medical experts5. Radiological technology is a field covering the principles of radiation, magnetic fields, and ultrasound, engineering for radiation therapy, and imaging technology in addition to medical knowledge. Among medical fields, radiological technology therefore draws substantially on engineering, and a variety of specialists are involved in this area. For these reasons, terminology plays an important role in radiological technology.

Terminologies and ontologies cover the terms of a specific subject field or field of activity and describe the interrelationships between those terms. Many ontologies have been created in the medical field, such as the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), the International Classification of Diseases (ICD), the Logical Observation Identifiers, Names, and Codes (LOINC), and RadLex. SNOMED CT is a comprehensive concept system for healthcare and is becoming adopted as a standard terminology for electronic health records3. ICD is a health statistics coding tool for the classification of human diseases, syndromes, and conditions. LOINC aims to provide a means of uniquely identifying the information elements in electronic health records6. The first release of LOINC in May 1995 contained only terms for laboratory testing, but LOINC has since grown significantly in other fields, including radiology, standardized survey instruments and patient-reported outcome measures, clinical documents, nursing management data, and nursing assessments7. RadLex, which is published by the Radiological Society of North America, aims to provide a comprehensive resource for image-related terms, spanning areas such as imaging technologies, image findings, anatomy, and pathology8.

Together with these, terminologies in radiological technology have been developed. The International Organization for Standardization (ISO) notes that “health terminology is complex and multifaceted, more so than most other language fields,” and that “it has been estimated that between 500,000 and 45 million different concepts are needed to adequately describe concepts9.” By contrast, the Japanese terminology for radiological technology published by the Japanese Society of Radiological Technology has fewer than 10,000 entries. Since this may not be enough, it is necessary to add terms and to maintain and update the body of recommended terminology continuously.

Developing and maintaining terminologies is difficult work10, particularly because the language of medicine is in constant flux11. Computational support therefore plays an important role in providing efficient and accurate terminology expansion. The goal of this study is to edit and maintain the terminology of radiological technology automatically and efficiently. One of the important tasks is to identify synonyms automatically. The same concept may be described with different expressions in text data; when this happens, a computer treats the expressions as different words, and accurate processing becomes impossible.

One useful approach to automatic synonym detection is distributed representation, as implemented in Word2vec and fastText, and it is widely applied in the medical field12. A distributed representation expresses words as vectors of hundreds of dimensions: each concept is represented by an activity pattern across many dimensions, and each dimension contributes to many concepts.

A further challenge in this study is to apply distributed representation methods to Japanese technical terms, since these methods were originally developed for languages written in the Latin alphabet. The Japanese language uses three sets of characters, each with specific grammatical uses: kanji, hiragana, and katakana. Kanji are ideograms, each of which has its own meaning and can correspond to a word. Hiragana and katakana are native sets of syllable characters developed from kanji characters and are used only in Japanese. A Japanese word can be written in a mixture of kanji, hiragana, and katakana, and it is sometimes shortened. Katakana is also frequently used in transliterations13. Moreover, alphabetic characters are also found in medical documents as technical phrases and acronyms. For example, “MRI” can be represented in kanji as “磁気共鳴画像 (jikikyoumeigazou)” or as a combination of kanji and an abbreviation, as in “MR画像 (MR-gazou).” The term “X-ray” can be translated into Japanese as “X線 (x-sen),” which only translates the word “ray”; as “エックス線 (x-sen),” using a katakana transliteration and kanji; or as “レントゲン線 (Rentogen-sen),” named after its discoverer. Synonymous terms in Japanese can thus differ in script selection, loanword integration, and context, so expressions in Japanese are intrinsically more intricate than those in Latin-alphabet languages. While research on synonym identification using distributed representations in Japanese exists14,15,16,17,18, the nuances of complex expression patterns remain underexplored. It is therefore crucial to discern the advantages and limitations of distributed representations for managing terminology in radiological technology.

The purpose of this study was to evaluate the accuracy of automatic synonym identification in expression patterns using distributed representations in the field of radiological technology.

Methods

The methodology of this study encompassed several stages: data collection and preprocessing, the development of word embedding models, the formation of synonym sets, prediction of synonyms, and finally, evaluation, as illustrated in Fig. 1.

Figure 1
figure 1

Study flow.

Data collection and preprocessing

We collected data from 337,479 abstracts (206 MB), published from 1980 to the data collection date (June 20, 2019), from Ichushi-Web, a Japanese database19. Ichushi-Web archives approximately 400,000 periodicals published in Japan annually. This comprehensive collection includes journals from academic societies, medical publishers, and university bulletins, and spans disciplines such as medicine, dentistry, pharmacy, nursing, and related fields. To date, the database holds more than 15 million records. When collecting data, we used “diagnosis imaging” as a keyword along with descriptors that are synonyms of the keyword (Table 1).

Table 1 List of descriptors.

Our preprocessing phase involved several steps. First, we extracted the main text of the abstracts and inserted spaces before and after symbols such as brackets, colons, and semicolons. Next, we edited English technical terms consisting of two or more words: we inserted underscores before, after, and between words appearing in RadLex 4.020, because distributed representation tools recognize word boundaries by the spaces between words. Finally, we inserted spaces between words using MeCab21, a morphological analyzer, with the mecab-ipadic-NEologd dictionary22 and the terminology for radiological technology23,24. Figure 2 shows an example of the preprocessing.

Figure 2
figure 2

Example of data preprocessing (□: space).
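The symbol-spacing and term-joining steps can be sketched in Python. This is a minimal illustration, not the study's actual pipeline: the short term list stands in for the full RadLex 4.0 vocabulary, and the MeCab morphological segmentation step is omitted to keep the sketch dependency-free.

```python
import re

# Hypothetical mini-list standing in for the multiword RadLex 4.0 terms
# used in the study.
RADLEX_MULTIWORD = ["magnetic resonance imaging", "computed tomography"]

def space_symbols(text: str) -> str:
    """Insert spaces before and after brackets, colons, and semicolons."""
    return re.sub(r"([()\[\]:;])", r" \1 ", text)

def join_radlex_terms(text: str) -> str:
    """Join multiword English terms with underscores so that each term
    is treated as a single token by the embedding tools."""
    for term in RADLEX_MULTIWORD:
        text = text.replace(term, term.replace(" ", "_"))
    return text

def preprocess(text: str) -> str:
    # Morphological segmentation with MeCab and the custom dictionaries
    # would follow here; it is omitted in this sketch.
    return space_symbols(join_radlex_terms(text))
```

After these steps, each underscore-joined term survives tokenization as one unit, which is the precondition for it to receive a single embedding vector.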

Creation of word embedding models

We used two methods: Word2vec25,26 and fastText27. Word2vec learns, via a neural network, the contexts in which a particular word appears. There are two architectures for producing a distributed representation of words: CBOW and skip-gram. CBOW predicts the current word from the surrounding context words; skip-gram predicts the neighboring context words from the current word. fastText improves on Word2vec by using character n-grams to address the problem that Word2vec ignores sub-word information and out-of-vocabulary (OOV) words.

The parameters of Word2vec and fastText shown in Table 2 were used to create the distributed representations. The Gensim package for Word2vec28 and the fastText library29 were used to train the vectors. All training experiments were run on a computer with the Ubuntu 18.04 operating system, an Intel Core i7-9700K CPU, and 64 GB RAM, using Python 3.8.3.

Table 2 Parameters in Word2vec and fastText.

Synonym set creation

Two radiological technologists collaborated in developing a synonym set for the evaluation dataset, using terms pertinent to radiology and radiologic technology. Each set in this collection included a Japanese term along with its associated synonyms, which comprised either the corresponding English term and its acronym or the English term alone. This process yielded 1029 sets that became the foundation for this study. Finally, we categorized each set by the field relevant to its terms. The designated fields were image engineering, physical phenomena, radiation management, equipment, informatics, diagnostic imaging, radiation therapy, and basic medicine (Table 3). For the evaluation of synonym identification by expression pattern, we classified the synonyms into seven patterns based on how they are expressed: transliteration variants, different Japanese spellings with the same meaning, Japanese shortened forms, conversion to transliteration, English words, English acronyms, and plural expressions (Table 4).

Table 3 Eight fields in radiological technology and their descriptions.
Table 4 Expression patterns in synonym sets.

Synonym prediction

Subsequently, the preferred terms from the synonym sets were input into the distributed representations. Cosine similarities were then calculated, and the top 10 synonym candidates with the highest cosine similarity to the input term were obtained. Cosine similarity is defined as follows:

$${\text{cosine}} \, \mathrm{ similarity}= \frac{{\text{A}}\cdot {\text{B}}}{\Vert {\text{A}}\Vert \Vert {\text{B}}\Vert }$$
(1)

where A is a vector of an input word and B is a vector of a word in a distributed representation.
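Equation (1) and the top-10 retrieval step can be sketched as follows; `vocab_vecs` is a hypothetical word-to-vector mapping, such as one taken from a trained model (Gensim's `most_similar` performs the same ranking internally).

```python
import numpy as np

def cosine_similarity(a, b):
    """Eq. (1): A.B / (||A|| ||B||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_candidates(query_vec, vocab_vecs, k=10):
    """Rank every vocabulary word by cosine similarity to the query
    vector and return the k most similar words."""
    sims = {w: cosine_similarity(query_vec, v) for w, v in vocab_vecs.items()}
    return sorted(sims, key=sims.get, reverse=True)[:k]
```

With k = 10 this yields exactly the candidate list evaluated in the next section.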

Evaluation

The ten synonym candidates obtained were compared against the synonyms listed in the set. If the spelling of a synonym matched that of a candidate, the candidate was judged correct. Precision, recall, F1-score, and accuracy were then computed for each word embedding model as follows.

$${\text{Precision}} = {\text{TP}}/\left( {{\text{TP}} + {\text{FP}}} \right)$$
(2)
$${\text{Recall}} = {\text{TP}}/\left( {{\text{TP}} + {\text{FN}}} \right)$$
(3)
$${\text{F1-score}} = 2 \times {\text{Precision}} \times {\text{Recall}}/\left( {{\text{Precision}} + {\text{Recall}}} \right)$$
(4)
$${\text{Accuracy}} = \left( {{\text{TP}} + {\text{TN}}} \right)/\left( {{\text{TP}} + {\text{FP}} + {\text{FN}} + {\text{TN}}} \right)$$
(5)

where TP stands for true positive and refers to cases in which a word is correctly identified as a synonym; FP denotes false positive, cases in which a word is erroneously identified as a synonym when it is not; FN signifies false negative, cases in which a word is incorrectly identified as not being a synonym; and TN refers to true negative, cases in which a word is correctly identified as not being a synonym. TN was set to 0 in this study because only synonym candidate words are output in this task.
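For one synonym set, Eqs. (2)–(5) with TN fixed at 0 reduce to the following sketch; `predicted` and `gold` are illustrative names for the candidate words and the reference synonyms.

```python
def evaluate(predicted: set, gold: set) -> dict:
    """Compute Eqs. (2)-(5) for one synonym set.
    TP: gold synonyms found among the candidates;
    FP: candidates that are not gold synonyms;
    FN: gold synonyms that were missed; TN is fixed at 0."""
    tp = len(predicted & gold)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    tn = 0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn) if tp + fp + fn + tn else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}
```

Note that with TN = 0, accuracy can never exceed precision or recall, which is consistent with the score patterns reported in the Results.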

For each model, cumulative accuracies across ranks were derived by consolidating the individual prediction accuracies within the model using Eq. (5). Moreover, to gauge the accuracy of synonym identification by expression pattern, we computed precision, recall, F1-score, and accuracy over the eight fields of radiological technology and the different expression patterns in the synonym sets. In the analysis of English terms and abbreviations, the output was also assessed when restricted to words consisting solely of alphabetic characters.
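One plausible reading of a rank-wise cumulative accuracy, counting a query as correct when at least one reference synonym appears among its top-k candidates, can be sketched as follows (a simplification of the Eq. (5)-based consolidation described above):

```python
def cumulative_accuracy(ranked_candidates, gold_sets, k):
    """Fraction of queries for which at least one gold synonym
    appears within the top-k ranked candidates."""
    hits = sum(
        1 for cands, gold in zip(ranked_candidates, gold_sets)
        if any(c in gold for c in cands[:k])
    )
    return hits / len(gold_sets)
```

Sweeping k from 1 to 10 produces the cumulative curves of the kind shown in Fig. 3.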

Ethical approval

This article does not include any studies with human participants or animals that were performed by any of the authors.

Results

Comparison among models

Comparing the evaluation indices of the synonym predictions across the word embedding models, the model that achieved the highest precision, recall, F1-score, and accuracy was fastText with CBOW at 300 vector dimensions, registering scores of 0.5567, 0.8872, 0.6841, and 0.5199, respectively (Table 5). In all evaluation indices, fastText consistently outperformed Word2vec, and CBOW tended to be superior to skip-gram. However, in fastText with CBOW, the 100-dimensional representation yielded the lowest performance. There was a general trend of improved performance with increasing dimensions in both fastText with skip-gram and Word2vec with CBOW; similarly, in Word2vec with skip-gram, performance declined as dimensionality decreased.

Table 5 Cumulative ratio of synonym prediction in each word embedding model (FT: fastText, W2V: Word2vec, SKIP: skip-gram, number: vector dimensions).

Figure 3 illustrates the cumulative ratios observed for each model. For Word2vec with CBOW, the average variation in the cumulative ratio over the top 1–10 words was approximately 14.8% to 19.5%. In contrast, fastText with CBOW exhibited a range of about 24.1% to 29.5%, and fastText with skip-gram a span of 19.0% to 28.2%. It is noteworthy that in the fastText models, cumulative accuracy improved markedly when terms appearing at lower ranks were incorporated. Table 6 shows examples of the top-10 synonym candidates from the most accurate model of each architecture.

Figure 3
figure 3

Evaluation of each model (top left: Precision, top right: Recall, bottom left: F1-score, bottom right: Accuracy).

Table 6 Examples of the Top-10 synonym candidates in the models with the greatest accuracy.

Analysis for the synonym expression patterns in Japanese

Table 7 presents the results for synonym expression patterns in Japanese. For “different Japanese spellings with the same meaning,” almost all fastText models demonstrated superior performance over Word2vec across all indices. The fastText model employing CBOW at 400 dimensions reported precision, recall, F1-score, and accuracy values of 0.5365, 0.6649, 0.5938, and 0.4222, respectively.

Table 7 Results for synonym expression patterns in Japanese (FT: fastText, W2V: Word2vec, SKIP: skip-gram, number: vector dimensions).

For “Japanese shortened forms,” the indices for fastText notably surpassed those of Word2vec. The best performance was observed for fastText with CBOW at 900 dimensions, yielding precision, recall, F1-score, and accuracy values of 0.9038, 0.8952, 0.9000, and 0.8174, respectively. Furthermore, performance improved discernibly as the number of dimensions increased.

For “conversion to transliteration,” Word2vec models with CBOW outperformed the other models in all indices. Specifically, the model at 400 dimensions achieved a precision of 0.328, a recall of 0.625, an F1-score of 0.430, and an accuracy of 0.274.

For “transliteration variants,” fastText models consistently outperformed Word2vec. The skip-gram approach at both 200 and 400 dimensions stood out, registering precision, recall, F1-score, and accuracy values of 0.833, 0.935, 0.882, and 0.789, respectively.

In the category of “plural expressions,” fastText models significantly outperformed Word2vec models, particularly in precision, recall, and overall accuracy. The fastText model employing a 300-dimensional CBOW architecture was the most effective. Figure 4 presents the distribution of the most prominent synonym expression patterns for plural expressions across the models. For the categories “transliteration variants,” “different Japanese spellings with the same meaning,” and “Japanese shortened forms,” optimal performance was achieved within the 200- to 400-dimensional range using fastText with CBOW. In contrast, the other notation patterns tended to favor Word2vec, with “conversion to transliteration” peaking with a 500-dimensional CBOW model.

Figure 4
figure 4

Frequency of synonym expression patterns detected in multiple expressions (FT: fastText, W2V: Word2vec, SKIP: skip-gram, number: vector dimensions).

Analysis for the synonym expression patterns in English terms and abbreviations

Table 8 shows the results for English terms and abbreviations. For “Japanese and English terms,” all indices were poor for all models. The best F1-score and accuracy were 0.1836 and 0.1011, for Word2vec with CBOW at 800 dimensions. When the output was restricted to alphabetic-only words, all indicators improved, and the improvement was particularly pronounced for fastText. The best model was fastText with skip-gram at 800 and 900 dimensions, with precision, recall, F1-score, and accuracy of 0.3253, 0.3718, 0.3470, and 0.2099, respectively. The rate of increase ranged from 0.15 to 0.25 for all indices.

Table 8 Results for synonym expression patterns in English terms and acronyms (FT: fastText, W2V: Word2vec, SKIP: skip-gram, number: vector dimensions).

For “Japanese words and English acronyms,” fastText with skip-gram performed well on all indicators. In particular, the 400-dimensional model showed the best values for precision, F1-score, and accuracy, at 0.4643, 0.6341, and 0.4643, respectively. For fastText with CBOW at 400 dimensions, these indices increased by about 0.25, while for the other fastText skip-gram models, they increased by about 0.11 to 0.13.

Analysis for eight fields in radiological technology

Across the evaluative indices, the fields of “Image Engineering,” “Physical Phenomena,” “Equipment,” “Radiation Therapy,” “Medicine,” and “Imaging Diagnosis” showed optimal values with fastText with CBOW. In these fields, the optimal vector dimensionality was consistently below 500, gravitating towards approximately 300 dimensions. “Imaging Diagnosis” was the best performing of all categories, with precision, recall, F1-score, and accuracy of 0.7137, 0.9586, 0.8182, and 0.6923, respectively. “Radiation Control” performed best with fastText using the skip-gram approach at 600 dimensions, and “Informatics” showed the best values on all evaluation metrics with Word2vec with CBOW at 800 dimensions. In the investigation of Japanese notation patterns, plural expressions exceeded 30% across all domains apart from Radiation Therapy. In the fields of Equipment, Informatics, and Radiation Therapy, the frequency of “conversion to transliteration” was notably higher than in other categories, accounting for approximately 20% or more. “Japanese shortened forms” were prevalent, constituting over 20%, within the domains of Medicine and Imaging Diagnosis. Moreover, “different Japanese spellings with the same meaning” appeared most frequently, exceeding 40%, in the areas of Physical Phenomena and Radiation Control (Table 9).

Table 9 The percentage of each Japanese expression pattern in the eight fields.

Discussion

Comparison between Word2vec and fastText

Across all indices, fastText consistently outperformed Word2vec. Notably, fastText with the CBOW architecture peaked in performance at 300 dimensions, and its scores remained relatively stable across vector dimensions: the disparity between the maximum and minimum values was a mere 0.03, translating to a difference of about 30 words attributable to the choice of dimensionality. In the case of Word2vec, models built with the skip-gram approach outperformed those based on CBOW, with the exception of the 100-dimensional representation. Given these outcomes, the most effective architecture for synonym extraction within the domain of radiological technology is the fastText model with the CBOW approach at between 300 and 400 vector dimensions. However, considering the tendency for synonyms to appear at lower ranks, effective automation would require extracting terms from multiple ranks and implementing a robust filtering mechanism to refine the synonym selection.

Analysis for the synonym expression patterns

In the evaluation of the seven expression patterns, fastText outperformed Word2vec in four categories: “transliteration variants,” “different Japanese forms with the same pronunciation and meaning,” “Japanese shortened forms,” and “plural expressions.” A common feature of the synonym sets in these four categories was that the words within a set differed by only a few characters; expressed differently, the synonyms in these categories shared a common n-gram. Figure 5 shows the distribution of word vectors obtained by t-distributed stochastic neighbor embedding (t-SNE), an unsupervised dimension reduction technique30, focusing on the sets of “Japanese shortened forms.” In fastText, terms with the same n-gram tend to be located closer together, whereas in Word2vec, clusters of words that include the same n-gram tend to spread widely and overlap. This result also suggests that fastText is advantageous for detecting synonyms with a shared n-gram at high rankings.

Figure 5
figure 5

t-SNE map of terms in synonym sets. The left panel is for Word2vec with CBOW and 800 vector dimensions, and the right panel is for fastText with CBOW and 400 dimensions. Words are sets in “shortened Japanese.” Green words include “irradiation(照射)” or “irradiation method(照射法).” Blue words show “contrast (造影)” or “contrast method (造影法).” Red words show “-graphy (撮影)” or “-graphy method (撮影法).”
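The projection in Fig. 5 can be reproduced in outline with scikit-learn's t-SNE. The random vectors below are a stand-in for the actual term embeddings (the 800-dimensional Word2vec or 400-dimensional fastText vectors of the “Japanese shortened forms” terms); in practice each row would be a model vector such as `model.wv[term]`.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for term embeddings; replace with rows of model.wv[term].
vectors = rng.normal(size=(30, 50))

# perplexity must be smaller than the number of points.
tsne = TSNE(n_components=2, perplexity=5.0, init="random", random_state=0)
coords = tsne.fit_transform(vectors)  # 2-D coordinates for plotting
```

The resulting two-dimensional coordinates are what is plotted, colored by synonym group, in a map like Fig. 5.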

The optimal architecture (CBOW or skip-gram) and the optimal number of vector dimensions in these four categories differed depending on the category of the synonym sets. Regarding architecture, skip-gram was preferable only for “different Japanese forms with the same pronunciation and meaning.” With fastText and CBOW, the tendency is to propose words containing a common n-gram as synonyms, whereas skip-gram tended to extract synonym pairs that did not share a common n-gram. The more sets that include a common n-gram, the more advantageous CBOW becomes; otherwise, it may be equivalent to skip-gram, or skip-gram may be advantageous. Regarding vector dimensions, the ratio improved or did not change significantly as the number of dimensions grew. As reported in previous studies16,28, accuracy improves as the number of vector dimensions increases. However, the optimal vector size may change depending on the characteristics of the synonym sets; this matter is a subject for future investigation.

For “conversion to transliteration,” “Japanese words and English acronyms,” and “Japanese and English words,” the four indices for Word2vec were equivalent or superior to those of the fastText models, depending on the number of vector dimensions. However, the best accuracy was only about 50% for “Japanese words and English acronyms” and less than 30% for the others, inferior to that in the four categories above. The synonym sets in these categories had few or no common character strings, which makes it difficult for fastText to perform well, so ways of improving the accuracy of Word2vec need to be considered.

It has been observed that when generating outputs involving English words and abbreviations, Japanese words tend not to rank at the top. This phenomenon is likely attributable to the models being predominantly trained on a Japanese corpus, thereby limiting their exposure to sufficient English expressions. Notably, the accuracy for both English words and abbreviations improved significantly when the models were constrained to output only alphabet characters. This finding suggests that specifying character sets can be a beneficial strategy when aiming to generate outputs in a language different from that of the training corpus.

Synonym expression patterns

In the domains of “Image Engineering,” “Physical Phenomena,” “Equipment,” “Radiation Therapy,” “Medicine,” and “Imaging Diagnosis,” optimal outcomes were observed with fastText and the CBOW model. In these fields, synonyms commonly included words with shared n-grams. Conversely, in the area of “Radiation Control,” the fastText model with skip-gram was favored. This preference could be attributed to skip-gram's enhanced capability for detecting words that pose challenges for fastText with CBOW, thereby potentially contributing to the improved performance in the evaluation indices. In the field of “Informatics,” a considerable number of synonyms were categorized under “different Japanese writings of the same meaning” and “Japanese word and Japanese transliteration,” categories distinguished by a markedly lower frequency of shared n-grams. In scenarios characterized by a reduced prevalence of common n-grams among synonyms, the Word2vec model may exhibit a comparative advantage over fastText.

Comparison with previous studies

When contrasting the skip-gram and CBOW architectures within Word2vec, several studies have indicated superior performance by CBOW in tasks such as similar word detection and text classification31,32. In the context of fastText, existing research has posited that skip-gram surpasses CBOW, especially in sentiment analysis-based classification33,34. The outcomes of our research align with these findings for Word2vec. In our study, however, the difference in the cumulative ratio between the most accurate model (CBOW) and skip-gram was marginal, at approximately 1.9%, as detailed in Table 5. This marginal difference could hint at an inherent advantage for skip-gram depending on the nature of the task at hand.

Regarding the dimensionality of word embeddings, various studies have explored the relationship between accuracy and vector dimensions. For instance, Mikolov et al. reported that accuracy increased as the vector size increased26, while Pennington et al. highlighted a peak in accuracy around the 300-dimensional mark35. Our results resonate with these observations, underscoring the general trend seen in word embedding research. However, it is worth noting that these prior studies did not exclusively target medical terminology and differed from our research in their specific tasks.

Limitations

The synonym set employed in this study was curated by two experts drawing on a glossary provided by academic societies; this approach is not impervious to potential omissions. Additionally, as part of the preprocessing, spaces were introduced between words in Japanese text using a morphological analysis tool. A significant challenge in morphological analysis is the handling of out-of-vocabulary (OOV) words: when OOV words not covered in the dictionary are analyzed, there is a risk of incorrect segmentation. This becomes especially problematic when the OOV word is a technical term, as it may hinder the successful identification of synonym candidates. Finally, because the text collected for the learning corpus primarily focuses on “image diagnosis,” learning may have been inadequate for areas of low relevance, such as radiation measurement.

Conclusions

The application of Word2vec and fastText models for automatic synonym detection in the field of radiological technology indicated that fastText with CBOW at 300 dimensions was the most accurate. In the detailed analysis of synonym notation patterns, fastText with CBOW excelled in cases where synonyms shared common n-grams, whereas fastText with skip-gram and Word2vec with CBOW were more effective in cases where synonyms did not share common n-grams. In the eight fields pertinent to radiological technology, the fastText with CBOW model proved particularly beneficial owing to the frequent occurrence of common n-grams. However, in the field of informatics, where English terms, acronyms, and transliterations are commonly employed, the Word2vec model with the CBOW architecture showed greater utility.