Oral manifestations in patients with coronavirus disease 2019 (COVID-19) identified using text mining: an observational study

Text mining enables search, extraction, categorisation and information visualisation. This study aimed to identify oral manifestations in patients with COVID-19 using text mining to facilitate extracting relevant clinical information from a large set of publications. A list of publications from the open-access COVID-19 Open Research Dataset was downloaded using keywords related to oral health and dentistry. A total of 694,366 documents were retrieved. Filtering the articles using text mining yielded 1,554 oral health/dentistry papers. The list of articles was classified into five topics after applying a Latent Dirichlet Allocation (LDA) model. This classification was compared to the author's classification which yielded 17 categories. After a full-text review of articles in the category “Oral manifestations in patients with COVID-19”, eight papers were selected to extract data. The most frequent oral manifestations were xerostomia (n = 405, 17.8%) and mouth pain or swelling (n = 289, 12.7%). These oral manifestations in patients with COVID-19 must be considered with other symptoms to diminish the risk of dentist-patient infection.

Sandra Guauque-Olarte 1*, Laura Cifuentes-C 2 & Cristian Fong 3
Between December 2019 and May 2023, a total of 362,313 publications related to COVID-19 were published (PubMed). Data mining enables extracting information from large amounts of text to be analysed using text mining methods. Text mining facilitates information extraction, categorisation, grouping, trend analysis and visualisation [7]. The goal is to focus information searches to remove noise [8] and to identify hidden knowledge in the literature [9]. Zengul et al. used text mining to classify literature from the NIH COVID-19 Portfolio. They found several topics of COVID-19 research, such as patient care and outcomes, epidemiologic modelling, mental health and detection [10]. Tandan et al. used data mining to identify the common pattern of symptoms in patients with COVID-19 [7]. Reddy et al. developed a biomedical platform to collect data on COVID-19 clinical risks. The use of this platform revealed the difficulty of extracting relevant clinical information through text mining and the need for feedback from experts in the field to obtain reliable results [9].
Therefore, this study aimed to identify oral manifestations in patients with COVID-19 using text mining to facilitate extracting relevant clinical information from a large set of publications. Our study identified frequent

Curation of the list of articles
To extract from the text file only articles related to COVID-19 and oral health/dentistry, we used a modified version of the "Mining COVID-19 scientific paper" notebook, which is available on Kaggle (https://www.kaggle.com/mobassir/mining-covid-19-scientific-papers). The text file was uploaded to the notebook. Stop words (commonly used words, usually articles, that search engines are programmed to ignore) were removed from the abstract sentences. Stop words commonly found in scientific publications (such as "objective", "methods" and "conclusions") were also eliminated. Using the Gensim Python library, a 5-gram (five-word combination) model was created [11], and 494 "exclusion keywords" were identified. After visually confirming that the abstracts containing those "exclusion keywords" were not relevant, Python code was used to exclude them from the text file.
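The curation steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the Kaggle notebook or the Gensim code used in the study; the stop-word lists, the example abstracts and the two exclusion keywords are hypothetical placeholders.

```python
import re
from collections import Counter

# Illustrative stop-word lists (assumptions, not the lists used in the study).
GENERAL_STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "was", "were", "for", "on", "with"}
SCIENTIFIC_STOP_WORDS = {"objective", "methods", "results", "conclusions"}
STOP_WORDS = GENERAL_STOP_WORDS | SCIENTIFIC_STOP_WORDS


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def ngrams(tokens, n=5):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(abstracts, n=5):
    """Count n-grams across all abstracts so frequent phrases can be inspected."""
    counts = Counter()
    for abstract in abstracts:
        counts.update(ngrams(tokenize(abstract), n))
    return counts


def exclude_by_keywords(abstracts, exclusion_keywords):
    """Drop abstracts that contain any exclusion keyword as a whole token."""
    return [a for a in abstracts if not set(tokenize(a)) & exclusion_keywords]


# Hypothetical abstracts and exclusion keywords, for illustration only.
abstracts = [
    "Objective: dental clinics adopted new biosafety protocols during the pandemic.",
    "Methods: oral vaccine delivery in mice was evaluated for immune response.",
]
exclusion_keywords = {"mice", "vaccine"}
kept = exclude_by_keywords(abstracts, exclusion_keywords)
five_gram_counts = count_ngrams(kept, n=5)
```

In this sketch the exclusion keywords are supplied by hand; in the study they were identified from the n-gram output and checked visually before being applied.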

Classification of articles
The next step of the text mining analysis was to apply a Latent Dirichlet Allocation (LDA) model to the abstracts to classify the curated list of articles into topics. LDA is a generative probabilistic model that represents each document as a mixture of topics based on the frequency of the document's words. A distribution of words characterises each topic. The topic probabilities provide an explicit representation of a document [12]. The model assigned a dominant topic to each document, together with its percentage contribution; thus, a corpus of documents can be classified into topics depending on their subject.
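As an illustration of the last step, the dominant topic and its percentage contribution can be read directly from a document's topic probabilities. The sketch below assumes the probabilities have already been produced by a fitted LDA model; the document names and values are hypothetical.

```python
def dominant_topic(topic_probabilities):
    """Return (topic_index, percentage_contribution) for one document."""
    total = sum(topic_probabilities)
    if total <= 0:
        raise ValueError("topic probabilities must sum to a positive value")
    index = max(range(len(topic_probabilities)), key=lambda i: topic_probabilities[i])
    return index, 100.0 * topic_probabilities[index] / total


# Hypothetical document-topic distributions over five topics (0 to 4).
doc_topics = {
    "doc_a": [0.05, 0.70, 0.10, 0.10, 0.05],
    "doc_b": [0.02, 0.40, 0.03, 0.45, 0.10],
}
assignments = {doc: dominant_topic(p) for doc, p in doc_topics.items()}
```

Here "doc_b" is assigned to topic 3 although topic 1 contributes almost as much, which shows how a document can belong to more than one topic while only one is reported as dominant.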

Author's review of documents
To determine whether the text mining process could classify the articles into the same categories an expert would, the authors reviewed each article's title, abstract and excerpt in the curated text file and classified them. Finally, one category ("Oral manifestations in patients with COVID-19") was selected for full-text review, and a database was created to summarise the results. The study workflow is summarised in Fig. 1.

Ethics approval
Because the study was based on text mining, ethical approval was not required.

Results
Between January 1, 2020, and December 31, 2021, the CORD-19 dataset contained 694,366 documents. Only 1,554 were related to oral health/dentistry, even though the oral cavity is an entry point and reservoir for SARS-CoV-2 [13,14] and the virus causes oral lesions [15][16][17]. The main subjects of the documents retrieved were biosafety and changes in dental practice during the pandemic, the psychological implications for dentists and patients, and the impact of the COVID-19 pandemic on dental education.
Before downloading data and using Python commands, we removed 163,685 preprints, 361,713 duplicate documents based on title and 406 duplicates based on the abstract. No records had a missing abstract. A keyword search was performed on the remaining 168,562 documents, yielding 5,075 articles that were downloaded: 42 for "buccal", 1,294 for "dental", 433 for "dentistry", 797 for "dentist", 17 for "odontology", 2,489 for "oral" and 3 for "stomatognathic". The remaining articles were written in English, French and Spanish and were included in further analysis. A file containing the 5,075 documents is available in Supplementary Table 1.
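The filtering counts reported above can be checked with simple arithmetic. The short sketch below uses only the figures given in the text; the variable names are illustrative.

```python
# Figures taken from the text; names are illustrative.
total_records = 694_366
removed = {
    "preprints": 163_685,
    "duplicate_titles": 361_713,
    "duplicate_abstracts": 406,
    "missing_abstracts": 0,
}
remaining = total_records - sum(removed.values())

keyword_hits = {
    "buccal": 42,
    "dental": 1_294,
    "dentistry": 433,
    "dentist": 797,
    "odontology": 17,
    "oral": 2_489,
    "stomatognathic": 3,
}
total_hits = sum(keyword_hits.values())

print(f"Documents remaining after filtering: {remaining:,}")
print(f"Articles retrieved by keyword search: {total_hits:,}")
```

Both totals agree with the reported values: 168,562 documents remained after filtering, and the per-keyword counts sum to the 5,075 downloaded articles.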
After the first run of text mining applying the LDA model, 1,554 papers related to oral health/dentistry were classified into five topics (Fig. 2). Topics 0, 1, 2, 3 and 4 contained 68, 826, 185, 242 and 233 papers, respectively. The size of topic 1 remained between 700 and 800 documents in successive runs of the LDA model adjusted to generate 8 or 10 topics. Therefore, the initial classification of five topics was kept.
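As a quick consistency check, the topic sizes can be summed and converted to percentages of the curated list. This sketch uses only the counts reported above.

```python
# Number of papers per LDA topic, as reported in the text.
topic_sizes = {0: 68, 1: 826, 2: 185, 3: 242, 4: 233}

total_papers = sum(topic_sizes.values())
topic_shares = {
    topic: round(100.0 * size / total_papers, 2)
    for topic, size in topic_sizes.items()
}

for topic, share in topic_shares.items():
    print(f"Topic {topic}: {topic_sizes[topic]} papers ({share}%)")
```

The sizes add up to the 1,554 curated papers, topic 1 alone holds just over half of them, and the share computed for topic 4 matches the 14.99% listed for that topic in Table 1.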
Table 1 shows the number of papers classified by topic and each topic's top 10 words or descriptive terms. According to the top 10 words, the five topic subjects were (0) the oral cavity as an entrance for the coronavirus into the body, (1) dental clinical practice and services offered during the pandemic, (2) the risk of COVID-19 transmission during clinical practice, (3) the psychological implications of COVID-19 for dentists and the impact of COVID-19 on dental schools and education, and (4) oral manifestations of COVID-19, oral hygiene and oral cancer. A multidimensional scaling analysis showed that topics 1 and 3 were the most similar.
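Topic similarity of this kind is usually derived from pairwise distances between the topics' word distributions, which multidimensional scaling then projects onto a plane. The text does not state which distance was used, so the sketch below assumes the Jensen-Shannon divergence, a common choice in topic-model visualisation; the word distributions are hypothetical.

```python
import math
from itertools import combinations


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def most_similar_pair(topic_word_distributions):
    """Return the pair of topic labels with the smallest divergence."""
    return min(
        combinations(sorted(topic_word_distributions), 2),
        key=lambda pair: js_divergence(
            topic_word_distributions[pair[0]], topic_word_distributions[pair[1]]
        ),
    )


# Hypothetical word distributions over a shared four-word vocabulary.
topic_words = {
    1: [0.40, 0.30, 0.20, 0.10],
    3: [0.35, 0.30, 0.20, 0.15],
    4: [0.05, 0.05, 0.20, 0.70],
}
closest = most_similar_pair(topic_words)
```

With base-2 logarithms the divergence ranges from 0 for identical distributions to 1 for distributions with no words in common, so smaller values indicate topics that would sit closer together on the scaling plot.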

Agreement between text mining and author's review
The first round of text mining resulted in 1,554 papers, which the authors reviewed by title, abstract and excerpt. The authors classified the papers into 17 categories based on their content (Table 2). Thus, the five topics generated by the LDA model during the text mining analysis were divided into more specific subjects. The main categories obtained by the author's review were "Biosafety in dental practice" (n = 186, 17.4%) and "Dental practice during the pandemic" (n = 182, 17.0%).
Finally, the full text of 21 papers in the category "Oral manifestations in patients with COVID-19" was reviewed due to its clinical relevance. This category is part of topic 4. A database was created to summarise the results of the eight studies that passed the full-text review and reported the frequency of the oral manifestations identified in patients with COVID-19 (Table 3). In these eight studies, the most common oral manifestations were xerostomia (the patient's sensation of dry mouth), reported in six studies; burning mouth and mouth pain or swelling, reported in five studies; impaired taste and ulcers, in four studies; and dysphagia, in three studies (Supplementary Table 2). The oral manifestations were detected by clinical staff diagnosis and by self-diagnosis. The two most frequent oral manifestations diagnosed by clinical staff were salivary gland ectasia (n = 46, 2.0%) and U-shaped lingual papillitis (n = 35, 1.5%). The most frequent oral manifestations identified by self-diagnosis were xerostomia (n = 389, 17.1%) and mouth pain or swelling (n = 265, 11.7%) (Table 4).

Discussion
The COVID-19 pandemic has profoundly impacted society, not only in terms of health but also in social aspects such as the economy and education. Lockdown measures to contain the spread of COVID-19 led to the closure of businesses, schools, commercial stores, and government offices. The practice and teaching of odontology have been affected during the pandemic due to the high risk of exposure to SARS-CoV-2 [18][19][20]. The cancellation of most dental appointments during the pandemic, except for emergencies, and the stringent safety measures that odontologists had to take to resume patient care during economic reactivation affected odontology worldwide. Virtual classes, fewer patients in clinics, and the inability to conduct research with patients were challenges for dental schools [21]. Due to its clinical relevance, we selected the topic "Oral manifestations in patients with COVID-19", among 17 categories, for a deeper analysis. The two most common oral alterations in patients with COVID-19 diagnosed by clinical staff were salivary gland ectasia and U-shaped lingual papillitis. The main oral manifestations identified by self-diagnosis were xerostomia and mouth pain or swelling. Early infection of the salivary glands by SARS-CoV-2 may be associated with salivary gland ectasia. This hyperinflammation of the salivary gland is significantly related to C-reactive protein and LDH levels, both of which are biomarkers of COVID-19 severity [22].
Patient evaluation has shown that salivary gland ectasia is associated with a more severe course of COVID-19 [23]. U-shaped lingual papillitis is an inflammation of the tongue's papillae. It could be caused by direct inflammation, drying of the oral mucosa or poor oral health [24].
The presence of xerostomia in patients with COVID-19 has been reported previously [25][26][27]. Xerostomia is a sign of dehydration, which can occur secondary to infections [28]. Early infection of the salivary glands by SARS-CoV-2, affecting their function and leading to changes in the flow and composition of saliva, is one possible cause.
Data mining strategies, such as the one applied here, help analyse a vast amount of information; however, they require human understanding of the keyword combinations and of the output produced by the method to avoid missing relevant documents or including thousands of irrelevant articles when pursuing the aim of the study.
The identification of 17 categories or subjects reflects the variety of themes covered within a single paper; for example, an article can focus on biosafety, psychological repercussions, and epidemiology simultaneously. When an article covers more than one subject, the LDA model assigns the document to more than one topic, although one of these topics may dominate the others. Therefore, the topics can overlap, as topics 1 and 3 do in the present study. Another consequence of the subject variety within a paper is that each topic obtained by the model included more than one subject. For example, topic 4 was related to oral manifestations of COVID-19, oral hygiene and oral cancer. Zengul et al. (2021) used a text mining approach to classify the NIH COVID-19 Portfolio as of November 2021 based on abstracts and titles. They identified 11 major research areas (topics), including "Epidemiologic Modelling", "Mechanism of Disease", "Protection/Prevention", "Mental/Behavioural Health" and "Detection/Testing". Additionally, they found that only five of the 11 abstract-based topics had a significant correlation with title-based topics and recommended revising the use of titles as the first step in developing an evidence-based medicine analysis portfolio [10]. None of the topics covered by Zengul et al. was related to oral health or dentistry.
Our analysis also reveals a challenge in selecting and classifying papers based on titles, abstracts, or excerpts, as is common in systematic reviews and evidence-based medicine/dentistry. To build the corpus for text mining analysis in dentistry, we suggest using keywords or MeSH terms as the first filter of articles. One limitation of this study is its reliance on a single source of information: the literature analysis was based on the COVID-19 Open Research Dataset (CORD-19) database.
A future direction for this study is to apply text mining tools to identify the keywords that distinguish the 17 categories generated by the "author's review" and to use those keywords as the basis for machine learning algorithms that build a predictive model. Such a model could improve the selection and classification of the papers, reducing the time and need for manual curation of the documents.
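One simple way to begin identifying such keywords, offered here only as a sketch and not as a method used in the study, is to score each term by how much more often it appears in documents of one category than in all the others. The labelled texts below are hypothetical, and the scoring rule (a difference in document frequency) is an assumption chosen for simplicity.

```python
from collections import defaultdict


def distinguishing_keywords(labelled_docs, top_n=2):
    """Rank terms per category by in-category minus out-of-category document frequency."""
    docs_by_category = defaultdict(list)
    for category, text in labelled_docs:
        docs_by_category[category].append(set(text.lower().split()))

    vocabulary = set().union(
        *(tokens for docs in docs_by_category.values() for tokens in docs)
    )

    keywords = {}
    for category, docs in docs_by_category.items():
        others = [d for c, ds in docs_by_category.items() if c != category for d in ds]
        scores = {}
        for term in vocabulary:
            inside = sum(term in d for d in docs) / len(docs)
            outside = sum(term in d for d in others) / len(others) if others else 0.0
            scores[term] = inside - outside
        # Sort by descending score, breaking ties alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords[category] = ranked[:top_n]
    return keywords


# Hypothetical labelled texts; category names follow the author's review.
labelled_docs = [
    ("Biosafety in dental practice", "aerosol protective equipment dental clinic"),
    ("Biosafety in dental practice", "aerosol mouthwash protective dental"),
    ("Oral manifestations in patients with COVID-19", "xerostomia ulcers dental patients"),
    ("Oral manifestations in patients with COVID-19", "xerostomia taste loss patients"),
]
keywords = distinguishing_keywords(labelled_docs)
```

Terms shared across categories, such as "dental" in this toy example, receive low scores, while terms confined to one category rise to the top and could serve as candidate features for a classifier.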

Conclusions
Using text mining to identify oral health/dentistry-related articles in a large dataset of COVID-19 publications, we found that patients with COVID-19 experienced a wide range of oral changes. The most common oral manifestations were those affecting the salivary glands, such as xerostomia or salivary gland ectasia, followed by oral pain and inflammation.
Text mining is a helpful tool for analysing and sorting massive document datasets. However, it was necessary to combine text mining with the author's review by title, abstract and excerpts to avoid the loss of data or the inclusion of unnecessary information.

Figure 1. Workflow of the analysis. The documents available in the COVID-19 Open Research Dataset (CORD-19) were filtered using the "Summary page COVID-19 risk factors" and the "Mining COVID-19 scientific paper" notebooks available on Kaggle (https://www.kaggle.com/mlconsult/summary-page-covid-19-risk-factors and https://www.kaggle.com/mobassir/mining-covid-19-scientific-papers, respectively). The text mining classification of the documents was compared to the author's classification of the list based on title, abstract and excerpt. Finally, eight papers in the category "Oral manifestations in patients with COVID-19" were selected after full-text review to extract data.

Figure 2. Word clouds of the five topics obtained after applying the Latent Dirichlet Allocation (LDA) model. The font size of the 30 most frequent words per topic reflects each word's frequency of appearance.
To make COVID-19-related literature more accessible, biomedical literature databases such as the open-access COVID-19 Open Research Dataset (CORD-19) [6] and the NIH COVID-19 Portfolio (https://icite.od.nih.gov/covid19/search/) were created. The Allen Institute for Artificial Intelligence created CORD-19 in collaboration with the National Institutes of Health and the White House Office of Science and Technology Policy, among others. In CORD-19, the PDF documents are converted into machine-readable JSON files that can be manipulated using programming languages. CORD-19 had 694,366 records by December 2021. The final CORD-19 release was on June 2, 2022.

Table 1. Top 10 keywords or descriptive terms per topic and number of articles per topic.
Topic 4: …, symptom, positive, test, case, group, compare, day, hcw (healthcare workers); n = 233 (14.99%)

Table 2. Description of the 17 categories obtained after the authors reviewed the title, abstract and excerpt. The count and frequency of articles per category are shown.

Table 3. Characteristics of the included studies and the population evaluated.