Introduction

With the increasingly widespread use of electronic medical records (EMRs), a vast number of medical records are being generated daily and stored electronically. The increase in the amount of accumulated data has brought about investigations for various secondary uses of data1, which may be structured or unstructured. Compared with analyses of structured data, analyses of unstructured data, such as narrative text, have been reported to allow for the more accurate extraction of symptoms expressed by patients2; therefore, their use has been recommended3. However, analyzing narrative text manually not only is labor-intensive, but also can lead to variable results; therefore, natural language processing (NLP) using a computer is indispensable4. NLP enables an immense amount of narrative text to be analyzed in a timely and low-cost manner, in addition to the extraction of invariable information1. As the use of NLP in medical care progresses, there are high expectations for the realization of quality improvement in medical care and more efficient clinical research5. This is also the case in all digestive system fields, including gastrointestinal disease, liver disease, and biliary and pancreatic disease, which would benefit from easier access to clinical information1; however, collaboration between NLP and medical experts who can provide domain knowledge is essential to make this practical6.

Extracting information such as disease names, symptom names, and adverse events from narrative text in medical documents is one of the goals of NLP research7,8, and it is similar in the Japanese domain9,10,11,12,13. In recent years, these information extraction techniques have been applied to the development of various predictive models using machine learning, but the risk of misusing the outputs of the system has been pointed out in the process14,15,16. In particular, Hovy et al.17 warned that the characteristics of the data to be analyzed are one of the causes of bias and lack of fairness that occur with the use of NLP systems. To promote the correct utilization of NLP in medical care, it is therefore essential to gain a better understanding of the characteristics of various medical documents stored in EMRs from the perspective of NLP.

Various types of medical documents written using narrative text are included in EMRs, such as admission notes, progress notes, and discharge summaries18,19; this is no different in Japanese EMRs20,21. In particular, progress notes are frequently written by physicians (hereinafter referred to as MD notes) and nurses (hereinafter RN notes)19, and are considered to be the most suitable target of analysis for capturing the constantly changing conditions of patients. The biggest difference between MD and RN notes is linked to the different roles that the respective documenters fulfill in medical care. For example, a progress note for the same patient written on the same day by a physician and a nurse will differ because they focus on different things; accordingly, there might be heterogeneity in the information noted. In particular, nurses are specialists at noticing and caring for patients’ symptoms22,23, so their notes may contain a substantial amount of symptom-related information. However, few studies have quantitatively examined the differences between MD and RN notes.

Therefore, the present study utilized the NLP software developed by the Nara Institute of Science and Technology, MedNER-J13,24. MedNER-J can analyze and extract information from Japanese narrative medical documents and codes complying with the International Statistical Classification of Diseases and Related Health Problems, 10th Revision (hereinafter ICD codes)25. In the present study, with the aim of gaining a better understanding of the characteristics of both MD and RN notes, we conducted a quantitative comparison based on extraction rates. In particular, we focused on ICD Chapter 18 (R00–R99, hereinafter R codes), which refer to the names of symptoms, as a possible way to clarify the differences between MD and RN notes and those with occupational backgrounds.

Methods

Study population

Among the 11,046 patients who were hospitalized at Kagawa University Hospital (KUH) from January to December 2021, those who were admitted to the Gastroenterology Department as their main department and to the digestive system ward were targeted for analysis (Fig. 1). KUH is the only national university hospital in Kagawa Prefecture, which is located in the Shikoku region of Japan and has a population of approximately 1 million. To unify the organ specialty of physicians and nurses writing progress notes, we excluded patients who had been admitted to a non-gastroenterology department or a non-digestive system ward (n = 9907) or who had a history of ward transfer (n = 2786). As a result, a total of 1113 eligible patients were identified and randomly enrolled until the patients’ cumulative total hospital stay exceeded 800 days, based on the sample size calculation described below (see the Statistical Analysis section). Finally, a total of 83 patients were enrolled in this study. The total number of hospital days for the 83 patients was 806 days. As background information, we investigated these patients’ age, sex, name of the illness they had been hospitalized for, and duration of hospital stay.

Figure 1
figure 1

Study population. To unify the organ specialty of the physicians and nurses writing progress notes, we randomly selected 83 patients from among those hospitalized in the Gastroenterology Department and digestive system ward.

This study was approved by the Research Ethics Committee of Kagawa University Faculty of Medicine through an ethical review (receipt No. 2022-007) and was performed in accordance with the Declaration of Helsinki and the Ethical Guidelines for Medical and Biological Research Involving Human Subjects in Japan. Informed consent was obtained from all patients and/or their legal guardians through an opt-out approach prior to conducting the study.

Data set preparation

Progress notes for 806 days linked to the 83 patients written by physicians or nurses were collected from the EMR system (HOPE EGMAIN-GX; Fujitsu Ltd., Tokyo, Japan) at KUH and used as an analytical data set. To distinguish the date of each collected progress note, the recording dates for each document were also obtained at the same time (index date). Information about the number of documents, the number of days for which the document existed, the number of documents per hospitalization, and the number of characters per document was also collected. In addition, for all the documents included in the analytical data set, the physicians and nurses who wrote them were identified; the top 10 physicians and nurses who wrote the most documents were also identified and the total number of documents written by these top 10 members was calculated. Furthermore, part of the analytical data set, 106 of 806 days, was used as a validation data set to evaluate the performance of MedNER-J (described below in the Validation of MedNER-J section). The validation data set was created by repeating random inclusion for each patient unit until the total number of hospital days exceeded 10% of the analytical data set26. The validation data set consisted of 10 patients.

Experimental environment

In this study, MedNER-J was operated with the default settings (using a model named BERT, and post-processing using a Japanese disease/symptom name dictionary named J-MeDic27) in an environment constructed with the Ubuntu 20.04.3 LTS operating system, the Python 3.8.10 programming language, the MeCab 0.996 Japanese morphological analyzer28, and the IPAdic 2.7.0 MeCab dictionary29 to extract the ICD codes. MedNER-J extracts the name of the illness from the narrative text of Japanese medical documents by applying conditional random fields to bidirectional encoder representations from transformer embeddings30. It also supports the processing of negation and can determine factuality (i.e., distinguish between positive and negative findings). MedNER-J can also assign an ICD code to the extracted illness name by post-processing using J-MeDic. This software is available for public use under the BSD 2-Clause “Simplified” License, and the source code is publicly available on GitHub (https://github.com/sociocom/MedNER-J).

Validation of MedNER-J

A validation test was performed to evaluate whether the performance of MedNER-J was sufficient for this study. To this end, a board-certified gastroenterologist of the Japanese Society of Gastroenterology (Y.M.) manually reviewed the validation data set to define the gold standard (GS) and determined the applicability of the R codes in each progress note. For each patient, each day (every index date) with one or more expressions that corresponded to an R code was classified as “GS-positive”, and the absence thereof as “GS-negative”. Processing of the validation data set using MedNER-J was performed in parallel. Similarly, for each patient, each day (every index date) with one or more extractions that corresponded to an R code was classified as “MedNER-J-positive,” and the absence thereof as “MedNER-J-negative.” Cohen’s kappa coefficient31 with the 95% confidence interval [CI] was calculated as an index of concordance between the GS and MedNER-J for each MD and RN note, and the R-code extraction performance of MedNER-J was evaluated according to the criteria of Landis32.

Comparison between MD and RN notes

After processing for the analytical data set using MedNER-J was performed, as a primary evaluation, MD and RN notes were compared using the R-code extraction rates as an indicator (Fig. 2). Notes were classified as “Positive” if there was even one extraction of an R code per patient per day (every index date), and “Negative” otherwise. In the MD and RN notes, the number of “Positive” days was divided by 806 to calculate the R-code extraction rate with the 95% CI, i.e., the percentage of days in which an R code was extracted.

Figure 2
figure 2

Conceptual diagram. MD and RN notes for the same patients written on the same days were analyzed using MedNER-J, and then compared in terms of R-code extraction rates. Abbreviations: MD notes, progress notes written by physicians; RN notes, progress notes written by nurses; R codes, Chapter 18 R00–R99 (Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified) of the International Classification of Diseases.

As a secondary evaluation, we compared the ICD code extraction rate from MD and RN notes by R-code subcategories (i.e., by symptom name details), and by chapter-level categories of the ICD codes other than Chapter 18 (R codes), which mainly consists of disease names. As with the primary evaluation, the number of “Positive” days was divided by 806 to calculate the extraction rate (%) with the 95% CI for each ICD code. Furthermore, daily changes in the number of R-code extractions from MD and RN notes were compared per patient, and some patient cases were assessed based on the clinical course.

Statistical analysis

Based on estimations obtained using G*Power 3.1.9.633, a sample of 796 days was needed to provide 80% power to detect differences in the R-code extraction rate between MD and RN notes, with a two-sided type I error of 0.05, assuming that discordant pairs in the primary evaluation (i.e., “Positive” MD notes against “Negative” RN notes, or “Negative” MD notes against “Positive” RN notes) would occur in one fourth of the total at a ratio of 1.5. In addition, McNemar’s test was performed using R 4.1.234 for hypothesis testing regarding the difference in the proportions in the primary and secondary evaluations. For all analyses, a two-sided p value < 0.05 was considered statistically significant.

Results

Background characteristics

Table 1 shows the clinical backgrounds of the 83 patients analyzed in the present study (43.4% female; median age, 74 years). Regarding the name of the main illness for hospitalization, gastrointestinal disease was the most common (45.8%), followed by biliary disease (22.9%) and liver disease (19.3%). The median duration of hospital stay per patient was 8 days. Table 2 shows the analytical data set, which consisted of 1415 MD and 3456 RN notes. Of the 806 days for which we attempted to collect progress notes, MD notes were present in 92.4% and RN notes in 99.9%. There was a median of 12 MD and 29 RN notes per hospitalization, respectively. Each MD and RN note had a median of 409 and 122 characters, respectively. The MD notes were written by a total of 74 physicians, while the RN notes were written by a total of 68 nurses. Of these, documents written by the top 10 members in terms of the number of documents written made up 69.0 and 43.9% of MD and RN notes, respectively.

Table 1 Clinical backgrounds of the patients.
Table 2 Characteristics of the data set.

Performance of MedNER-J

In the validation data set, the MD and RN notes were defined as “GS-positive” in 80 (74.5%) and 97 (91.5%) of the 106 days, respectively. By contrast, the MD and RN notes were defined as “MedNER-J-positive” in 81 (76.4%) and 92 days (86.8%), respectively. As for the concordance between GS and MedNER-J, Cohen’s kappa coefficients for the MD and RN notes were 0.61 (95% CI 0.44–0.79) and 0.66 (95% CI 0.43–0.89), respectively, both showing substantial concordance according to the criteria of Landis.

Comparison by R codes and other ICD codes

Tables 3 and 4 show the extraction rates for each code from the MD and RN notes, calculated by category. Three notable categories of these results are shown in Fig. 3. The R-code extraction rates from the MD and RN notes using MedNER-J were 68.5% (552 days) and 75.2% (606 days), respectively, of the 806 days in the analytical data set. For R-code extraction, discordant pairs occurred in 33.0% (266 days). The R-code extraction rate from RN notes was significantly higher (p = 0.00112). When comparing by R-code subcategories, the extraction rates for R10–R19, which are names of digestive system symptoms, from the MD and RN notes were 44.2 and 37.5%, respectively. The extraction rate for R10–R19 from MD notes was significantly higher than that from RN notes (p = 0.00299). Conversely, the extraction rate from RN notes was significantly higher than that for many other R-code subcategories, such as R40–R46 (MD notes 8.4% vs. RN notes 19.4%, p < 0.001) and R50–R69 (44.0 vs. 59.4%, respectively; p < 0.001). When comparing by chapter-level categories of ICD codes, the extraction rates for Chapter 11 K00–K93, which are disease names of the digestive system, were 68.4% from MD notes and 30.9% from RN notes (p < 0.001); i.e., the extraction rate from MD notes was significantly higher. Excluding Chapter 5 F00–F99 and Chapter 19 S00–T98, the extraction rate from MD notes was significantly higher in most of the other chapter-level categories of the ICD codes.

Table 3 Comparison between MD and RN notes by R code and subcategory.
Table 4 Comparison between MD and RN notes by ICD codes other than R codes.
Figure 3
figure 3

Comparison of extraction rates in three categories of ICD codes. MD notes and RN notes are compared and shown in terms of each code extraction rate. R code, i.e., ICD Chapter 18 R00–R99, consists of symptom names, and R-code subcategory R10–R19 consists of names of digestive system symptoms. Meanwhile, ICD Chapter 11 K00–K93 consists of disease names of the digestive system. Abbreviations: MD notes, progress notes written by physicians; RN notes, progress notes written by nurses; ICD, International Classification of Diseases.

Case presentation

Case 1

Case 1 was a woman in her 70 s diagnosed with an infected pancreatic cyst (Fig. 4a). Previously, she had been hospitalized for more than 1 month and received conservative treatment. Her condition was improved by fasting, but recurred once her normal diet was resumed. Various causes were investigated by the previous physician, but no clear cause was identified; therefore, she was referred to KUH. Her symptoms had subsided on Day 2 at KUH, so the reintroduction of her diet was attempted, but fever was observed again on Day 5. She fasted again and was followed up, but her fever and laboratory findings suddenly exacerbated on Day 8. On the same day, emergency abdominal echography showed common bile duct stones. Emergency endoscopic biliary and pancreatic stenting were performed, and she was cured without recurrence even after the resumption of her diet. Endoscopic lithotripsy was performed on Day 24, and she was discharged on Day 27. The number of R-code extractions from RN notes exceeded that from MD notes during most of the days during the patient’s hospitalization. The number of R-code extractions from RN notes increased dramatically on Days 7 and 21. In particular, the R codes extracted from RN notes on Day 7 were examined individually, revealing that codes that would allow for prediction of exacerbation on Day 8, such as R10 (Abdominal and pelvic pain) and R50 (Fever of other and unknown origin), were extracted multiple times.

Figure 4
figure 4

Case presentation. The lines indicate the laboratory test values, and the bars indicate the number of R-code extractions from each progress note (Black, MD notes; White, RN notes). Some lines are colored gray for readability. (a) The patient was a woman in her 70 s diagnosed with an infected pancreatic cyst. (b) The patient was a woman in her 70 s diagnosed with hepatocellular carcinoma. Abbreviations: R codes, Chapter 18 R00–R99 (Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified) of the International Classification of Diseases; MD notes, progress notes written by physicians; RN notes, progress notes written by nurses; ALP, alkaline phosphatase; γ-GT, gamma glutamyl transpeptidase; Amy, amylase; WBC, white blood cell count; CRP, C-reactive protein.

Case 2

Case 2 was a woman in her 70 s, diagnosed with hepatocellular carcinoma with underlying hepatitis B virus-related cirrhosis, who was admitted to KUH for treatment (Fig. 4b). As initially planned, transcatheter arterial chemoembolization (TACE) was performed on Day 2. She was following a favorable postoperative course for several days, but bloody stool and abdominal pain were observed on Day 7. Radiofrequency ablation (RFA) was originally planned for Day 8, but was suddenly postponed. As a result of the search for the cause of the bloody stool, an unexpected complication, the patient was diagnosed with acute hemorrhagic colitis associated with prophylactic antibiotic therapy for TACE. The bloody stool did not progress to anemia, and her condition improved by discontinuing the suspected cause to observe her course. On Day 16, the postponed RFA was performed, and in preparation, another antibiotic from what was believed to have caused the bloody stool was administered for prophylaxis. Her bloody stool did not reappear after RFA, but she complained of abdominal pain again on Day 18. As the abdominal pain was an expected symptom after the procedure, she received follow-up with symptomatic therapy alone. Her abdominal pain subsequently disappeared, and she was discharged on Day 29. Regarding the laboratory test values, hemoglobin remained within the normal range (data not shown in the figure), and indicators of inflammation such as white blood cell count and C-reactive protein remained within the permissible range after RFA. The number of R-code extractions from the RN notes exceeded that from the MD notes during most of the days of the patient’s hospitalization, similar to Case 1. Both increased several days after each procedure, but a particularly noticeable peak was observed from Days 7 to 10. An individual examination of the extracted R codes showed that R58 (Haemorrhage, not elsewhere classified) was extracted with the appearance of bloody stool in both progress notes. Moreover, codes such as R42 (Dizziness and giddiness), representing symptoms suspected to have occurred secondary to a hemorrhagic event, were extracted from only the RN notes.

Discussion

To our knowledge, this is the first study to show information heterogeneity between MD and RN notes through a comparison of R-code extraction rates, which were calculated based on symptom names from each note written for the same patient on the same day using MedNER-J. To carry out this study, MedNER-J showed sufficient performance when compared with a chart review by a human expert for both MD and RN notes. The content of medical documents is affected by not only the patient or disease condition, but also various other factors, including the documenter’s experience35, so the presence of information heterogeneity between MD and RN notes is understandable. The heterogeneity observed in this study is a bias that should be recognized by NLP system developers. At the same time, the findings suggest the importance of interpreting the output with an understanding of the characteristics of each medical document by NLP system users. In particular, many systems that utilize artificial intelligence algorithms are more affected by their training data36; therefore, medical documents that are used as training data should be treated with special measures to prevent erroneous decision-making.

The analytical data set used in the present study included several days when there were no MD notes, whereas RN notes existed for almost all days; this might have been affected by the work schedules of the physicians and nurses. In other words, attending physicians in Japanese hospitals generally work overtime on weekends and holidays only when necessary to respond to emergencies37, whereas nurses take shifts and provide care without any breaks in between and throughout weekends and holidays38. These differences in the work schedules of physicians and nurses may have been reflected in the differences in the number of days when progress notes were available out of the total number of days during patient hospitalizations. Another difference was that, while there were fewer documents themselves in terms of MD notes per hospitalization compared with RN notes, there were more characters per document. Whereas physicians summarized the daily clinical course of a patient in one document, nurses seemed to be recording details across several documents.

Although a significantly higher R-code extraction rate was observed for RN notes, R-code subcategory R10–R19 was extracted significantly more from MD notes. In comparisons by chapter-level categories of ICD codes, MD notes had a significantly higher extraction rate for most categories, except for Chapter 5 F00–F99 and Chapter 19 S00–T98. These results suggest that MD notes are written using symptom and disease names according to the documenter’s specialty (in this study, the digestive system domain), but are mainly written using disease names for fields outside of their specialty. On the other hand, nurses generally used disease names infrequently, but used symptoms for various fields without being biased toward a specific domain when writing progress notes. This may be linked to the Japanese Medical Practitioners’ Act, which stipulates that “No person except a medical practitioner may engage in medical practice” because making diagnoses is generally considered an absolute medical practice, and only physicians have permission to diagnose exclusively39. The history and availability of nurse practitioners who are allowed to make diagnoses partially are still limited in Japan40, and such nurses were not working at KUH, at least not during the study period. In addition, a study comparing the terminology used by physicians and nurses showed that nurses write the patient’s condition specifically using a variety of terms and expressions, thereby aiming for close information-sharing between nurses and realizing better care practice41. The occupational backgrounds of nurses may also explain why RN notes had fewer expressions of disease names but numerous expressions of symptom names.

Furthermore, in the two case presentations, the R-code extractions from the two types of documents generally matched clinical events, visually speaking. In particular, R-code extractions from RN notes showed sharp fluctuations. In part, the increase in R-code extractions from RN notes was earlier than the worsening of laboratory test values. Generally, there are many clinical conditions in which the patient’s symptoms change before laboratory values, suggesting that this may be reflected by changes in the number of R-code extractions from RN notes. A classic textbook on abdominal examination42, Cope’s Early Diagnosis of the Acute Abdomen, also explained that more diagnoses will be made through the history of a patient’s condition than through various tests, and health-care practitioners must be aware of the earliest symptoms to recognize the early stages of the disease43.

The results of this study suggest that accurate extraction of R codes from RN notes is one of the most important factors in considering the development of future clinical decision support systems using narrative documents in EMRs. In other words, the real-time signal detection of changes in the patient’s condition based on R-code extraction from RN notes, and the notification of such changes to physicians, may be the key to an early diagnosis. Previous studies on the development of information extraction algorithms44 or prognosis prediction algorithms45 have reported the advantages of using nursing records as a data source compared with physician records only. Based on the findings of the present study, we posit that the diverse range of symptom descriptions within nursing records, unbiased toward any specific specialty, could positively contribute to the effectiveness of machine learning. These findings also emphasize the importance of daily documentation in nursing practice, especially in terms of recording patients’ symptoms. In addition, the present findings highlight the importance of information heterogeneity in medical documents, which should be noted when considering the secondary use of medical documents using NLP. The analysis of medical documents by NLP is expected to improve efficiency in the field of clinical research1; therefore, clarification is required as to whether the information to be collected is a disease name or a symptom name. In other words, MD notes should be analyzed if the intention is to collect the disease name, whereas RN notes should be analyzed if the intention is to collect the symptom name. Thus, it is necessary to select the applicable subject for the analysis carefully, taking the information heterogeneity of medical documents into account. Furthermore, as adopted in the present study, by extracting essential information from progress notes using NLP in an efficient and structured manner as standardized codes such as R codes or ICD codes, the extraction frequency of each code may be utilized as a quality indicator for medical safety management. For example, if a significant code extraction is found from the progress notes, it will be possible to detect this as a signal of some kind of abnormality and take appropriate clinical action. We expect that the extraction of standardized codes from progress notes will be further explored in the future.

This study has several strengths. To our knowledge, this is the first report to compare quantitatively the amount of information in physician and nursing records from the perspective of NLP. Various reports have attempted to extract information from medical documents7,8,9,10,11,12,13, but none of these studies have compared quantitatively the amount of information between different documents linked to identical patients. We achieved this comparison by converting various narrative representations into standardized codes. Second, a well-planned research protocol was designed for the present study. The NLP software used in this study was validated based on GS definitions by human experts; we believe that this process improved the reliability of the results in the comparison among documents in the second half of the study. This is also the first report to visualize the increase in symptom names in physician and nursing records with a corresponding increase in the level of medical need. On the other hand, this study also has some limitations. First, it was conducted at a single institution and within a single disease domain. Furthermore, as shown in Table 2, the analytical data set contained progress notes written at a high rate by specific physicians or nurses. Therefore, it is possible that diversity in progress notes by institution, disease domain, or documenter has not been sufficiently considered. Second, this study was an experiment using MedNER-J NLP software. If different NLP software had been used for analysis, it might have changed the code extraction performance, and consequently, the code extraction rate. Third, here, we focused only on inpatient progress notes; outpatient progress notes were out of the scope of the present study. In outpatient care, the frequency and content of RN notes are expected to be quite different. Lastly, this study was conducted in Japan, where the nurse practitioner system is not widespread. Under the nurse practitioner system, specific nurses are responsible for some of the diagnoses, in which case RN notes likely include more disease names than what was found in the present analysis.

Conclusion

The findings of this study showed quantitative differences in that nurses wrote symptoms more frequently in progress notes than did physicians. By contrast, physicians wrote more disease names in progress notes than did nurses. However, limited to their specialty fields, physicians also wrote symptoms frequently. On the other hand, no biases by organ specialty were found in RN notes. In addition, there were almost no days throughout the patient’s hospitalizations in which RN notes were not written, and RN notes were written multiple times a day. These results suggest that extracting R codes from RN notes allows the extraction of more fine-grained information about patient conditions that might change day-by-day. Therefore, gaining a better understanding of the characteristics of each medical document is necessary for the proper application of NLP in medical care.