Introduction

Secondary use of electronic healthcare records for registry research requires the consistent and comprehensive extraction of diagnoses. Reimbursement diagnoses are an obvious choice because they already come in a code system such as the International Classification of Diseases, 10th revision (ICD-10)1. However, administrative diagnoses are potentially incomplete because the medical history is commonly omitted2,3. Accordingly, additional clinical data improved performance in a specific classification task in comparison to using reimbursement diagnoses alone4.

Physicians’ reports are a better source of diagnoses because they usually cover the medical context more comprehensively than reimbursement diagnoses. The major downside is that natural language must be transformed into a code system before the data can be used in registries5.

This transformation is complicated by the abundance of grammatical variants in physicians’ reports, which necessitate sophisticated grammatical algorithms specifically adapted to the medical domain6,7,8. Such methods have recently been proposed for English medical texts but not yet for medical texts in German. Consequently, we are not aware of a method to automatically infer ICD-10 codes from German ophthalmological physicians’ reports in natural language.

Word vector embeddings are a language-neutral, grammar-free, and purely data-driven approach that has already been used in many medical natural language processing tasks and applications9. They map words into a multidimensional number space on the basis of semantic relatedness alone. Word2vec, for example, trains a shallow neural network for this purpose and yields vector representations of words that capture their semantic relationships10.

We herein describe a methodology to infer ICD-10 codes from German-language ophthalmological physicians’ notes, based on Word2vec embeddings created from two gigabytes of historical physicians’ reports from one university eye hospital in combination with a large thesaurus of ICD-10 codes.

We evaluate the performance of our method in this hospital and additionally in two other German eye hospitals. This was done to assess how well the method is suited to a multi-centric setting, such as automated feeding of a nation-wide ophthalmologic registry without specific adaptations for each contributor.

Methods

Building the embedding space

Our embedding is based on the Word2vec method, which maps words into a multidimensional number space based on semantic relatedness, as described previously5. We used a ‘phrases-file’ extension11 to maintain the meaning of multi-word diagnoses. The phrases file comprised annotations from the ‘Alpha-ID’ catalog and all annotations to ICD-10 codes entered into the administrative database of the University Hospital Freiburg over the last 15 years. The corpus file contained plain text from anonymized physicians’ reports written between 2003 and 2018 at the Eye Center of the University Hospital Freiburg. Pre-processing steps included eliminating numbers higher than 10 and non-alphanumeric characters, splitting composite words, transcribing German umlauts, and converting all words to lowercase. The Word2vec executable was run with specific settings, such as a window size of 5 words and 5 iterations to optimize the neural network. For example, the word ‘cataract’ might be represented by a 300-dimensional vector [0.12, −0.08, …, 0.23], capturing its semantic context within the corpus. A multi-word phrase such as ‘rhegmatogenous retinal detachment’ is treated as a single entity in the embedding space, preserving its semantic meaning. The resulting embedding space comprises a vocabulary of 347,370 distinct words or phrases in 300 dimensions.
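
To make these pre-processing steps concrete, the following minimal R sketch illustrates the described normalization on a single line of text. It is a simplified illustration rather than the original pipeline: the function name and example sentence are invented, compound-word splitting is omitted, and the commented training call only indicates parameters of the kind mentioned above.

```r
# Minimal sketch of the corpus normalization (simplified; not the original
# pipeline, and compound-word splitting is omitted).
preprocess_line <- function(text) {
  text <- tolower(text)
  # transcribe German umlauts and sharp s
  for (p in list(c("ä", "ae"), c("ö", "oe"), c("ü", "ue"), c("ß", "ss"))) {
    text <- gsub(p[1], p[2], text, fixed = TRUE)
  }
  # replace non-alphanumeric characters with spaces
  text <- gsub("[^a-z0-9]+", " ", text)
  # eliminate numbers higher than 10
  tokens <- strsplit(trimws(text), " ")[[1]]
  nums   <- suppressWarnings(as.numeric(tokens))
  tokens <- tokens[is.na(nums) | nums <= 10]
  paste(tokens, collapse = " ")
}

preprocess_line("Diagnose: Rhegmatogene Netzhautablösung rechts, 12.03.2018")
# -> "diagnose rhegmatogene netzhautabloesung rechts 03"

# The Word2vec executable (with the 'phrases-file' extension mentioned above)
# was then trained on the pre-processed corpus with parameters along these
# lines (illustrative):
# ./word2vec -train corpus.txt -output vectors.bin -size 300 -window 5 -iter 5
```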

Annotating the embedding space

We compiled a comprehensive ICD-10 thesaurus from the aforementioned Alpha-ID catalog together with the manual annotations to ICD-10 codes that had been entered into the administrative database of the University Hospital Freiburg. After cleanup and removal of duplicates, we obtained a coding thesaurus with 84,419 natural language annotations for 16,132 distinct ICD-10 codes. We annotated the embedding space with this thesaurus by ‘placing’, for each ICD-10 code, the centroid of all embedded thesaurus entries. These centroids serve as the targets for the nearest-neighbor search described below.
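
A minimal R sketch of this annotation step is given below. The objects 'emb' (a numeric matrix with one row per word or phrase and 300 columns) and 'thesaurus' (a table with columns 'icd10' and 'annotation') are assumed placeholders, not the original implementation.

```r
# Sketch of placing one centroid per ICD-10 code in the embedding space.
# Assumed inputs (placeholders): 'emb' is a numeric matrix with tokens as row
# names and 300 columns; 'thesaurus' is a data.frame with columns 'icd10'
# and 'annotation'.

embed_text <- function(text, emb) {
  tokens <- strsplit(tolower(text), "\\s+")[[1]]
  tokens <- tokens[tokens %in% rownames(emb)]        # ignore out-of-vocabulary tokens
  if (length(tokens) == 0) return(NULL)
  colMeans(emb[tokens, , drop = FALSE])              # column-wise average of word vectors
}

centroids <- lapply(split(thesaurus$annotation, thesaurus$icd10), function(texts) {
  vecs <- do.call(rbind, Filter(Negate(is.null), lapply(texts, embed_text, emb = emb)))
  if (is.null(vecs)) NULL else colMeans(vecs)        # centroid of all embedded entries
})
icd_centroids <- do.call(rbind, Filter(Negate(is.null), centroids))
# 'icd_centroids' now contains one 300-dimensional row per ICD-10 code.
```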

Inference of ICD-10 codes

We opted for a nearest-neighbor search to link inference queries to the most appropriate ICD-10 code. When a query comprises multiple words, we numerically average the embedding vectors column-wise. When the query contains a phrase from the phrases file, the phrase embedding is concatenated to the vector array before averaging. Finally, we look for the ICD-10 cluster in the embedding space that is nearest to the averaged embedding of the query sequence. We also calculate the cosine distance between the query embedding and the ICD-10 centroids; this enables thresholding to reduce misclassifications in case no sensible neighbor is found. Model inference was implemented using the R system. We set up an intranet web service for our inference studies.
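
Continuing the sketch from the previous subsection, a minimal version of the nearest-neighbor lookup with cosine thresholding could look as follows; the threshold of 0.5 is purely illustrative, and the concatenation of phrase embeddings is omitted for brevity.

```r
# Sketch of nearest-centroid inference with cosine thresholding; reuses
# 'embed_text', 'emb' and 'icd_centroids' from the previous sketch. The
# threshold value is illustrative, not the one used in the study.
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

infer_icd10 <- function(query, emb, icd_centroids, min_sim = 0.5) {
  q <- embed_text(query, emb)                         # averaged query embedding
  if (is.null(q)) return(NA_character_)               # fully out-of-vocabulary query
  sims <- apply(icd_centroids, 1, cosine_sim, b = q)  # similarity to each ICD-10 centroid
  if (max(sims) < min_sim) return(NA_character_)      # reject: no sensible neighbor found
  names(which.max(sims))                              # ICD-10 code of the nearest centroid
}

# Hypothetical query, pre-processed like the training corpus:
infer_icd10("rhegmatogene netzhautabloesung", emb, icd_centroids)
```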

Validation study

We included a total of three German eye hospitals with no overlap in personnel or patients. The Eye Center of the University Hospital Freiburg, the source of the Word2vec embedding space, was the first hospital (H1). Here, we made sure that none of the physicians’ reports of the evaluation study had been part of the corpus that was used for generating the embeddings. The second hospital (H2) was the Department of Ophthalmology, Asklepios Hospital Nord-Heidberg, Hamburg, Germany. The third hospital (H3) was the Department of Ophthalmology, University of Münster Medical Center.

We requested all diagnosis segments of 100 physicians’ reports from all three hospitals. All data were manually anonymized at the sending hospital in order to ensure complete anonymity before the texts were sent away for inference. The diagnosis segments were extracted from the reports using regular expressions in the Perl programming language. All segments were manually reviewed upfront to remove non-diagnoses such as medication mentions. Inference was performed using an intranet-based implementation of our algorithm. The inferred ICD-10 codes were returned to the senders for evaluation. Evaluation comprised assessing whether the ICD-10 code was completely correct or, in a secondary assessment, at least correct with respect to the superordinate disease group (e.g. rhegmatogenous retinal detachment and serous retinal detachment are both retinal detachments).

Ethics and data protection

We did not need to seek advice from the ethics committees because the data of this study had been anonymous from the start. It was neither possible nor necessary to obtain informed consent from the patients for the same reason. All methods were carried out in accordance with relevant guidelines and regulations.

Ethical approval

Ethics approval was not needed because all data in this study were anonymous from the start. In Germany, research with data that cannot be traced back to natural persons is not regulated by the professional codes of the medical associations. We therefore did not seek ethical advice as we only used anonymized data from the start. Furthermore, the EU GDPR does not apply to anonymized data (detailed in recital 26 of the EU GDPR). On this basis, we herewith confirm that all methods were carried out in accordance with relevant guidelines and regulations.

Results

A total of 3332 word sequences were extracted (812 in H1, 1473 in H2, 1047 in H3), of which 2806 were identified as diagnoses. A total of 2806 ICD-10 codes were inferred by our methodology (771 in H1, 1226 in H2, 809 in H3). In H1, 98% were classified as completely correct (CC) and 99% as correct (C) with respect to the superordinate disease concept. The percentages in H2 were 69% (CC) and 86% (C). The respective numbers for H3 were 69% (CC) and 91% (C). Examples of correctly and incorrectly inferred ICD-10 codes are shown alongside the input texts in Table 1.

Table 1 Representative examples from the validation study.

Table 2 presents the CC-accuracy of ICD-10 code inference for different disease groups across the three hospitals. Common eye disorders such as disorders of refraction and accommodation (H52), senile cataract (H25), and disorders of the vitreous body (H43) achieved high accuracy rates (> 80%) across all hospitals. However, the accuracy for some disease groups, such as other retinal disorders (H35) and disorders of the globe (H44), varied considerably between hospitals. Rare diseases, denoted by an asterisk (*), generally had lower accuracy rates, with some conditions like congenital malformations of the posterior segment of the eye (Q14) and failure and rejection of transplanted organs and tissues (T86) having 0% accuracy in H3.

Table 2 Percentage CC-Accuracy of ICD-10 code inference for different disease groups across the three hospitals.

Notably, H1 achieved 100% CC accuracy for most disease groups, likely due to its contribution to the training data. H2 and H3 showed lower accuracy rates for certain disease groups, such as type 2 diabetes mellitus (E11) and other retinal disorders (H35). Some disease groups, such as iodine-deficiency-related thyroid disorders (E01) and personal history of allergy to drugs, medicaments, and biological substances (Z88), were only present in H3, while others, such as transplanted organ and tissue status (Z94) and family history of other specific disorders (Z83), were only found in H2.

Discussion

Our data indicate that it is possible to infer ICD-10 codes from routine German physicians’ reports with high precision. This especially holds when the training data closely match the queries, as was the case in the first hospital. Although we ensured that there was no direct overlap between the letters from the training corpus and the validation study, some physicians may have contributed text material to both the training corpus and the validation study. We speculate that characteristic and stable language preferences may have improved the performance in the validation study of the first hospital, e.g. by reducing out-of-vocabulary queries. Another reason for the excellent performance in the first hospital is that the ICD thesaurus had been derived from all manual annotations to ICD-10 encodings retrieved from the administrative database of the same hospital. This ensured that all non-ophthalmological diagnoses of all specialities in this large university hospital were covered, including rare disease peculiarities. Neither advantage was available at the other two hospitals.

A use case of our method is to provide diagnoses for the nation-wide registry “oregis”, a subsidiary of the German Ophthalmic Society, directly from physicians’ reports via a special adaptor software installed locally12. This approach would be feasible only if the inference also worked for hospitals or clinics that did not contribute to the training corpus and thesaurus, which is why we included the two other hospitals in our validation study. It is important to note that the second hospital was not a university center but a regional hospital with a patient spectrum more similar to that of office-based ophthalmologists than to that of the university hospitals.

Interestingly, H2 and H3 yielded relatively high levels of correctness, which may be sufficient for downstream tasks focusing on more common diseases and relying on relative comparisons, such as temporal changes or cross-sectional analyses, rather than absolute numbers.

However, other use cases may demand the higher degree of accuracy observed in H1. Out-of-vocabulary queries were identified as major contributors to the performance degradation in H2 and H3. These queries were primarily attributable to center-specific abbreviations or diagnosis synonyms. For example, H2 frequently used the abbreviation 'PDRP' for proliferative diabetic retinopathy, which is not commonly used in H1. Additionally, non-ophthalmological and rare disease diagnoses arising from local peculiarities may also contribute to the performance differences. For instance, H3 used abbreviations such as "KDIGO", suggesting a greater involvement in treating patients with kidney disorders compared to H1.

Such local peculiarities could be a source of bias in the unmodified application of our method when feeding "oregis": specific diagnoses may be systematically under-reported due to hospital-specific variations in terminology and thus falsely suggest regional variations in the prevalence or incidence of certain diseases. Given the inclusiveness of oregis, which aims to eventually reach all German ophthalmologists, the method should be refined to become more robust against regional variations in the physicians’ letters.

Because the computational resources needed by our method are rather limited, the Word2vec embedding could easily be redone overnight after extending the training corpus with historical physicians’ letters from other sites. This could even be done progressively just before new sites are added. These improvements would most likely mitigate the degradation of inference accuracy at novel sites and bring the overall performance closer to that of the first hospital in our validation study.

A major challenge for disseminating this method to other hospitals or medical-surgical disciplines, however, will be to guarantee compliance with the General Data Protection Regulation of the European Union. Automated anonymisation systems have already been proposed for English datasets13. However, we are not aware of de-identification tools for the German language that are sufficiently robust to allow giving away large numbers of physicians’ letters to a third party without consent from the patients. For this reason, our method currently has to be implemented locally at each site. Obviously, more research is warranted in this field in order to enable a continuously growing corpus of clinical text and to make such a system available 'off-the-shelf' for the broader medical domain in Germany.

Conclusion

In conclusion, our study demonstrates the feasibility of automatically inferring ICD-10 codes from German ophthalmology medical records using a Word2vec-based approach. The proposed method achieved high accuracy, particularly when trained on data from the same hospital. However, performance variations across hospitals highlight the need for further refinement and adaptation to hospital-specific terminologies. Future work should focus on expanding the training corpus, improving the handling of rare diseases, and facilitating the method's dissemination to other medical domains and institutions.