Introduction

High-resolution imaging of the central retina using optical coherence tomography (OCT) plays a key role in the diagnosis and monitoring of the most common macular diseases such as age-related macular degeneration (AMD), diabetic macular edema (DME), and retinal vein occlusion (RVO)1,2. OCT biomarkers are specific properties of measurements that are extracted from the OCT images to provide information about the condition of the tissues and tissue layers within the human eye. Furthermore, detailed analysis of different biomarkers on OCT scans is now the basis for treatment decisions, as several biomarkers not only provide information on the diagnosis of these eye conditions but also play an important role in predicting the treatment response. With the increasing amount of available data on therapeutic strategies, the identification of biomarkers with predictive value, and the different medication options such as intravitreal operative medication therapies (IVOM) for macular edema or diabetic retinopathy patients, it is challenging for ophthalmologists to individualize the therapy for each patient. Artificial intelligence (AI) based algorithms should, in the future, help to find optimal individual therapeutic strategies for each patient. Real-world studies have already demonstrated how important an early start of treatment and compliance with the therapy are3,4,5. Yet, real-world studies need more attention, to which we contribute with a setting of real-world study data. The study group of Gerding et al.6 was one of the first to classify patients with AMD into three treatment success groups based on visual acuity (VA) and central retinal thickness: therapy “winners”, “stabilizers”, and “losers” (WSL classification scheme).
This interdisciplinary cooperation between IT specialists and ophthalmologists aims at analyzing the patients’ data according to the WSL classification scheme while identifying predictive values for several OCT biomarkers. The results should help ophthalmologists to better define their therapy strategy for each patient in everyday practice. Moreover, Schmidt-Erfurth et al. recently reported the potential of AI-based approaches for targeted optimization of diagnosis and therapy for eye diseases7. In their contributions, they furthermore describe the impact of deep learning (DL) for the prediction of patient progressions in the earlier stages of AMD utilizing OCT biomarkers8,9. Whereas current state-of-the-art research also explores the explainability as well as the related nomenclature when reporting AMD-related diseases10,11, AMD such as exudative AMD as well as DME and RVO can be seen as the three most prevalent investigated eye diseases within the context of AI11,12,13,14.

Figure 1

Patient progression visualization and modelling dashboard developed with medical doctors for medical doctors as well as researchers. Visualized are general patient information, visual acuity, OCT biomarkers, diagnoses, and medications. It is possible to annotate the expected course of the VA on site at the position of the red question mark. The shown data set is inspired by real patients and is synthesized to avoid re-identification. The distance between two adjacent vertical guide lines is 50 days.

Related work

In the following sections, we discuss ophthalmic research on data mining from clinical IT systems (Section “Availability of ophthalmic data in research”), text and OCT image processing (Section “Ophthalmic text and OCT image processing”), and the use of the processed data for patient progression modelling (Section “Patient progression modelling”).

Availability of ophthalmic data in research

In recent years, more and more ophthalmic data sets have been released15,16. A recent review article identified 94 open access data sets containing 507,724 images and 125 videos from 122,364 patients, with diabetic retinopathy, glaucoma, and AMD being disproportionately over-represented in comparison to other eye diseases. However, the documentation of demographic characteristics such as age, sex, and ethnicity was reported to be of poor data quality even at the aggregate level17. In 2017, we proposed a prototypical workflow to aggregate ophthalmic text and image data of all patients from the Department of Ophthalmology of the maximum care hospital Klinikum Chemnitz gGmbH in Chemnitz, Germany. We combined data mining and basic natural language processing (NLP) utilizing the interface of the clinic’s practice management software Turbomed (CompuGroup Medical) and extracted a set of preliminary diagnostic patient data in order to determine the ratio of patients with VA improvement, stabilization, and deterioration18.

Ophthalmic text and OCT image processing

While widely used for general text processing, NLP systems have recently been demonstrated to robustly extract medication information from clinical notes to study VA, intraocular pressure, and medication outcomes of cataract and glaucoma surgeries19,20, to develop predictive models for low-vision prognosis21, as well as to predict glaucoma progression22. Following De Fauw et al.23, especially machine learning (ML) and deep learning based approaches enable a more precise progression modelling, as recent advances prove their applicability and capabilities within the domain. Moreover, recognition of OCT biomarkers24 allows further VA and treatment based medical forecasts, including ensemble-based solutions to OCT segmentation25 that enable the completion of incomplete OCT documentations. In addition, automated thresholding algorithms can yield an improved reproducibility of OCT parameters while allowing a more sensitive diagnosis of pathologies when, e.g., discriminating between healthy and impaired macular perfusion26.

Figure 2

Proposed medical text and image data mining workflow for the following descriptive statistics and VA modelling based on our three base systems of the electronic health record system (EHR or Turbomed system), the clinical information system (CIS system), and the OCT system. Our patient progression visualization and modelling dashboard is shown in Fig. 1. Currently, the classification and prediction step of treatment adjustment is still in development. However, given all data, a treatment recommendation can be realized.

Patient progression modelling

Schmidt-Erfurth et al.8 investigate the influence of hyperreflective foci as OCT biomarkers during the progression of geographic atrophy within the context of AMD. While utilizing deep neural networks (DNN) for OCT segmentation27, they identify and localize occurrences given a data set of 87 eyes from 54 different patients. Following Schmidt-Erfurth et al.’s contribution, Waldstein et al. propose a further developed system while evaluating it with 8529 OCT volumes of 512 different patients9. Moreover, time-dependent sequence modelling using recurrent neural networks (RNN)28 constitutes a promising approach to treatment prediction29. At this point it is also noted that approaches to patient progression modelling exist that explore conventional models. These include mathematical models, e.g., to determine the effect of the anti-vascular endothelial growth factor on VA via medication concentration and tolerance30, as well as regression-based approaches, e.g., to predict VA in diabetic retinopathy via the foveal avascular zone area31.

Contribution of this work

In this contribution, we present an IT system architecture that aggregates patient information for more than 49,000 patients from different categories of multimedia data in the form of text and images within multiple heterogeneous ophthalmic data resources. The resulting data corpus enables predictive statements to be made about the expected progression of a patient’s visual acuity after at least four VA examinations in each of the three diseases AMD, DME, and RVO. A more fine-grained analysis is conducted to reveal the influence of co-existing medical factors such as other diseases in this real-world setting. Within our proposed multistage system, an ensemble of deep neural networks allows the completion of incomplete or missing OCT documentations. In order to conduct patient progression modelling, we define a fundamental use case for the prediction of the VA by aggregating different patient information as input of the subsequent VA forecast after a given time period. We present an evaluation formalism and discuss the resulting predictions in comparison to those of human annotators such as experienced ophthalmologists. In order to enable ophthalmologists to annotate their predictions regarding the patient-wise expected VA progression, our proposed patient progression visualization and modelling dashboard gives an overview of our aggregated data with patient-wise information on, among others, general patient information and VA, OCT biomarkers, diagnoses, and medications (Fig. 1).

Table 1 Exemplary (German) ophthalmic text processing rules for exudative AMD, DME, and RVO.

Methods and implementation

This section provides the methods and their implementation related to the patient progression modelling. Firstly, the proposed system’s architecture for data acquisition, preprocessing, analysis, and prediction is introduced (Section “System architecture”). To allow a unified data processing, the role of the related terminologies for medical application (ophthalmic ontology) is addressed (Section “Text processing and ontology”), followed by the data fusion and cleaning (Section “Data fusion and cleaning”). Subsequently, descriptive statistics are derived (Section “Category-centered data organization and descriptive statistics”). Finally, the principles of the patient progression modelling within the context of our ML- and DL-based approaches are introduced (Section “Patient progression modelling”).

The related implemented models and approaches to patient progression modelling as well as OCT biomarker classification are provided via open science with the machine learning framework Hexnet32 and can be found on its project page and repository under https://github.com/TSchlosser13/Hexnet (see also _ML/models/contrib). The framework Hexnet provides the functionality that mainly allows for the utilization of out-of-the-box ML- and DL-based methods and models, including common routines for data storage management, preprocessing, model training and testing, as well as evaluation. It was originally developed for hexagonal image processing and deep learning and has recently been further developed for classical machine learning. Its machine learning module (directory _ML/) is based on the machine learning library TensorFlow with its front end Keras, whereas scikit-learn is deployed for all machine learning based models and evaluation procedures. Within the current research work, Hexnet’s machine learning module has been extended to enable the processing of ophthalmic imagery, the data handling of our data vectors, as well as our ophthalmic evaluation through models such as, e.g., statistical, moving average (MA), and weighted MA estimators as well as recurrent neural networks (RNN). These new contributions can be found under models/contrib/ via ophthalmology_evaluation.py and RNNs.py. In comparison, already existing ML and DL models are also present within models/ and models/contrib/, covering regressors and our multilayer perceptron approaches as well as DenseNet201 and ResNet152V2 that are based on Keras or scikit-learn.

The present study was approved by the Institutional Review Board of Saxony (Dresden, Germany) under the number EK-BR-102/20-2. We confirm that all research was performed in accordance with relevant guidelines and regulations. Informed patients’ consent was waived because of the retrospective anonymous design and because no study-related investigations were necessary.

System architecture

Our ophthalmic core data set obtained from data as well as text mining is based on the categories of general patient information (G), VA-based patient information (V), OCT scans and biomarkers (O), diseases (D), as well as treatments and medications (T). These categories are obtained from different base systems: the electronic health record system (EHR system, Turbomed), the clinical information system (CIS system, SAP), and an OCT system (Heidelberg Eye Explorer) (see also Fig. 2). While all three systems contain basic patient information, the OCT system provides first and foremost OCT scans (categories G and O). The EHR system consists of one large database which contains all relevant patient information within the categories of G, V, D, and T. As the treatment and medication information within the EHR system is not always complete, the clinic’s CIS system is additionally utilized in order to retrieve missing medication and therapy information (category CIS). Once the data of all three base systems have been retrieved in the form of BDT (EHR system), CSV (CIS system), and E2E files (OCT system), all relevant information is merged in a patient-centered way. This includes a chronological synchronization of all patient information, whereas sensitive patient information furthermore has to be pseudonymized due to patient data privacy and protection laws. These results are then presented via our patient progression visualization and modelling dashboard (Fig. 1).
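The patient-centered merge with chronological synchronization and pseudonymization can be illustrated by the following minimal sketch. It assumes that the BDT, CSV, and E2E exports have already been parsed into simple records. All field names, the example records, the salt, and the salted-hash pseudonymization are hypothetical choices for illustration and are not taken from the described system.

```python
import hashlib
from collections import defaultdict

# Hypothetical, already-parsed records from the three base systems.
ehr_records = [
    {"patient_id": "P001", "date": "2019-03-01", "category": "V", "value": "VA 0.5"},
    {"patient_id": "P001", "date": "2019-01-15", "category": "D", "value": "exudative AMD"},
]
cis_records = [
    {"patient_id": "P001", "date": "2019-02-01", "category": "T", "value": "IVOM"},
]
oct_records = [
    {"patient_id": "P001", "date": "2019-01-15", "category": "O", "value": "OCT volume"},
]


def pseudonymize(patient_id, salt):
    """Replace the patient ID with a salted hash (illustrative assumption)."""
    digest = hashlib.sha256((salt + patient_id).encode("utf-8")).hexdigest()
    return digest[:12]


def merge_patient_centered(sources, salt):
    """Group all records per pseudonymized patient and sort them by date."""
    timeline = defaultdict(list)
    for source_name, records in sources.items():
        for record in records:
            key = pseudonymize(record["patient_id"], salt)
            entry = {k: v for k, v in record.items() if k != "patient_id"}
            entry["source"] = source_name
            timeline[key].append(entry)
    for entries in timeline.values():
        # ISO 8601 date strings sort chronologically as plain strings.
        entries.sort(key=lambda e: e["date"])
    return dict(timeline)


merged = merge_patient_centered(
    {"EHR": ehr_records, "CIS": cis_records, "OCT": oct_records},
    salt="demo-salt",
)
```

In this sketch, each pseudonymized patient key maps to a single chronological list of entries from all three sources, which is the kind of structure a timeline view such as the dashboard could consume.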

Figure 3

Visual acuity (VA) modelling and evaluation principles. Given a set of past visual acuity examinations, we predict the visual acuity value at the next time step. Top: Visual acuity progression of our exemplary patient from Fig. 1. Predictions of the visual acuity are conducted by our MLP-LDA model starting with the 5th VA value. In addition, we asked an ophthalmologist to predict the VA. As shown in our visualization, the predictions of the ophthalmologist and of our proposed model lie close to each other, which in turn emphasizes their prediction capabilities within the context of our ground truth data. Middle: Difference of the visual acuity from the previous to the current data point, showing the improvement, stabilization, or aggravation of the VA since the last examination. A threshold of 0.1 logMAR units divides the examinations into three groups: \(\Delta \textit{VA}_i < -0.1\) into winners (W), \(-0.1 \le \Delta \textit{VA}_i \le 0.1\) into stabilizers (S), and \(\Delta \textit{VA}_i > 0.1\) into losers (L), whereby i denotes the data point with \(\Delta \textit{VA}_i = \textit{VA}_i - \textit{VA}_{i - 1}\). The first four data points of the ground truth are omitted within the visualization. Bottom: Prediction quality of our model and the ophthalmologist. Shown are the predictions for the patient by our approach to VA prediction and by the ophthalmologist, whereas correct and incorrect WSL predictions are highlighted. The underlying ophthalmic feature vector data organization for VA prediction is depicted in Table 3.

Text processing and ontology

The challenge of the ophthalmic text processing is given by the heterogeneity and incompleteness of the underlying data itself, which are in turn documented by many different medical doctors. For text processing and ontology creation, the programming language Python with the Natural Language Toolkit (NLTK)33 was utilized.

In order to harmonize the weakly-structured medical texts present within our different databases, extracted from electronic health records from over 10 years ranging from 2010 to 2020, we applied a general stemming approach using a Snowball stemmer for the German language, which follows the principles of the Porter stemming algorithm34. Within computational linguistics, this approach enables an automatic tracing and finally a reduction of words to their stem or root word to allow for a unified text representation. To this end, a set of rules is applied until the current word’s stem is extracted34, for which NLTK provides the language-specific rules. During processing, the stemmer is applied iteratively to every word within the current medical text. Based on the obtained stems, specific ophthalmic and medical rules are applied for further processing.

These rules encompass a set of category-specific rules handling abbreviations, negations, and synonyms as well as orthography and grammar variants and mistakes. For this purpose, recognized medical and ophthalmic text strings are mapped into a unique ontology. In the following, we demonstrate some arbitrary strings from the EHR system’s diagnosis with customized (German) abbreviations and how they are mapped to diseases, whereby “\(*\)” denotes a placeholder for any string that is not handled otherwise. The mapping is case-insensitive and was specifically adjusted for German doctors (Table 1). The results have been qualitatively inspected for reliability and plausibility. A thorough quantitative evaluation of our text processing algorithms is still subject to further investigations and will be part of our future research. Therefore, it is beyond the scope of this contribution to further explain the underlying (German) text processing in detail.
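A case-insensitive mapping of diagnosis strings to ontology terms with “*” as a placeholder can be sketched as follows. The rules shown here are hypothetical examples in the spirit of Table 1 and are not the rules of the described system. Stemming as well as the handling of negations and spelling variants are deliberately left out.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: (pattern, ontology term). "*" matches any string,
# and earlier rules take precedence over later ones.
RULES = [
    ("*feuchte amd*", "exudative AMD"),
    ("*exsudative amd*", "exudative AMD"),
    ("*diabet*makula*dem*", "DME"),
    ("*zvv*", "RVO"),
    ("*venen*verschluss*", "RVO"),
]


def map_diagnosis(text):
    """Return the ontology term of the first matching rule, or None."""
    normalized = text.lower().strip()  # case-insensitive matching
    for pattern, term in RULES:
        if fnmatchcase(normalized, pattern):
            return term
    return None


examples = ["V.a. Feuchte AMD RA", "Diabetisches Makulaödem", "Z.n. ZVV links"]
mapped = [map_diagnosis(text) for text in examples]
```

Keeping the rules in an ordered list makes precedence explicit, which matters as soon as a more specific abbreviation has to win over a more general pattern.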

Table 2 Ophthalmic data set overview of our 6 different OCT biomarkers.

Data fusion and cleaning

Our practice management software supports database exports via the BDT file format. Here, we extracted electronic health records from over 10 years ranging from 2010 to 2020, including over 49,000 patients and more than 130,000 examinations. Our six main categories (G, V, O, D, T, and OP) currently span a total of more than 30 subcategories. These include, among others, medication-related information such as information on apoplexy and blood thinning as well as biomarker-related information such as information on central retinal thickness and intraretinal fluid. More than 18,000 IVOMs are available from 2013 to 2020 after a fusion of the EHR and CIS systems’ data. All data were linked to the over 12,000 OCT volumes exported from the OCT system via the patient ID (see also Fig. 2, patient-based data merging).

Subsequently, the obtained merged data table is further processed and cleaned. Firstly, it is cleaned by mapping the arbitrary text strings present in the medical letters to diseases. The medical letters contain several terms and abbreviations for specific diseases as different letters stem from several different doctors (see previous section, Table 1). Secondly, unspecific and invalid terms are revised. For example, the entry “eye side” of treatments is only allowed to have the entries “left” or “right”, while invalid entries such as “-” are removed.
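The second cleaning step can be illustrated with a small sketch for the “eye side” entry. It assumes that rows are plain dictionaries and that a row with an invalid entry is dropped as a whole. Both are assumptions made for illustration, since the text only states that invalid entries are removed.

```python
VALID_EYE_SIDES = {"left", "right"}


def clean_eye_side(rows):
    """Keep only rows whose "eye side" entry is "left" or "right"."""
    cleaned = []
    for row in rows:
        side = str(row.get("eye side", "")).strip().lower()
        if side in VALID_EYE_SIDES:
            # Store the normalized value so later filters can rely on it.
            cleaned.append({**row, "eye side": side})
    return cleaned


treatments = [
    {"treatment": "IVOM", "eye side": "Left "},
    {"treatment": "IVOM", "eye side": "-"},
    {"treatment": "IVOM", "eye side": "right"},
]
cleaned_treatments = clean_eye_side(treatments)
```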

Category-centered data organization and descriptive statistics

After the data fusion and cleaning, the data are organized towards the description of (i) anamnesis, (ii) intravitreal operative medications (IVOMs), (iii) diseases, as well as (iv) visual acuity (Fig. 2). The merged main table is transformed into a patient-centered description. The aim is to record the start and the end dates of anamnesis entries (if available), the diseases, and the IVOM therapy cycles for each patient. We designed our tables with two lines per patient: since the diseases are to a large degree eye-independent, both eyes are included as separate lines in our tables. For the medical doctors, the category-centered tables allow an easier filtering, an easier sorting, and a more compact view of the data for a particular type of medical information.

Finally, all tables are visualized in the statistical-description module, which illustrates, for example, a single data set. In addition, the tables can also be combined to show cross-table-referenced data correlations. From all available visualizations, including up to 30 combinations of visual acuity and disease statistics, we illustrate in this work, due to space limitations, the statistics of the aforementioned three diseases and a disease statistic under the influence of a second disease. As statistical tests, we employ two-sample Student’s t-tests. Furthermore, we quantify the strength of the effect (the increase or decrease in visual acuity) via the standard Cohen’s d metric35. Cohen’s d measures it in standard units, where 0.2 stands for a small, 0.5 for a medium, and \(\ge 0.8\) for a large effect size35. It is calculated via Eq. 1.

$$\begin{aligned} d = \frac{\bar{x}_1 - \bar{x}_2}{s} \end{aligned}$$
(1)

where \(\bar{x}_1\) and \(\bar{x}_2\) are the means of the two data sets (patient populations) and s is the standard deviation of the data.
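As a worked illustration of Eq. 1, the following sketch computes Cohen’s d for two small samples. It assumes that s is the pooled standard deviation of both samples, which is the common convention but is not stated explicitly here. The sample values are invented for the example.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(sample_1, sample_2):
    """Cohen's d with a pooled standard deviation (assumed convention)."""
    n1, n2 = len(sample_1), len(sample_2)
    # statistics.variance returns the unbiased sample variance (n - 1).
    pooled_variance = (
        (n1 - 1) * variance(sample_1) + (n2 - 1) * variance(sample_2)
    ) / (n1 + n2 - 2)
    return (mean(sample_1) - mean(sample_2)) / sqrt(pooled_variance)


# Means 4 and 3, both sample variances 4, hence s = 2 and d = 0.5.
d = cohens_d([2, 4, 6], [1, 3, 5])
```

With these values, d = (4 - 3) / 2 = 0.5, which corresponds to a medium effect size on the scale given above.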

Patient progression modelling

For the OCT biomarker classification and patient progression modelling via visual acuity prediction, a set of prominent as well as conventional approaches from machine learning and deep learning are adapted. The machine learning library TensorFlow with Keras as its front end is utilized, while furthermore, scikit-learn is deployed for all machine learning based models and evaluation procedures. As the extracted medical data from the electronic health records of our patients originates from the documentations of ophthalmologists, these data will be defined as our ground truth. Data are extracted from the documentations as described in Section “Text processing and ontology” and will therefore serve as our training and test data set.

For this purpose, our patient progression visualization and modelling dashboard depicted within Fig. 1 is utilized to enable the annotation of the expected course of the visual acuity on site by ophthalmologists. These annotations are in turn used to obtain a comparison for the prediction capabilities based on our combined data corpus.

Classification definitions and terminology

For evaluation, we deploy the macro average F1-score. The macro average F1-score is calculated via the class-wise F1-scores f, where F denotes the set of all class-wise F1-scores following our WSL classification scheme: \(\textit{macro average F1-score} = {1}/{|F|} \cdot \sum _{f \in F} f\), whereas the class-wise F1-scores are in turn calculated via \(\textit{F1-score} = 2 \cdot {(\textit{precision} \cdot \textit{recall})}/{(\textit{precision} + \textit{recall})}\) with \(\textit{precision} = {\textit{TP}}/{(\textit{TP} + \textit{FP})}\) and \(\textit{recall} = {\textit{TP}}/{(\textit{TP} + \textit{FN})}\) (TP = classification true positives, FP = false positives, and FN = false negatives).
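The definitions above can be sketched in a few lines of plain Python. The label names “W”, “S”, and “L” follow the WSL classification scheme. The example labels are invented, and setting a score to 0 when its denominator is 0 is an assumption of this sketch.

```python
def class_f1(y_true, y_pred, label):
    """F1-score of a single class from TP, FP, and FN counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)


def macro_f1(y_true, y_pred, labels=("W", "S", "L")):
    """Unweighted mean of the class-wise F1-scores."""
    return sum(class_f1(y_true, y_pred, label) for label in labels) / len(labels)


y_true = ["W", "S", "S", "L", "L", "S"]
y_pred = ["W", "S", "L", "L", "S", "S"]
score = macro_f1(y_true, y_pred)
```

In this example, the class-wise F1-scores are 1 for W, 2/3 for S, and 1/2 for L, so the macro average is 13/18, i.e., about 0.72. Because every class contributes equally, a rare class such as the winners weighs as much as the frequent stabilizers.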

Training and test setup

Our training setup for visual acuity prediction furthermore includes the Glorot initializer36 for weight initialization as well as the Adam optimizer and batch size in the same parameterization as for our baseline model (Section “OCT biomarker classification”). Over all experiments, we conducted our test runs with a randomized data set split ratio of 80/10/10 for training, validation, and test set.
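A randomized 80/10/10 split can be sketched as follows. The fixed seed and the sample-wise (rather than patient-wise) shuffling are assumptions of this sketch, as the text does not specify either.

```python
import random


def split_80_10_10(samples, seed=0):
    """Shuffle the samples and split them into train, validation, and test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


train_set, val_set, test_set = split_80_10_10(range(100))
```

If several samples stem from the same patient, a patient-wise split would avoid leakage between the sets. Whether this applies here is not stated.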

OCT biomarker classification

Within our proposed approach, OCT biomarkers are firstly classified using the provided OCT B-scan images (local, slice-wise classification). These B-scan images are slices through the three-dimensional, scanned back of the eye produced by the OCT scan. The local, slice-wise classifications are then combined in order to classify the whole OCT scan (global, scan-wise classification). To complete incomplete OCT biomarker documentations, our multistage system consists of an ensemble of different models for the local, slice-wise OCT classification, which in turn therefore enables the global, scan-wise OCT biomarker classification. For the scan-wise OCT biomarker classification, the beforehand obtained slice-wise classifications are combined via a random forest classifier as based on our classification scheme. For this combination purpose, the slice-wise obtained classification confidences are utilized and fused to a scan-wise, global class.

Our 6 OCT biomarkers of interest are each separated into two states, physiological and pathological, defining two distinct classes. These take the values interrupted (pathological) versus preserved (physiological) for external limiting membrane (ELM) and ellipsoid zone (ellipsoid), as well as present (pathological) versus not present (physiological) for foveal depression, retinal pigment epithelium (RPE), scars, and subretinal fibrosis, respectively. An OCT biomarker data set overview with the available OCT B-scans per OCT biomarker and related classes is shown in Table 2, resulting in a total of 12 data subsets, for which a classification into pathological versus physiological OCT scans is conducted. For the extraction of OCT slices per biomarker, we determined an intermediate subset of slices with a slice range of 8 to 18 from 25 in total (8..18 out of 1..25). This range has been selected as, in our experience, most information is present within this slice range. Since different OCT scans may possess different original image resolutions, their image resolution was scaled to an initial size of \(256 \times 256\) pixels for ML models and DNNs.
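The two-stage idea of slice selection followed by a scan-wise decision can be sketched as follows. Note that the fusion step here is a plain mean of the slice-wise confidences with a threshold of 0.5. This is a deliberately simplified stand-in for the random forest classifier described above, and the function names and the threshold are hypothetical.

```python
def select_central_slices(b_scans, first=8, last=18):
    """Return the B-scans first..last, counted from 1 and inclusive."""
    return b_scans[first - 1:last]


def fuse_slice_confidences(confidences, threshold=0.5):
    """Fuse slice-wise pathological confidences into one scan-wise class.

    Simplified stand-in: mean confidence instead of a random forest.
    """
    mean_confidence = sum(confidences) / len(confidences)
    label = "pathological" if mean_confidence >= threshold else "physiological"
    return label, mean_confidence


volume = list(range(1, 26))  # placeholder for 25 B-scans of one OCT volume
central = select_central_slices(volume)
label, confidence = fuse_slice_confidences([0.9, 0.8, 0.4])
```

A learned fusion model such as a random forest can weight slices differently and capture interactions between them, which a plain mean cannot.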

Based on the obtained OCT biomarker classifications, the subsequent VA modelling is realized as a time series prediction using, among others, different ML- and DL-based models such as multilayer perceptrons (MLP)37 and recurrent neural networks38,39, also shown in Table 6. For our classical multilayer perceptron classifier as baseline model, the following parameters were chosen as its configuration: one input, one hidden, and one output layer with a hidden layer size of 100. We utilize the ReLU activation function40,41, the Adam optimizer42 with a standard learning rate of 0.001 and exponential decay rates of 0.9 and 0.999, as well as a batch size of 32. For all remaining parameters, standard values as provided by scikit-learn are applied.

Table 3 Ophthalmic data set overview with utilized data organization for VA prediction.

Visual acuity prediction

To allow a WSL-based grouping of VA values, the logMAR score of each VA value is derived via \(\textit{VA}_\textit{logMAR} = -\log _{10}{} \textit{VA}_\textit{dec}\)43. We define a decimal VA range of 0.04 to 2.0, corresponding to a logMAR range of 1.4 to \(-0.3\). The visual acuity delta (\(\Delta \textit{VA}_i\)) of the examination i is calculated by comparing two adjacent logMAR VA values via \(\Delta \textit{VA}_i = \textit{VA}_i - \textit{VA}_{i - 1}\). A threshold of 0.1 logMAR units classifies examination i depending on its progression: \(\Delta \textit{VA}_i < -0.1\) is considered a progression winner, \(-0.1 \le \Delta \textit{VA}_i \le 0.1\) a progression stabilizer, and \(\Delta \textit{VA}_i > 0.1\) a progression loser (see also Fig. 3). The threshold of 0.1 logMAR units has been selected in order to enable a reliable categorization of progression winners, stabilizers, and losers where an improvement, stabilization, or aggravation in visual acuity is apparent.
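The conversion and the WSL grouping can be written down directly from the formulas above. The function names are chosen for this sketch only.

```python
from math import log10


def decimal_to_logmar(va_dec):
    """Convert a decimal visual acuity value to logMAR."""
    return -log10(va_dec)


def wsl_class(va_prev_logmar, va_curr_logmar, threshold=0.1):
    """Classify the change between two adjacent logMAR values as W, S, or L."""
    delta = va_curr_logmar - va_prev_logmar
    if delta < -threshold:
        return "W"  # lower logMAR means better vision: winner
    if delta > threshold:
        return "L"  # higher logMAR means worse vision: loser
    return "S"      # within the threshold: stabilizer


# Decimal VA improves from 0.5 to 0.8, i.e., logMAR drops from ~0.30 to ~0.10.
progression = wsl_class(decimal_to_logmar(0.5), decimal_to_logmar(0.8))
```

In this example, the delta is about -0.20 logMAR, so the examination counts as a progression winner.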

Table 3 gives an overview of our ophthalmic data set with its data organization for VA prediction as it is provided to our predictive models. In Table 3, the data organization of an exemplary time window of 4 VA measurements is shown in (a) for the first 10 of 24 lines of the feature data vector, while (b) explains in detail all 24 lines of medical features associated with each VA’s feature vector, including “treatment”, “OCT biomarker”, as well as “additional data”. The full data feature vector is then translated to numerical values and fed to the ML/DL models, illustrated with an example in (c). In (c), “\(-1\)” denotes the numeric placeholder when no information is present within the respective data fields. To predict the VA at a given date, machine learning models typically require a fixed data input size, i.e., a matrix or vector of fixed dimensions. Part (a) shows the values for the first time window of size 4 in Fig. 1. Our analyses currently include only IVOM therapies, and our data vectors currently contain no information on other therapeutic interventions such as operations.

In order to make predictions for time windows of different sizes, we define a matrix with predefined dimensions of 24 rows (see medical feature vector in Table 3b) and 10 columns, which corresponds to 10 time steps or date entries with available information. The minimum time window size is 4, which means that the first 4 of in total 10 columns of the matrix are filled with patient data, as shown by way of example in Table 3b. The remaining 6 columns are set to “\(-1\)”, in particular when no more temporal information is available. As long as more temporal data are available, the window is iteratively increased by one, i.e., in each iteration, one more column is filled with values. The subsequent VA is modelled, for which it is determined whether the VA improves, remains constant, or deteriorates (WSL classification scheme). However, at maximum, 10 time steps (columns) are used given that the visual acuity for the 11th time step is known. Formally, all models in Table 6 require a vector as input. Thus, the matrix was reshaped into a 1 \(\times\) 240 vector to be used in model training and testing. The MLP model described in Section “OCT biomarker classification” with its configuration is again utilized as baseline model. Given our MLP model for visual acuity prediction, a meta model is realized that classifies the predicted VA values via our WSL classification scheme. Since the MLP predicts the VA values, our so-called MLP-LDA model utilizes these predictions by further classifying them via a linear discriminant analysis (LDA)44 into our WSL-based classes. To this end, the shown data vectors in Table 3c are extended with the visual acuity predictions of our MLP model in each time step. MLP-LDA processes these extended data vectors while correcting the previously obtained visual acuity prediction of MLP.
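The construction of the fixed-size input can be sketched as follows. The sketch assumes that each examination is already available as a numeric column of 24 features, that the 24 \(\times\) 10 matrix is flattened row by row, and that windows always start at the first examination. None of these details is specified above, so they should be read as assumptions.

```python
N_FEATURES = 24   # rows of the matrix (medical features per examination)
MAX_STEPS = 10    # columns of the matrix (time steps)
MIN_WINDOW = 4    # minimum number of filled columns
PLACEHOLDER = -1  # value for fields without information


def build_input_vector(columns):
    """Pad a window of feature columns to 24 x 10 and flatten it to 240."""
    if not MIN_WINDOW <= len(columns) <= MAX_STEPS:
        raise ValueError("window size must be between 4 and 10")
    matrix = [[PLACEHOLDER] * MAX_STEPS for _ in range(N_FEATURES)]
    for step, column in enumerate(columns):
        if len(column) != N_FEATURES:
            raise ValueError("each column needs 24 feature values")
        for row, value in enumerate(column):
            matrix[row][step] = value
    # Row-major flattening (assumed order).
    return [value for row in matrix for value in row]


def growing_windows(columns, targets):
    """Yield (input vector, VA target of the next examination) pairs."""
    largest = min(len(columns) - 1, MAX_STEPS)
    for size in range(MIN_WINDOW, largest + 1):
        yield build_input_vector(columns[:size]), targets[size]


# Six examinations: two windows (sizes 4 and 5) with a known next VA each.
exam_columns = [[step] * N_FEATURES for step in range(6)]
va_targets = [0.5, 0.5, 0.4, 0.4, 0.3, 0.2]
pairs = list(growing_windows(exam_columns, va_targets))
```

For a patient with six examinations, this yields two training pairs: the first four examinations predict the fifth VA value, and the first five predict the sixth.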

For our ophthalmologists’ study, two different sets of ophthalmologists were recruited for a first evaluation study: our main ophthalmologist (ophthalmologist I) and eight other ophthalmologists to further validate the results of ophthalmologist I (ophthalmologist set II). Our motivation was to provide the main ophthalmologist with the full test data set, while group II received randomized subsets of our test set. We also aimed at including multiple participants, yet time and effort were limiting factors. All participants were selected from the Department of Ophthalmology at the Klinikum Chemnitz gGmbH in Chemnitz, Germany, and all of them have subspecialty training in eye diseases. The training of ophthalmologist I is at fellowship level. Participant group II spans several training levels (four at fellow level, three specialists, and one senior specialist for retinal surgery). Their experience is assumed based on their respective training levels. The main ophthalmologist is an author of this study and thus potentially has notably more experience in the field of AMD/DME/RVO than his training level would otherwise imply; the senior specialist is an author as well. Within our experiment, the ophthalmologists utilized our dashboard for their predictions (Fig. 1), for which the task was to predict the visual acuity value for the next point in time. The test data set for ophthalmologist I contained 1494 samples, while the eight additional doctors received randomized subsets of the test set of ophthalmologist I, whereby each subset contained between 50 and 100 samples. Participants were presented with the following medical information, which is also visible from our dashboard: the past visual acuity values, central retinal thickness, OCT biomarker documentation, diagnoses (AMD, DME, RVO, and also others), general information (gender and age), IVOMs, and additional medical data. The AI models were provided with precisely the same information (cf. Table 3).

Test results, evaluation, and discussion

The following sections provide first our evaluation regarding the predictive statistics of therapy winners, stabilizers, and losers (Section “Predictive statistics”). To allow the inclusion of OCT biomarkers into the patient progression and VA modelling process, incomplete or missing OCT documentations are completed (Section “OCT biomarker classification”). Given each patient’s medical data, the following patient progression and VA prediction utilizes the resulting OCT biomarker completions (Section “Visual acuity prediction”).

Figure 4

(a–c) Real-world winner, stabilizer, and loser distribution (WSL classification scheme) for exudative AMD, RVO, and DME. (d) Distribution for the disease DME under the exemplary medical co-factor of an epiretinal membrane. The shown results are based on our disease statistics (Fig. 2), where \(\textbf{N}\) denotes the number of eyes for the given number of patients.

Table 4 Summary of the statistical test results regarding the significant deterioration of the visual acuity.

Predictive statistics

For the following predictive statistics, a statistical analysis was conducted for our diseases exudative AMD, RVO, and DME. The progression of the VA was classified into therapy winners, stabilizers, and losers (WSL classification scheme) based on the first and the last VA measurement of each patient (Section “Category-centered data organization and descriptive statistics”). The size of our data corpus and its harmonization as described in Sections “System architecture”, “Text processing and ontology”, and “Data fusion and cleaning” allow different kinds of statistical surveys, e.g., separated according to disease, time periods, and comorbidities. The data originate from a large hospital in daily clinical operation (German hospital of maximum care level) and thus reflect the effects of real-world scenarios.

Our statistical investigations consequently allow us to make statistical predictions under real-life conditions for questions such as “If a patient has exudative AMD, what are the future prognoses for this patient?”. For the three aforementioned diseases, the outcomes are shown in Fig. 4. Especially for AMD, a deterioration of the visual acuity over time is observable on average. A more fine-grained analysis reveals effects over time since we split the data of a disease into patients with short and longer progression. The time course refers to the time difference (in years) between the first and the last VA measurement of individual patients. The total data of \(\textbf{N} = \text {1050}\) eyes of AMD is divided into 4 substatistics with different time windows, whereby, for example, the first time bin of under 1 year contains about 25% of the data. We found that, with longer disease time courses, the proportion of losers increases further, reaching \(\ge 60\)% for time courses of more than 6 years. Note that our WSL group definition using \(\Delta \textit{VA}_\textit{logMAR}\) thresholds of 0.1 is fixed for all time windows, which might be regarded as a somewhat harsh criterion for long time scales. We performed two-sample Student’s t-tests to analyze the statistical significance of the deterioration of the visual acuity, shown in Table 4. To avoid thresholding effects, we performed the tests directly on the raw delta logMAR values. We found a strongly significant effect for the disease AMD (\(p \le 0.0001\)), no significance for RVO (\(p = 0.0607\)), and a weakly significant effect for DME (\(p = 0.0016\)). The full set of statistical test results can be found in Table 7. In addition, we employed Cohen’s d, which measures the normalized strength of the effect, i.e., the amount of the increase in the therapy loser fraction (Section “Category-centered data organization and descriptive statistics”). We observed a deterioration of the visual acuity in AMD with a medium to large effect (Table 4), while the other diseases show smaller effects. This means that more and more patients experience a deterioration of vision over longer periods of time, especially for AMD.
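As an illustration of the two statistics named above, the following minimal sketch computes a two-sample Student’s t statistic (pooled variance) and Cohen’s d from two lists of raw delta logMAR values. It is not the analysis code of this study: the function names, the choice of comparing two time-course bins, and the example values are assumptions made for illustration only. The p-value is omitted because it requires the t-distribution (e.g., via `scipy.stats.ttest_ind`).

```python
import math
from statistics import mean, variance


def pooled_std(a, b):
    """Pooled standard deviation of two independent samples."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return math.sqrt(pooled_var)


def student_t(a, b):
    """Two-sample Student's t statistic, assuming equal variances."""
    n1, n2 = len(a), len(b)
    return (mean(a) - mean(b)) / (pooled_std(a, b) * math.sqrt(1 / n1 + 1 / n2))


def cohens_d(a, b):
    """Cohen's d: mean difference normalized by the pooled standard deviation."""
    return (mean(a) - mean(b)) / pooled_std(a, b)


# Hypothetical delta logMAR values (last minus first VA) for two time bins;
# positive values mean a worse visual acuity at the last measurement.
short_course = [0.0, 0.1, -0.1, 0.2, 0.0]
long_course = [0.3, 0.2, 0.4, 0.1, 0.5]

t_stat = student_t(long_course, short_course)
d = cohens_d(long_course, short_course)
print(f"t = {t_stat:.3f}, Cohen's d = {d:.3f}")
```

Both quantities share the pooled standard deviation, so the t statistic is Cohen’s d rescaled by the sample sizes: \(t = d / \sqrt{1/n_1 + 1/n_2}\).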

The representation of our diseases in combination with medical co-factors (comorbidities) is shown in Fig. 4d and serves as a proof of concept. It illustrates the influence of an epiretinal membrane (ERM) on the disease DME. Yet, if DME and an epiretinal membrane occur simultaneously, it becomes apparent that only about 12–25 patients are included in each substatistic, so that the direct comparison with the DME-only group would not yet stand up to statistical tests (Table 4). For such substatistics, more data will be needed in the future, e.g., by merging several ophthalmic hospitals into a common research data infrastructure.

Table 5 OCT biomarker classification results for the slice-wise as well as the scan-wise OCT biomarker classification.
Table 6 Visual acuity modelling results overview of our approaches with feature vectors containing visual acuity values only/additional medical data in the form of our completed OCT biomarker documentations (Table 3b), and our annotations (ophthalmologists).

OCT biomarker classification

Table 5 shows our classification results for the slice-wise and the scan-wise OCT classification using different prominent approaches from machine learning and deep learning, averaged over five runs. In comparison, the selected models DenseNet-20145 and ResNet-152V246 show the best classification accuracies in F1-score, with mean classification accuracies of 81.8% and 82.3% for our 6 OCT biomarkers, respectively. While the biomarkers RPE and subretinal fibrosis are best classified with the network DenseNet-201, the best results for the four other biomarkers ELM, ellipsoid, foveal depression, and scars are shown by the ResNet-152V2 network. When comparing the results for the different OCT biomarkers, ELM and ellipsoid are classified most accurately from the OCT slices, whereas RPE and subretinal fibrosis are the more challenging OCT biomarkers with reduced accuracies. For the subsequent scan-wise OCT classification, a random forest classifier44 was sufficient to achieve high classification accuracies over all biomarkers, with single scores of up to 99.9%. These scores were obtained for the biomarkers ELM, ellipsoid, and foveal depression. Finally, we obtain the best mean classification accuracies in F1-score over all OCT biomarkers of 82.3% (slice-wise, ResNet-152V2) and 98.2% (scan-wise, random forest classifier).

Figure 5

VA modelling results’ precision (P) and recall (R) plot for our annotations (main ophthalmologist) as well as our approaches to VA prediction with MLP and MLP-LDA based on our WSL classification scheme (see also Table 8).

Visual acuity prediction

The principles of VA prediction are illustrated in Fig. 3 for an exemplary patient whose first diagnosis was cataract in both eyes and DME in the right eye in 02/2014. Shown is the VA progression over a timespan of 1 year, in which the patient had 6 IVOMs with Eylea and Lucentis. For visual acuity prediction, we consider the subsequences of measured visual acuity values with their additional data for each patient, for which the future visual acuity value is predicted. Our model predicts the \((i + 4)\)th VA value from a time window of the previous four VA measurements, whereby we use a growing time window of size 4 up to a size of 10 VA measurements, e.g., the minimum interval \([i, i + 3]\) with \(i = 0, 1, \ldots , i_\textit{max} - 4\). This approach has been selected to account for the different sizes of visual acuity sequences. A minimum window size is enforced in order to enable a more reliable prediction, whereby patients with an insufficient number of measurements are not considered. Additionally, a maximum window size was defined as, in our experience, larger window sizes can lead to reduced prediction scores. The last documented VA measurement is defined to be \(i_\textit{max}\). The model uses the medical patient data within the aforementioned growing window, which are reformatted as a data input matrix as shown in Table 3b. For instance, the 5th VA value will be predicted based on the time interval provided by the 1st to the 4th VA measurement, whereas the 6th VA value will be predicted via the 1st to the 5th measurement. For evaluation, the horizontal lines indicate our thresholds for therapy winners, stabilizers, and losers (see also Section “Predictive statistics”). Finally, a classification based on our WSL classification scheme is carried out for our predicted visual acuity values. The obtained WSL-based classification is then compared to the original classification of our ground truth.
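The growing-window scheme described above can be sketched as follows. This is a minimal illustration rather than our actual pipeline: the function name `growing_windows` is hypothetical, only the VA values are windowed (the additional medical data of Table 3b are left out), and the behaviour once the maximum size of 10 is reached (keeping the 10 most recent measurements) is an assumption.

```python
def growing_windows(va_values, min_size=4, max_size=10):
    """Build (history, target) pairs from one chronologically ordered VA series.

    The history starts with the first `min_size` measurements and grows by one
    measurement per step. Assumption: once it holds `max_size` measurements,
    the oldest one is dropped so that the most recent `max_size` remain.
    """
    samples = []
    for target_idx in range(min_size, len(va_values)):
        start = max(0, target_idx - max_size)
        history = va_values[start:target_idx]
        samples.append((history, va_values[target_idx]))
    return samples


# Hypothetical logMAR series with 12 measurements for a single eye
series = [0.5, 0.4, 0.4, 0.3, 0.3, 0.4, 0.5, 0.5, 0.6, 0.6, 0.7, 0.7]
for history, target in growing_windows(series):
    print(len(history), "->", target)
```

A series with fewer than five measurements yields no samples, which mirrors the exclusion of patients with too few measurements.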

Evaluation principles

Out of 49,000 patients, 7878 patients with VA series of length \(i_\textit{max} \ge 5\) exist within our data, resulting in over 100,000 separate VA series of length 5. This minimum sequence length of 5 VA measurements has been selected as, in our experience, shorter sequences may not provide reliable data for the visual acuity prediction. With the three diseases AMD, DME, and RVO, 1496 patients with VA series of length \(\ge\) 5 exist, resulting in 14,026 separate VA series of length 5. For evaluation, all visual acuity values with their time steps are considered. Thus, all predictions as shown in Fig. 3 are utilized, for which all time steps are evaluated regarding their resulting WSL-based classification. The VA-based prediction accuracy is calculated via all VA predictions and related local \(\Delta \textit{VA}_i\). For the modelling process, the completed OCT documentations (Section “OCT biomarker classification”) and the related additional data are retrieved. For this purpose, Table 3 gives an extensive overview of the leveraged data set as well as the related data organization.
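The step from a local \(\Delta \textit{VA}_i\) to a WSL class can be illustrated with a short sketch. The threshold of 0.1 logMAR is taken from Section “Predictive statistics”; the function name, the inclusive handling of the boundary, and the rounding used to suppress floating-point noise are assumptions. Since lower logMAR values correspond to better vision, a decrease is counted as an improvement.

```python
def wsl_class(va_prev, va_next, threshold=0.1):
    """Map the change between two logMAR values to a WSL class.

    Lower logMAR means better vision, so a negative delta is an improvement.
    Assumption: deltas exactly at the threshold count as winner/loser.
    """
    # Round to suppress floating-point noise such as 0.3 - 0.2 != 0.1
    delta = round(va_next - va_prev, 6)
    if delta <= -threshold:
        return "winner"
    if delta >= threshold:
        return "loser"
    return "stabilizer"


print(wsl_class(0.5, 0.3))   # improvement by 0.2 logMAR
print(wsl_class(0.3, 0.35))  # change below the threshold
print(wsl_class(0.3, 0.5))   # deterioration by 0.2 logMAR
```

Applying the same function to the predicted and to the measured next VA value gives the two class labels that are compared during evaluation.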

Visual acuity prediction results

Table 6 shows our prediction results using different statistical approaches as well as prominent approaches from machine learning and deep learning, averaged over five runs. These approaches encompass estimators such as statistical estimators and moving average estimators (MA), regressors, recurrent neural networks, and multilayer perceptrons. Our statistical estimator predicts visual acuity progressions utilizing the statistical distribution of our WSL classification scheme within our train set. The MA estimator averages the given window of VA values, whereas the weighted MA estimator weights recent VA values more strongly. We consider the statistical estimator, the MA estimator, and the weighted MA estimator in order to formulate baseline predictions.
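The two moving average baselines can be sketched in a few lines. The linearly increasing weights of the weighted variant are an assumption for illustration; the text only states that recent VA values are weighted more strongly, and the function names are hypothetical.

```python
def moving_average(window):
    """Predict the next VA value as the plain mean of the window."""
    return sum(window) / len(window)


def weighted_moving_average(window):
    """Predict the next VA value with linearly increasing weights.

    Assumption: weights 1, 2, ..., n, so the most recent value counts most.
    """
    weights = range(1, len(window) + 1)
    return sum(w * v for w, v in zip(weights, window)) / sum(weights)


# Hypothetical window of four logMAR values with a worsening trend
window = [0.2, 0.4, 0.6, 0.8]
print(moving_average(window))           # plain mean, approx. 0.5
print(weighted_moving_average(window))  # pulled towards recent values, approx. 0.6
```

For a worsening series, the weighted estimate lies closer to the latest measurement than the plain mean does.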

Whereas our baseline approaches, i.e., our estimators and regressors, result in prediction accuracies of up to 40.8% in macro average F1-score (bagging regressor44), a more realistic setting includes the completed OCT documentations with the additional data shown in our data organization table such as OCT biomarkers. With the OCT biomarkers included in the VA modelling process, our MLP-based predictor achieves the second highest prediction accuracy of 44.6%, an improvement by \(+4.4\)% (Table 6). The completed OCT biomarkers can therefore be modelled as crucial information and influential visual factors even when no OCT classifications are provided.

With MLP-LDA, we obtain a final prediction accuracy of 69% (Table 6), which corresponds to an improvement by \(+24.4\)% for MLP-LDA in comparison to MLP. The ophthalmologist reaches a score of 57.8%. Figure 5 gives an overview of our obtained main results in precision and recall for therapy winners, stabilizers, and losers. Considering the class-wise scores, MLP-LDA strikes a balance between all three progression groups, whereas a trade-off between therapy winners and losers with therapy stabilizers is observable in comparison to MLP. Finally, Table 8 shows an extensive VA modelling results overview with confusion matrices and class-wise recall and precision results of all VA modelling experiments with MLP, MLP-LDA, and the human reference annotation results from the ophthalmologist. We conclude that treatment winners and losers are predicted within the same range as the ophthalmologist in both recall and precision, which is a promising result. However, treatment stabilizers are predominantly present when observing the visual acuity values of adjacent time steps. For this reason, an improvement of the VA prediction for stabilizers has to be enforced to realize a (semi-)automated recommender system.
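To make the reported metric concrete, the following sketch computes a macro average F1-score from scratch (the same quantity as `sklearn.metrics.f1_score` with `average="macro"`). The label lists are a hypothetical toy example, not data from this study; they only show why a predictor that always answers “stabilizer” scores poorly in macro average F1 despite a high plain accuracy.

```python
def macro_f1(y_true, y_pred, labels=("winner", "stabilizer", "loser")):
    """Unweighted mean of the class-wise F1-scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy set: 8 stabilizers, 1 winner, 1 loser
y_true = ["stabilizer"] * 8 + ["winner", "loser"]
always_stabilizer = ["stabilizer"] * 10

accuracy = sum(t == p for t, p in zip(y_true, always_stabilizer)) / len(y_true)
print(f"accuracy = {accuracy:.3f}")
print(f"macro F1 = {macro_f1(y_true, always_stabilizer):.3f}")
```

In this toy case the plain accuracy is 0.8, while the macro average F1-score is only about 0.30, because the winner and loser classes each contribute an F1-score of zero.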

Finally, in order to further validate the annotations of the ophthalmologist, we evaluated the annotations of eight different additional ophthalmic doctors given randomized subsets of our test set. Table 9a shows their mean and standard deviations for precision, recall, and F1-score as well as their overall macro average F1-score. Additionally, their confusion matrices are depicted (Table 9b). We obtain a prediction accuracy in macro average F1-score of \(50.0 \pm 10.7\)%. The minimum and maximum scores are 37.7 and 69.4%, resulting in a range of 31.7%.

Figure 6

VA modelling results’ normalized confusion matrices for our annotations (main ophthalmologist) as well as our approaches to VA prediction with MLP and MLP-LDA based on our WSL classification scheme, respectively. Note, each confusion matrix shows a single, randomly selected run.

Table 7 Full statistical test results.
Table 8 VA modelling results overview of the ophthalmologist, MLP, and MLP-LDA.
Table 9 VA modelling results overview of our control experiment with eight different additional ophthalmic doctors given randomized subsets of our test set.

Discussion

Better performance due to the completion of OCT biomarkers. In our work, we developed a multistage system that completes previously incomplete OCT biomarker documentations by means of learning-based approaches, which are then utilized for the following visual acuity prediction. This approach enabled us to include additional data in the AI modelling process that were previously not available, thereby achieving an improved model performance. The OCT biomarker classification recognizes OCT biomarkers based on the OCT B-scan images for patients where OCT biomarkers were previously not available within the electronic health records, resulting in completed OCT biomarkers (Section “OCT biomarker classification”). Subsequently, these completed OCT biomarker documentations are exploited in our MLP-LDA system, together with the already existing OCT biomarker information, the visual acuity, and other medical data, to predict the visual acuity (Section “Visual acuity prediction”). The ophthalmologists, however, had access only to the already existing OCT biomarker documentations and could only utilize these data. Moreover, in larger hospitals, different ophthalmologists are involved in the diagnosis. They often have to rely on previous documentations, including previous OCT biomarker diagnoses. In addition, fellow-level ophthalmologists are often involved in the diagnosis to create OCT biomarker documentations. Thus, even the existing data might not have the best quality. Some specialists work around these weaknesses by directly analyzing the OCT B-scan images, too. We performed control experiments to quantify the magnitude of the effect of our novel, completed OCT biomarker documentations. The full system achieves a performance of 69.0% F1-score (Table 6). When excluding the completed OCT biomarker documentations, we obtain an accuracy of 62.8%. Hence, the OCT biomarker completion accounts for a noticeable improvement of ca. \(+6\)% F1-score. Within our (semi-)automated context, we therefore see the ability to complete OCT biomarker documentations as a major benefit of our system. Finally, out of over 49,000 patients with overall more than 130,000 examinations, we identified approximately 1500 AMD/DME/RVO patients with about 15,000 relevant examinations. It can be assumed that a learning-based approach such as ours leverages the latent knowledge within our data corpus.

Differences in therapy winner/stabilizer/loser classification. Table 8 shows that our test set consists mostly of patients with progressions of the category therapy stabilizers given the visual acuity values of adjacent time steps (ca. 80%). For this reason, the MLP model has learned that, on average, a stabilizer is to be expected. Therefore, most classifications for therapy winners and losers are incorrectly assigned to the class of therapy stabilizers. Analogously, our trained ophthalmologist knows of this general distribution within our classification scheme. The MLP-LDA approach has learned a more fine-grained classification than MLP due to the subsequent correction/control stage, for which we observe a shift from mostly classified therapy stabilizers towards therapy winners and losers (Fig. 5 and Table 6). This allows us to achieve a generally improved classification performance in macro average F1-score, which explains the mechanism and possible benefits of our correction stage using MLP-LDA. In turn, all ophthalmologists show a sizeable advantage in classifying stabilizers with improved precision, recall, and F1-scores (Tables 8 and 9). In comparison to MLP, it is evident that MLP-LDA excels at differentiating therapy stabilizers from winners and losers, demonstrated by an improvement in macro average F1-score and class-wise precision/recall. Future work could thus focus on therapy loser modelling, as therapy losers are the relevant subgroup where a better treatment handling would be important. However, this requires that further therapy options can be reliably modelled for this group to recommend a potentially better therapy option. As mostly therapy stabilizers are present, the ophthalmologist obtains the best accuracy in true positives with 82.2% in comparison to MLP-LDA with 77.2% (see also Table 8). The other eight ophthalmologists score \(70.1 \pm 5.9\)% in true positives, which signals results within the same range. We conclude that, depending on the evaluated progression group as well as the related evaluation metrics, different advantages and disadvantages in prediction accuracies can be observed.

The influence of medical features on the visual acuity prediction accuracy. To understand the influence of single components, we performed experiments that include or omit particular medical features. In Table 6, we have chosen a 3-step approach. Firstly, we started with an input of visual acuity values only to our system, where a performance of 40–45% F1-score shows that time and visual acuity history already have a certain prediction potential. To give an idea of the scale, the annotating ophthalmologists achieve 57.8% and \(50 \pm 10.7\)% F1-score given a constant IVOM therapy scheme. Secondly, we added all medical features (cf. Table 3c) to our models, whereby two of our models, MLP and MLP-LDA, benefited from it distinctly, reaching about 45% and 69% F1-score, respectively. In the last step, we omitted the OCT biomarker documentations, which resulted in a decrease in prediction accuracy of a few percentage points, thus highlighting the influence of OCT biomarkers. The analyses were conducted separately for the existing OCT biomarkers and for the completed OCT biomarkers (Table 6). The omission/inclusion of OCT biomarkers in combination with the tracking of the evaluation metrics helps to quantify their predictive potential. However, we did not yet explore the full potential of biomarkers such as DRIL and hyperreflective materials, which are relevant for DME. This will be a part of our future work. The omission/inclusion of particular medical features is similarly relevant for understanding the cues of medical features and, by means of AI technology, for deepening the understanding of the field of ophthalmology.

Differences between the AI and ophthalmologists. Although we have tried to create the same starting conditions for the comparison of ophthalmologists and AI, this is not always achievable. In the following, we highlight some of the occurring difficulties and differences. The proposed learning-based system completes OCT biomarkers during its evaluation process, whereas the ophthalmologist does not have access to them. For a fairer comparison, we have to compare the performance of MLP-LDA without the completed OCT biomarkers (62.8% macro average F1-score) with the ophthalmologists (57.8% and \(50 \pm 10.7\)% F1-score). This is possible since we have also benchmarked MLP-LDA without the completed OCT biomarkers (see also Table 6). Furthermore, the ophthalmologists were not trained beforehand on the same number of data vectors as our AI system, which was trained on about 1200 patients with at least five visual acuity examinations (80% train set out of 1496 total patients). A daily clinical routine might provide some of the highly experienced ophthalmologists with such numbers of patients, while recently trained ophthalmologists might not have seen such a number of patients yet. Our ophthalmologists have different experience and education levels. On the other hand, the ophthalmologists have completed a medical school or university education over several years, which the AI system, logically, does not have. Patient data in the dashboard view were not provided to them for training but instead only for their visual acuity prediction. As the training levels of the ophthalmologists are heterogeneous, their results were expected to vary, in addition to the variability already expected when working with human subjects. AI models and ophthalmologists both had to deal with data incompleteness. Humans may show high performance variability even on the same task in behavioral experiments, depending on factors such as their form on the day, attentional state, motivational level, and mental workload. Finally, the ophthalmologists do have background knowledge on how to interpret time series data. For example, they could calculate or estimate the duration of diabetes from DME diagnoses as a risk factor, which was not directly provided in our case as we provided some information but no duration of diabetes. For longer time series, this knowledge could prove to be an advantage for ophthalmologists. However, background knowledge might also bias ophthalmologists to predict more stabilizers, in line with the distribution within our test set, whereby stabilizers are the largest progression group for our experimental setting (comparing t and \(t - 1\) to determine the resulting WSL classification). Note, however, that no ophthalmologist was informed about this distribution within our test set.

Explainable AI and training of human raters. To quote from “Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI” by Arrieta et al.: “... in Machine Learning, the entire community stands in front of the barrier of explainability, an inherent problem of the latest techniques brought by sub-symbolism (e.g. ensembles or Deep Neural Networks) that were not present in the last hype of AI (namely, expert systems and rule based models).”47. We understand that comprehensibility could foster theoretical improvements in the field, which requires representing our models’ learned knowledge in a human-understandable fashion. However, in the context of MLP and LDA as well as high-dimensional data vectors, we believe this to be a distinct, non-trivial effort and hence beyond the scope of this work. Training physicians by using the cues that the models use is therefore not possible at the moment. Even decision trees, which yield comprehensible rules, can be too complex to derive straightforward rules given high-dimensional data such as ours. An approach to determine the possible influence of single components, i.e., medical features, via their omission/inclusion has been performed in our experiments (Section “Visual acuity prediction”), which provides some degree of comprehensibility. For the image classification of OCT biomarkers using deep neural networks, on the other hand, gradient-weighted class activation mapping (Grad-CAM) could be applied48,49 to identify the regions of interest within the OCT images where neurons were especially active. This will be a part of our future studies. However, the aggregated visualization within a dashboard such as ours could already be utilized to train physicians with (i) more patient progressions, (ii) more specific information such as the present type of a disease or treatment, or (iii) lengthy progressions that would be rarely seen otherwise.

Conclusion and outlook

In this contribution, we developed an IT system architecture that aggregates patient-wise information for more than 49,000 patients from different categories of various multimedia data in the form of text and images within multiple heterogeneous ophthalmic data resources originating from a German hospital of maximum care. As the prediction of a patient’s progression is challenging within this real-world setting, our realized workflow allows a first processing of medical patient data to enable an OCT biomarker classification, a visual acuity prediction, as well as a general statistical evaluation and visualization. For this purpose, our developed patient progression visualization and modelling dashboard enables the visualization, annotation, and assessment of patient progressions with a focus on their visual acuity.

The resulting data corpus allows predictive statements of the expected progression of a patient and his or her visual acuity in each of the three diseases AMD, DME, and RVO. Our data reveal that especially exudative AMD results in a notably high proportion of therapy “losers” (60% regarding a time span of 3 to 6 years). The result for AMD is significant. Furthermore, we found a weakly significant deterioration of visual acuity for DME, while we found no significant deterioration for RVO. A more fine-grained analysis is able to reveal the influence of co-existing medical factors such as other diseases. As a proof of concept, we show DME with an epiretinal membrane as an example. Yet, the available data are still insufficient to derive reliable correlations for statistical surveys of comorbidities in combination with different observation time windows.

For the following visual acuity based treatment progression modelling, incomplete OCT documentations are completed by classifying the OCT scans’ slices (OCT B-scans), which in turn allows the classification of OCT scans when only single OCT slices are provided. Based on the obtained OCT slice classifications, a scan-wise OCT classification of the OCT biomarkers ELM, ellipsoid zone, foveal depression, RPE, scars, and subretinal fibrosis resulted in an overall classification accuracy of over 98% in F1-score. Finally, the completed OCT documentations are combined with additional medical data, defining our ophthalmic feature vectors for visual acuity prediction. In comparison to different approaches from machine learning and deep learning, we achieve a final prediction accuracy of 69% in macro average F1-score with 77.2% true positives, while our main ophthalmologist shows a macro average F1-score of 57.8% with 82.2% true positives. In order to further validate these results, we evaluated the annotations of eight different additional ophthalmic doctors given randomized subsets of our test set, resulting in an overall macro average F1-score of \(50.0 \pm 10.7\)% and with \(70.1 \pm 5.9\)% true positives.

However, as the influence of the OCT biomarkers is not yet fully understood, further investigations have to be conducted, for which additional OCT biomarkers as well as their influence for the visual acuity modelling process have to be evaluated. Future contributions can build on these initial results in order to determine an optimal time for a change in medication or therapy. This also encompasses treatment options such as laser coagulation, pars plana vitrectomy, or phacoemulsification with posterior chamber lens implantation. We furthermore aim at extending our approach to include a larger data corpus through distributed analysis across multiple ophthalmic sites. Thus, data quality needs to be ensured via comprehensive evaluations of our medical texts structured by rule- and learning-based NLP methods, which requires further harmonization of the underlying medical terminology. Patient-related data of the different categories available and their relevance for the modelling process have to be further investigated in order to increase the evidence of AI-based modelling approaches to enable future realizations of a (semi-)automated recommender system.