Introduction

Corneal opacities are a major cause of blindness worldwide, ranked among the top five causes of blindness1. The major cause of corneal opacities is infectious keratitis2, and slit-lamp examinations are the gold standard method not only to diagnose infectious keratitis but also to identify the causative pathogen. However, the accuracy in identifying the causative pathogen is low even for board-certified ophthalmologists, including corneal specialists. Laboratory culture tests are essential for identifying the causative pathogen, but culturing and identification can take weeks. PCR examinations are also effective, but they are not universally available.

The major pathogen categories in infectious keratitis are bacteria, fungi, acanthamoeba, and viruses such as herpes simplex virus (HSV).

Thus, the purpose of this study was to develop a hybrid deep learning (DL) algorithm that can determine the causative pathogen category in eyes with keratitis with a high probability score by analyzing slit-lamp images. To accomplish this, we used facial recognition techniques3, because facial images are likewise recorded from different angles, at different levels of illumination, and at different resolutions.

Using this approach, we determined probability scores for the pathogen category causing the keratitis, which can be used for machine learning classification. This DL-based diagnosis should be able to determine the pathogen category with high accuracy. Such identification could avoid inappropriate treatments at the early stage of infection, leading to an improvement of the visual outcomes.

Results

To obtain images of infectious keratitis, all 669 consecutive cases of suspected infectious keratitis that were referred to the Cornea Outpatient Clinic of the Tottori University Hospital between August 2005 and December 2020 were assessed for diagnosis based on the diagnostic criteria. The top 4 categories of causative pathogens were bacteria, fungi, acanthamoeba, and HSV, and we focused on these 4 categories. The images of 362 cases with a definite identification of the causative pathogen were used for the analyses (Supplementary Fig. 1). Based on the criteria, each of the 362 cases was identified as belonging to one of the four categories of infectious keratitis.

The mean age of the patients whose images were used was 59.4 ± 21.8 years, and 201 cases (55.5%) were men. Of the 362 cases, 225 cases were bacterial keratitis, 76 were HSV, 42 were fungal, and 19 were acanthamoeba keratitis.

We first developed a DL algorithm based on ResNet50 (Fig. 1a) for diagnosing a single image. To obtain pathogen probability scores for each category of disease for classification, the DL algorithm was trained using the Ring loss-augmented softmax function, which is known to be highly effective for large-scale facial recognition tasks4.

Figure 1
figure 1

Architecture of deep learning (DL) algorithm to determine the causative pathogen by analyzing slit-lamp images and comparison of diagnostic accuracy with expert clinicians. (a) Architecture of the deep learning algorithm at the development stage based on ResNet50. An input image is classified as 'bacterial' or 'non-bacterial' (1st classifier). An image classified as 'non-bacterial' is then classified as 'acanthamoeba', 'fungal', or 'HSV' (2nd classifier). Nμ, weighted average; HSV, herpes simplex virus. (b) The KeratiTest window is shown for the 20th question. The accuracy of the clinicians' answers was compared to that of the algorithm. The algorithm at the development stage (a) outperformed all the sessions with clinicians. N = 35. (c) Ensemble architecture of the deep learning algorithm based on InceptionResNetV2. The probability scores of each causative pathogen were calculated by feature normalization using softmax with Ring loss. An image classified as bacterial was also connected to a second classifier to obtain probability scores of acanthamoeba, fungi, and HSV. The pathogen probability scores, the argmax of the pathogen probability scores of the 2-step classifier, the pathogen probability scores of the second classifier, the argmax of the pathogen probability scores of the second classifier, and the argmax of the pathogen probability scores for fluorescein-stained images were used as feature values for learning by the gradient boosting decision tree (GBDT).

The DL algorithm was trained using 1426 images collected before March 14, 2019. The flow chart for the analyses is shown in Supplementary Fig. 2. The diagnostic accuracy of multiclass classification for each category of disease was assessed using 140 single test images that were not used for the training.

To compare the diagnostic accuracy of the AI and clinicians, we solicited 35 board-certified ophthalmologists throughout Japan, including 16 faculty members specializing in corneal diseases. We assessed their diagnostic accuracy using a diagnostic application software named “KeratiTest” in which the AI algorithm diagnosed the single images (Fig. 1b). When the multiclass diagnostic accuracy was assessed, the algorithm outperformed the expert clinicians in all of the sessions of 20 images (Fig. 1b).

For the diagnosis of bacteria and non-bacteria, the area under the curve (AUC) was 0.82 for the algorithm and 0.58 for the ophthalmologists. For the diagnosis of acanthamoeba keratitis, the AUC was 0.84 for the algorithm and 0.59 for the ophthalmologists. For fungal keratitis, the AUC was 0.78 for the algorithm and 0.52 for the ophthalmologists, and for the diagnosis of HSV, the AUC was 0.73 for the algorithm and 0.59 for the ophthalmologists. Thus, the algorithm outperformed the ophthalmologists for all the causative types of keratitis.

Clinically, a diagnosis by slit-lamp examination is typically made by examining different images, including those with different angles of view, different types and levels of illumination, and with or without staining of the corneas. Thus, increasing the number of different images viewed should improve the diagnostic efficiency. Therefore, we used up to 4 different recording conditions as a batch for learning and calculated the probability scores of each pathogen using normalization by the Ring loss-augmented softmax function as score-level features (Fig. 2). For decision-level feature values, the argmax of the pathogen probability scores for the 2-step classifier, the 2nd classifier, and the fluorescence 2nd classifier were used (Fig. 1c).
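The assembly of score-level and decision-level features for a batch of up to 4 serial images can be sketched as follows. This is a minimal illustration under assumed names (`build_features`, and the `NaN`/`-1` padding for missing images in a batch is our assumption); the actual feature set follows Fig. 1c.

```python
import numpy as np

# Category order matching the coding 0=aca, 1=bac, 2=fun, 3=her (Fig. 4).
PATHOGENS = ["aca", "bac", "fun", "her"]

def build_features(batch_scores, max_images=4):
    """Build score-level and decision-level features for a batch of up to
    4 serial images of the same eye.

    batch_scores: list of per-image probability-score vectors (one score per
    pathogen category), as produced by the Ring loss-augmented softmax.
    """
    features = {}
    for i in range(max_images):
        if i < len(batch_scores):
            scores = np.asarray(batch_scores[i], dtype=float)
            # Score-level features: one probability per pathogen category.
            for name, s in zip(PATHOGENS, scores):
                features[f"{name}_{i + 1}"] = s
            # Decision-level feature: argmax coded as 0=aca, 1=bac, 2=fun, 3=her.
            features[f"Classifier2step_{i + 1}"] = int(np.argmax(scores))
        else:
            # Padding for batches with fewer than 4 images (an assumption here).
            for name in PATHOGENS:
                features[f"{name}_{i + 1}"] = np.nan
            features[f"Classifier2step_{i + 1}"] = -1
    return features
```

A dictionary of this shape, one entry per pathogen and image slot, is the kind of flat feature vector a GBDT consumes directly.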

Figure 2
figure 2

Representative images of each causative pathogen with 100% probability scores. Each pathogen probability score for a single image was calculated using softmax with Ring loss in the InceptionResNetV2 architecture (Fig. 1c) and is shown as confidence. The acanthamoeba image with high confidence shows a ring infiltrate located in the center of the cornea, while the unaffected corneal region is relatively clear and without edema. The image of bacterial keratitis with high confidence shows a dense infiltrate with intense corneal edema surrounding the lesion. The fungal image with high confidence shows a feathery infiltrate with satellite lesions, while the surrounding region is unaffected. The HSV image with high confidence shows a marginal ulcer with an epithelial defect. 'bac', 'aca', 'fun', and 'her' represent bacteria, acanthamoeba, fungi, and herpes simplex virus (HSV), respectively.

To further mitigate uncertainty inherent to the disease condition, all of the above feature values were learned for score-level and decision-level fusion using the gradient boosting decision tree (GBDT) machine learning algorithm (Fig. 1c).

The final model constructed based on InceptionResNetV2 was evaluated using group K-fold validations for 4306 images (3994 clinical and 312 web images). The overall accuracy of the multiclass diagnosis was 88.0%. The results of the confusion matrix are shown in Table 1.

Table 1 Accuracy of multi-class diagnosis and confusion matrix for infectious keratitis.

We next evaluated the diagnostic accuracy of each category of disease using binary classification of group K-fold validation. The diagnostic accuracy was 97.9% for acanthamoeba, 90.7% for bacteria, 95.0% for fungi, and 92.3% for HSV. When evaluated for diagnostic efficacy using ROC analysis, the AUC for acanthamoeba was 0.995 (95% CI: 0.991–0.998), for bacteria was 0.963 (95% CI: 0.952–0.973), for fungi was 0.975 (95% CI: 0.964–0.984), and for HSV was 0.946 (95% CI: 0.926–0.964) (Fig. 3).

Figure 3
figure 3

Receiver operating characteristic analysis of the hybrid deep learning-based algorithm. (a) Pathogen probability scores were calculated by softmax with Ring loss in the InceptionResNetV2 architecture. The pathogen probability scores and the argmax values of the pathogen probability scores of the second classifiers, two-step classifiers, and fluorescein-stained images were used for learning by the gradient boosting decision tree in batches of up to 4 serial images and validated by group K-fold evaluation of the final algorithm with the gradient boosting decision tree. The diagnostic accuracy of the binary classification was assessed by the area under the curve (AUC). The AUC showed high diagnostic accuracy for all the causative pathogens. (b) The 4306 images were randomly divided into 3882 training images and 424 testing images so that different images of the same eyes were in either the training or the testing group and not in both. The algorithm was initialized and retrained using the training images and assessed for the AUC using the test images in batches of up to 4 serial images. The AUC showed high diagnostic accuracy for all the causative pathogens, with some decrease observed for bacteria and fungi.

To validate the robustness of the algorithm to an unknown dataset, the algorithm was initialized and retrained using the training images. For this, all of the 4306 images were randomly divided into 3882 training images and 424 testing images so that different images of the same eyes were in either the training or the testing group but not in both.

Of all of the 4306 images, 1314 were fluorescein stained. Of the 3882 training images, 1190 were fluorescein stained. Of the 424 testing images, 124 were fluorescein stained.

We then calculated the diagnostic accuracy of each category of disease using binary classification of the test data set. For acanthamoeba, the diagnostic accuracy was 96.7%; for bacteria, 77.6%; for fungi, 84.2%; and for HSV, 91.7%. The AUC for acanthamoeba was 0.995 (95% CI: 0.989–0.999), for bacteria 0.889 (95% CI: 0.856–0.917), for fungi 0.889 (95% CI: 0.855–0.920), and for HSV 0.956 (95% CI: 0.933–0.974) (Fig. 3).

To further assess the robustness of the algorithm to the presence or absence of fluorescein staining, the algorithm was initialized and retrained using the 2692 training images without fluorescein staining. The algorithm was then tested on 300 images without fluorescein staining using binary classification. The diagnostic accuracy was 94.3% for acanthamoeba, 77.0% for bacteria, 83.0% for fungi, and 93.7% for HSV. The AUC was 0.991 (95% CI: 0.982–0.998) for acanthamoeba, 0.873 (95% CI: 0.829–0.911) for bacteria, 0.856 (95% CI: 0.805–0.899) for fungi, and 0.932 (95% CI: 0.891–0.967) for HSV (Supplementary Fig. 3). This indicated that the decrease in diagnostic efficacy was minimal although the number of training images was reduced.

Then, we tested the universality of the algorithm and its robustness to low-resolution images. Web images are universally available but typically have low resolution, which may compromise effective optimization because of their low-norm features. Therefore, the web test data set was used to assess the effectiveness of the algorithm using binary classification. For acanthamoeba, the diagnostic accuracy was 91.7%; for bacteria, 83.3%; for fungi, 88.9%; and for HSV, 97.2%. The AUC for acanthamoeba was 0.998 (95% CI: 0.990–1.0), for bacteria 0.913 (95% CI: 0.794–1.0), for fungi 0.969 (95% CI: 0.903–1.0), and for HSV 1.0 (95% CI: 0.999–1.0). This indicated that the web image classification outperformed the classification of the overall test images.

To understand the steps of diagnosis in the gradient boosting decision tree (GBDT) algorithm, the first 4 decision trees are shown in Fig. 4. The first tree used the bacterial probability scores of different images and set different thresholds for the classification. The second tree classified the images as acanthamoeba/bacteria or fungus/HSV; this step was then repeated, or the fungal probability score was applied for classification. The third tree classified HSV versus others; this step was repeated, or the fungal probability score was applied. The fourth decision tree classified acanthamoeba versus others and then applied the HSV or acanthamoeba probability score for the classification.

Figure 4
figure 4

Sequence of decision trees in the gradient boosting decision tree (GBDT) algorithm. Deep learning-derived probability scores (acanthamoeba: aca_, bacteria: bac_, fungus: fun_, HSV: her_) and the argmax of the pathogen probability scores (Classifier2step_) were utilized for effective classification by the GBDT. Acanthamoeba, bacteria, fungus, and HSV were coded as 0, 1, 2, and 3 for Classifier2step_. The numbers following “_” (1–4) indicate the serial number of images in the same batch. The first decision tree uses bacterial probability scores. The second decision tree classifies acanthamoeba/bacteria versus fungus/HSV and uses the fungal probability score. The third decision tree classifies HSV and uses the fungal probability score. The fourth decision tree classifies acanthamoeba and uses the probability scores of HSV and acanthamoeba.

Then, we assessed the importance of the feature values and their interactions using Xgbfir (https://pypi.org/project/xgbfir/, Supplementary Fig. 4). When the total gain was used as the importance score, the bacterial probability score (bac_1) had the highest importance, followed by the argmax of the pathogen probability scores (Classifier2step_1), the fungal probability score (fun_1), and the acanthamoeba probability score (aca_1) (Supplementary Fig. 4). These findings indicate the importance of the bacterial probability score. In contrast, the pathogen classifiers for fluorescein staining (FluoSecClassifier, the fluorescence 2nd classifier in Fig. 1c) were much lower in importance, suggesting that they complement the pathogen probability scores and thus mitigate the uncertainty of fluorescein staining.

Then, we assessed how effective each pathogen probability score was in classifying the pathogen categories in the GBDT. Histograms of the frequency of split values for the pathogen probability scores are shown in Supplementary Fig. 5. The split values of the bacterial probability score accumulated intensely around 0 (0%) and 1 (100%) compared to the intermediate probabilities between them. This characteristic distribution indicates a good signal-to-noise ratio. The other pathogen probability scores had similar characteristics, which supports the validity of the pathogen probability scores for classification. In addition, the split values for all 4 pathogen probability scores were more frequent around 0 (0%) than around 1 (100%). This indicated that diagnosis by exclusion was used more frequently for the classification.

Discussion

Recent advances in DL technology in the ophthalmic field have allowed rapid and accurate diagnosis of several retinal diseases. These advances have led to predictions of the prognosis, and they have also identified systemic markers of disease. Importantly, the diagnostic performance of DL algorithms was equivalent to, or even surpassed, the diagnostic abilities of trained clinicians.

Currently, the reported DL algorithms for the analyses of anterior segment slit-lamp images appear to be developed with the intention of screening common diseases as a substitute for clinicians. An Inception-based algorithm developed by Gu et al. classified a broad category of diseases including cataracts, neoplasms, non-infectious and infectious disorders, and corneal dystrophy5. This type of AI does not need to surpass the capabilities of clinicians but only to equal them. Another type of AI has been developed to surpass clinicians' diagnostic ability; however, reports on this type of AI for corneal diseases have been scarce.

Predicting the causative pathogen in infectious keratitis is one such representative challenge which needs to surpass the clinicians’ ability. Because of its vision-threatening nature, prompt and accurate diagnosis will benefit the patients and clinicians.

We calculated the pathogen probability scores after feature normalization using a DL algorithm, and then constructed a diagnostic algorithm using GBDT, a hierarchical series of decision trees (Fig. 4). GBDT is a machine learning algorithm that uses a successive series of decision trees for learning. In GBDT, the coefficient or weight of the first tree is adjusted by the second tree, which is further adjusted by a third tree, and so on. GBDT is well recognized for its high accuracy and efficacy in classification problems, and it has been used in many AI competitions including Kaggle. Thus, learning an effective classification algorithm should help clinicians diagnose more accurately.
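The additive, residual-fitting principle behind GBDT can be illustrated with a toy one-dimensional version that boosts decision stumps: each new stump fits the residual error left by the trees before it. This is a didactic sketch with hypothetical names (`fit_stump`, `gbdt_fit_predict`), not the XGBoost implementation used in the study.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the threshold split of x that best fits residual (least squares)."""
    best = None
    for t in np.unique(x):
        left, right = residual[x <= t], residual[x > t]
        if len(left) == 0 or len(right) == 0:
            continue  # skip degenerate splits
        pred = np.where(x <= t, left.mean(), right.mean())
        err = ((residual - pred) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    return best[1:]  # (threshold, left value, right value)

def gbdt_fit_predict(x, y, n_trees=10, lr=0.5):
    """Boosting: start from the mean, then let each stump correct the
    residual left by the ensemble built so far."""
    pred = np.full_like(y, y.mean(), dtype=float)
    for _ in range(n_trees):
        t, lv, rv = fit_stump(x, y - pred)
        pred += lr * np.where(x <= t, lv, rv)
    return pred
```

The learning rate `lr` shrinks each tree's contribution, which is the "coefficient adjusted by the next tree" behavior described above.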

In the decision-making process implemented by the GBDT (Fig. 4), we found that the bacterial probability score for the initial diagnosis was the most important decision (Fig. 4, 1st tree). For a correct diagnosis, the use of combinations of bacterial probability scores is indicated by the GBDT at the first stage. A different set of probability scores can augment the information to be learned because different illuminations, angles, or fluorescein staining serve complementary roles. This is also similar to the clinical decision-making process.

Clinically, an alternative to the bacterial probability score can be obtained by laboratory testing, including the outcomes of culture and smear tests. GBDT can also manage these important features together, if necessary, to improve the diagnostic accuracy of the algorithm.

The second tree in the GBDT classified fungal and HSV keratitis using the fungal probability score (Fig. 4, 2nd tree). Then, the 3rd tree classified the fungal keratitis from the HSV-suspected images, again using the fungal probability score. Clinically, this diagnostic process is facilitated by calcofluor or Fungiflora staining of the smear, considering its specificity. However, in our hands, the incorporation of staining into our slit-lamp image-based GBDT algorithm did not appreciably improve the overall diagnostic accuracy.

The fourth tree first classifies acanthamoeba keratitis (Fig. 4), then rules out the possibility of HSV infection using the HSV probability score. Non-acanthamoeba images are reexamined using the acanthamoeba probability score. This process illustrates the differential diagnosis of acanthamoeba and HSV. For example, in the early stage of acanthamoeba keratitis, pseudodendritic lesions are often observed masquerading as herpetic keratitis. This leads to the improper use of antiviral drugs or steroids. However, the acanthamoeba probability score of the slit-lamp images represented the characteristics of acanthamoeba infection well, and a high AUC was obtained (Fig. 3).

In the diagnosis of fungal keratitis, a relatively lower AUC was obtained (Fig. 3). The GBDT indicated the requirement for a differential diagnosis from HSV in the second tree.

For the diagnosis of HSV infection, real-time PCR is a very effective examination6. Thus, its incorporation into the GBDT as another feature should significantly improve the diagnostic accuracy, although the availability of PCR is limited in most clinical practices.

Generally, the diagnostic accuracy of identifying the causative pathogen by slit-lamp examinations is low for general ophthalmologists. It was surprising to learn the low diagnostic accuracy of expert ophthalmologists (Fig. 1b). This was also true for corneal specialists7. In our setting, the accuracy of identifying the four categories of pathogens averaged about 40% for board-certified ophthalmologists. This reflects the difficulty of identifying the causative pathogen in the real-world setting of a tertiary referral hospital.

For corneal diseases, the available literature on DL algorithms is still very limited5,7. This is in marked contrast to the abundance of retinal imaging AI. Compared to retinal images, the development of anterior segment image AI is hampered by several difficulties arising from differences in the acquisition of the images, as stated earlier, and the large number of clinical signs that need to be learned8. For example, when infectious keratitis images were assessed, the performance of a well-established DL framework, VGG16, trained on whole images was insufficient, and the overall accuracy remained at 55.24%7.

There are several factors that might explain why determining the causative pathogen was so difficult by examination of the slit-lamp images alone. One was the difficulty in extracting sufficient information from one image9. Another difficulty arises from differences in illumination or recording angles. This is in marked contrast to imaging of the fundus, in which the images are obtained at the same angle with similar quality.

To overcome such difficulties, several approaches have been used to improve the accuracy of the classifications. One approach is patch-level feature learning. Li et al. reported segmentation of the anatomical structures and annotation of the pathological features for deep learning8. They used 54 pathological features, including the presence of corneal edema, ulcer, corneal opacity, neovascularization, hypopyon, pterygium, and cataract8.

Xu et al. applied patch-level learning to classify infectious keratitis into bacterial, fungal, and herpetic stromal keratitis7. For this, the infectious lesion, conjunctival injection, and anterior chamber inflammation were annotated by manual drawings7. Using the patch-level classification outcomes, the accuracy was 52.5% for VGG167. To improve the classification accuracy, smaller lesions were randomly sampled from each patch, and the resultant sequence of smaller lesions was used as sequential features for a long short-term memory algorithm. Using the inner-outer sequential order patch algorithm, the accuracy of classification of bacterial keratitis was improved to 75.29%, and the AUC was 0.92.

However, the patch-level learning model has some drawbacks. One significant drawback is the requirement for manual drawing of the patches by clinicians. This requires great effort by the examiners and may introduce some bias in patch detection. Another is the issue of universality or robustness to low-resolution images. For example, it remains unclear whether the sequential order patch algorithm can perform equally well for fluorescein-stained images. In addition, softmax-based calculations are not robust to low-resolution images, photographs obtained with different angles of illumination, or distractors4,10.

Deep learning-based recognition has been used in many practical applications. Face recognition is one important field for security. However, face recognition is a challenging task because few samples per individual are available for training, and the quality, angle, and illumination of facial images differ. To overcome this problem, feature normalization using Ring loss was developed4. This allows the normalization of feature characteristics with a norm constraint toward a target value. Generally, the preservation of convexity in the loss function is known to be important for effective optimization of the network. Because Ring loss maintains the convexity of the softmax function, effective learning is achieved4. Moreover, the Ring loss approach is robust to large numbers of distractors, lower resolutions, and extreme pose (angle) images. Thus, Ring loss with softmax is a simple approach and does not need meticulous annotation by manual drawings.

Generally, the integration of multiple modalities can improve classification efficacy11. This can be conducted at the score level or the decision level. The score-level and decision-level fusion scheme has been shown to greatly improve discrimination efficacy in the field of multi-biometric verification, such as for authentication in banking11. To improve the classification efficacy in this study, the probability scores and decision steps were integrated using GBDT, which is versatile and efficient.

There are some limitations in our study. Our algorithm was developed based on more than 4000 images; however, the case numbers may still be limited, and the algorithm may not be applicable to geographic regions with epidemiologically different pathogenic species. Another limitation is that our algorithm classified only four categories of pathogens. For therapeutic purposes, this classification appears appropriate. However, we are aware that some pathogens have clinical characteristics that resemble those of other organisms. For example, the clinical characteristics of mycobacterial infection are somewhat similar to those of fungal keratitis. This difficulty can be overcome by training with more detailed classifications, which can be easily implemented by our algorithm.

In conclusion, we have developed an AI algorithm that can identify the causative pathogens of infectious keratitis. This algorithm outperformed the accuracy of clinicians. The development of this DL algorithm is important and may become the basis for the future development of auto-diagnosis by slit-lamp as well as the establishment of an efficient telemedicine platform for anterior segment diseases.

Material and methods

Demographics of patients

Clinical images were collected from 362 consecutive patients diagnosed with infectious keratitis caused by four categories of pathogens: bacteria, acanthamoeba, fungi, and HSV. All of the patients were examined between August 2005 and December 2020 at the Tottori University Hospital.

We collected the images from the medical records that were recorded by a digital photography system attached to the slit lamp. Photographs were taken with slit or diffuser illumination, or with blue light illumination to view fluorescein-stained corneas. Various slit lamps (SL130 (ZEISS) and SL-D7 (Topcon)) and cameras (EOS Kiss X7 (Canon), α6000 (Sony), and D90 (Nikon)) were used to ensure the diversity of the training samples. The images were obtained from eyes with active disease at different stages of the natural course of the disease process. Representative images from the 362 consecutive patients are shown in Supplementary Fig. 1.

Diagnosis of infectious keratitis for inclusion into enrolled clinical images

The diagnosis of the infectious keratitis was made by the clinical characteristics12,13, responsiveness to appropriate drug treatment, and detection of specific pathogens. The patients with data meeting these criteria were studied.

A confirmation on the responsiveness to a drug treatment was determined after standard treatment. For bacterial, fungal, and HSV keratitis, antibiotics, anti-fungal drugs, and anti-viral drugs, respectively, were used for the treatment. Antifungal drugs and chlorhexidine and/or polyhexamethylene biguanide were used to treat acanthamoeba infections.

To diagnose the cause of the infections, corneal samples were collected from all of the patients. To confirm the diagnosis, bacteria were identified using one or more of the following criteria: detection of bacteria in Gram-stained smears, a positive bacterial culture, and quantification of bacterial DNA using broad-range PCR for 16S rDNA14. Amplified 16S rDNAs were sequenced when necessary for identification of the bacteria at the species level.

A diagnosis of fungal keratitis was made by the detection of hyphae in the smear by fluorescent staining with Fungiflora Y or a positive fungal culture15,16. The clinical characteristics included dry-appearing infiltrates with feathered margins and multifocal infiltrates as satellite lesions. The causative bacteria and fungi are listed in Supplementary Table 1.

To diagnose acanthamoeba keratitis, the clinical characteristics and either the detection of cysts in the smear by staining with Fungiflora Y, a positive acanthamoeba culture, or quantification of acanthamoeba DNA using real-time PCR were required17,18. The clinical characteristics included pseudodendrites, radial keratoneuritis, multifocal stromal infiltrates, and ring infiltrates in the advanced stages19.

For the diagnosis of HSV infection, real-time PCR was used in all the cases to determine the HSV DNA levels6. In addition to the typical clinical findings of HSV keratitis of dendritic or geographical keratitis, atypical stromal keratitis or necrotizing keratitis cases were included based on the outcome of real-time PCR.

The final diagnosis was made after all the outcomes of the tests were obtained and patients were successfully treated. Images from cases with mixed infections were excluded from the analysis. To assure a correct diagnosis, three independent observers reviewed the medical records, and a uniform agreement was made in all cases.

The protocol for this study was approved by the Tottori University Ethics Committee. All of the procedures used in this study conformed to the tenets of the Declaration of Helsinki. Informed consent was obtained from all the participants.

Deep learning algorithm

For effective slit-lamp diagnosis of infectious keratitis, illumination by diffuse light and by cobalt blue-filtered light after fluorescein staining are required. Therefore, slit-lamp images with these illuminations were used to develop the DL algorithm.

To classify the images of infectious keratitis, we used a convolutional neural network (CNN) as the DL algorithm. The CNN was pretrained by transfer learning using the ImageNet image database. We used two types of pretrained CNN models: ResNet-5020 and InceptionResNetV221. The training and validation data were split using the group K-fold approach.

The final model was constructed based on InceptionResNetV2 (images at 299 × 299 resolution) using 4306 images made up of 3994 clinical and 312 web images. The AI model was trained with a randomly selected 80% of the slit-lamp images and validated using the remaining 20%, which came from patients with different IDs (group K-fold validation). The accuracy calculated by group K-fold indicated the accuracy for samples that were not used for the model construction. Different images of the same eye were not used for the calculation of the accuracy.
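The key constraint of group K-fold validation, that all images of one patient (and hence one eye) land in a single fold, can be sketched with a simple greedy grouping. This is an illustrative sketch, not the authors' implementation; scikit-learn's `GroupKFold` provides an equivalent off-the-shelf splitter.

```python
from collections import defaultdict

def group_k_fold(image_ids, patient_ids, k=5):
    """Split image indices into k folds so that all images of one patient
    fall into exactly one fold, preventing leakage between training and
    validation images of the same eye."""
    by_patient = defaultdict(list)
    for idx, pid in zip(image_ids, patient_ids):
        by_patient[pid].append(idx)
    folds = [[] for _ in range(k)]
    # Greedy balancing: assign each patient's images to the currently
    # smallest fold, largest patients first.
    for pid, imgs in sorted(by_patient.items(), key=lambda kv: -len(kv[1])):
        smallest = min(folds, key=len)
        smallest.extend(imgs)
    return folds
```

Each fold then serves once as the held-out validation set while the remaining folds are used for training.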

The learning process of the DL algorithm was facilitated by balanced data. To compensate for the imbalance in the number of images in the four categories, a hierarchical CNN model was constructed. An overview of the approach is shown in Fig. 1. The first CNN classifier estimated whether an image was from an eye with a 'bacterial' or 'non-bacterial' infection. Then, the second classifier estimated the probability scores of 'acanthamoeba', 'fungal', or 'HSV' for the images classified as 'non-bacterial' (Fig. 1a).

When the first classifier's answer was 'bacterial', the image was also directly connected to the second classifier to obtain the predicted probability scores of acanthamoeba, fungi, and HSV, to be used as feature values for the second-step classifier (Fig. 1c).
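One plausible way to view the two hierarchical classifiers as a single four-category scorer is to let the first classifier allocate the bacterial probability mass and the second classifier distribute the rest. The exact combination in this study is learned by the GBDT rather than fixed by a formula, so the function below (with the hypothetical name `two_step_scores`) is illustrative only.

```python
def two_step_scores(p_bacterial, second_probs):
    """Combine the first (bacterial vs non-bacterial) classifier with the
    second (aca/fun/her) classifier into four-category probability scores.

    p_bacterial: probability from the first classifier that the image shows
    bacterial keratitis.
    second_probs: dict {'aca': ..., 'fun': ..., 'her': ...} from the second
    classifier, summing to 1.
    """
    scores = {"bac": p_bacterial}
    for name, p in second_probs.items():
        # The non-bacterial probability mass is distributed according to
        # the second classifier's output.
        scores[name] = (1.0 - p_bacterial) * p
    return scores
```

By construction the four scores sum to 1, so they behave like the multiclass probabilities shown as confidences in Fig. 2.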

To calculate the probability score of each pathogen, the activation function, softmax, was trained using Ring loss for feature normalization, which constrained the norm of the deep feature vectors.
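Ring loss adds a penalty that pulls the L2 norm of each deep feature vector toward a learned target norm R, and this term is added to the softmax cross-entropy during training. A minimal numpy sketch of the penalty term, with an assumed loss weight `lam`:

```python
import numpy as np

def ring_loss(features, R, lam=0.01):
    """Ring loss penalty: (lam / 2) * mean((||f_i||_2 - R)^2).

    features: (batch, dim) deep feature vectors before the softmax layer.
    R: target norm (a learnable scalar during training).
    lam: weight balancing Ring loss against softmax cross-entropy
    (the default here is an assumption for illustration).
    """
    norms = np.linalg.norm(features, axis=1)
    return lam / 2.0 * np.mean((norms - R) ** 2)
```

The penalty is zero when every feature vector already lies on the "ring" of radius R, which is what makes the resulting probability scores comparable across images of differing quality.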

Using the probability score of each pathogen, the argmax of the pathogen probability scores for the two-step classification, the second-step classification, and the second-step classification of fluorescein-stained images were calculated as decision-level feature values.

For the machine learning algorithm, a GBDT (https://xgboost.readthedocs.io/en/latest/#) was trained using all of these feature values in batches of up to 4 serial images with different angles, illuminations, or staining (Fig. 1c).

The universality of the images and robustness to low-resolution images are another important issue in determining the applicability of the DL algorithm. Therefore, we searched for infectious keratitis images of the four causative categories by an internet search of publications using keywords (Supplementary Table 2). The resultant 312 images were also used for the development of the algorithm22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103 (Supplementary Table 3).

Image evaluations by clinicians and validation of deep learning (DL) algorithm

The sequential CNN algorithm (Fig. 1a) was assessed for its performance on 1426 images collected before March 14, 2019. To understand the diagnostic difficulties of processing the images, the application software “KeratiTest” was created.

KeratiTest

KeratiTest used 140 single test images that were not used for training or validation of the algorithm. KeratiTest presented 20 randomly selected photographic images to the application users (clinicians) and prompted an answer for each single image, which was obtained by either slit or diffuser illumination, or cobalt blue illumination for fluorescein staining, but not a combination of them (Fig. 1b). When the 20 random sequential images had been answered by a user, KeratiTest summarized the accuracy score and the time required to answer, and compared the performance of the human to that of the AI for that set of images. Thus, in KeratiTest, clinicians played against the algorithm and competed for the diagnostic accuracy of each image.

Statistical analyses

To assess the diagnostic performance, receiver operating characteristic analysis was used to calculate the area under the curve (AUC) with 95% confidence intervals.
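The AUC reported throughout is equivalent to the normalized Mann-Whitney U statistic: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counting one half. The study used standard ROC software; the function below is only an illustrative sketch of the quantity being computed.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation.

    labels: 1 for the target pathogen category, 0 otherwise.
    scores: the corresponding pathogen probability scores.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    # Count positive-negative pairs ranked correctly; ties count 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Confidence intervals such as those in Fig. 3 are commonly obtained by bootstrap resampling of the test cases and recomputing this statistic on each resample.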