Introduction

Myopia is increasingly prevalent worldwide and has become a serious public health challenge. Myopia (≤ –0.5 D) currently affects approximately 2 billion people worldwide, and this number is predicted to reach 4.76 billion (49.8% of the world population) by 2050 [1]. At the same time, high myopia (≤ –6.0 D) has become increasingly prevalent in recent decades, especially in Asian countries, and is developing at an earlier age [2]. In the International Classification of Diseases 11th Revision (ICD-11), pathologic myopia (PM), a major cause of severe visual impairment, is defined as a special category of myopia associated with excessive axial elongation that leads to structural changes in the posterior segment of the eye, such as posterior staphyloma and myopic macular degeneration (MMD), and to loss of best corrected visual acuity [3]. PM has been estimated to affect 3% of the global population and will lead to substantial productivity loss and a high economic burden on the healthcare system [4]. Consequently, it is necessary to identify PM eyes in a timely manner and prevent the progression of visual impairment.

Recent studies have demonstrated the high accuracy, sensitivity (SEN) and specificity (SPE) of artificial intelligence (AI) systems, especially those from the subfield of deep learning (DL), integrated into ophthalmic imaging. Multiple successful algorithms have been developed for screening and assisted diagnosis of diabetic retinopathy (DR), glaucoma, age-related macular degeneration (AMD) and myopia [5]. The current applications of AI in myopia cover a variety of aspects, including the diagnosis and classification of PM, prediction of progression and guidance of refractive surgery; meanwhile, the imaging modalities on which the algorithms are developed have expanded from fundus images to optical coherence tomography (OCT) images.

There are still relatively large variations among studies in development procedures, databases, sample sources and many aspects of methodology. Thus, a detailed assessment of AI performance is needed to quantify the overall accuracy and generalizability and to identify the confounding factors of the findings. Recently, meta-analyses and systematic reviews of the diagnostic performance of AI in detecting AMD, glaucoma and diabetic macular oedema (DMO) have been published, whereas there is still no comprehensive investigation of the performance of AI for the detection of PM [6,7,8].

Aims of the study

We conducted this meta-analysis and systematic review to evaluate the overall performance of AI-based models in detecting PM and PM-related choroidal neovascularization (CNV) based on fundus and OCT images, to explore the underlying factors affecting the accuracy and acceptability of the algorithms, and to discuss the limitations and future steps of AI applications in PM.

Methods

The protocol for this systematic review was registered in PROSPERO (CRD42022379136) and this review was conducted according to the PRISMA statement recommendations.

Search strategy and selection criteria

We searched PubMed, Scopus, Embase, Web of Science and IEEE Xplore for eligible studies published up to Dec 20, 2022, using the combination of search terms associated with PM (e.g., myopia, high myopia and pathologic myopia) crossed with search terms associated with AI (e.g., artificial intelligence, machine learning and deep learning) in the full text. Full search terms were listed in online Supplementary Appendix 1. We also searched the reference lists of included literature to identify potentially eligible studies. The language was limited to English.

Two researchers (HL, JRZ) independently screened the titles and abstracts for eligible literature according to the selection criteria. After duplicates were removed, the eligible studies were further selected by full-text review. The inclusion criteria were as follows: (1) journal articles or conference papers reporting the primary outcome of the performance of the AI algorithm in the detection of patients with PM; (2) the definition or reference standard for PM was clearly stated; (3) a clear description of the procedure for developing the algorithms and detailed information about the database were reported; (4) the data or evaluation indices needed to calculate the absolute numbers of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN) were reported, such as SEN, SPE, accuracy and area under the receiver operating characteristic curve (AUC).

The exclusion criteria were as follows: (1) publication forms of case reports, reviews, comments, letters and editorials; unpublished or ongoing research; (2) studies that detected PM based on imaging methods other than fundus or OCT images; (3) studies that did not report necessary data of the primary outcome.

Risk of bias assessment and data extraction

Quality assessment of eligible articles was performed by two reviewers (YZ, HL) independently using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool. Any disagreement was resolved by discussion with a senior researcher (XBY) for consensus. The QUADAS-2 tool consists of 4 domains of assessment: patient selection, index test, reference standard, and flow and timing [9]. All included studies were evaluated for the risk of bias in all 4 domains and for applicability in the first 3 domains. The risk of each study was classified as low, high or unclear risk of bias, and studies with a high risk of bias or low quality were excluded from our study.

Data were extracted in all eligible full-text studies by two reviewers (YZ, JNW) independently, and the following data were collected if available: the first author; country; publication year; characteristics of datasets (dataset type; total number of images; imaging modality); characteristics of algorithms (types of algorithms, outcome of classification); evaluation indices of the algorithm accuracy derived from internal or external validation datasets, including SEN, SPE, accuracy, AUC, positive likelihood ratio (PLR), negative likelihood ratio (NLR) and diagnostic odds ratio (DOR). The results from different validation datasets in the same study were considered independent data. If the data of evaluation indices were insufficient to calculate the 2-by-2 table for the outcomes of validation, the study was not included in the meta-analysis but only for literature review.
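The back-calculation of a 2-by-2 table from reported indices can be illustrated with a minimal Python sketch. It assumes that SEN, SPE and the numbers of PM and non-PM images in a validation set are available; the function names, the 0.5 continuity correction and the example counts are illustrative assumptions and are not taken from the included studies or from the software used in this review.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    tp: int
    fp: int
    fn: int
    tn: int


def reconstruct_two_by_two(sen, spe, n_diseased, n_healthy):
    """Back-calculate a 2-by-2 table from SEN, SPE and group sizes (rounded to whole images)."""
    tp = round(sen * n_diseased)
    tn = round(spe * n_healthy)
    return TwoByTwo(tp=tp, fp=n_healthy - tn, fn=n_diseased - tp, tn=tn)


def diagnostic_indices(t, correction=0.5):
    """Return SEN, SPE, PLR, NLR and DOR; a continuity correction is added if any cell is zero."""
    tp, fp, fn, tn = t.tp, t.fp, t.fn, t.tn
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (x + correction for x in (tp, fp, fn, tn))
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    plr = sen / (1 - spe)        # positive likelihood ratio
    nlr = (1 - sen) / spe        # negative likelihood ratio
    return {"SEN": sen, "SPE": spe, "PLR": plr, "NLR": nlr, "DOR": plr / nlr}


# Hypothetical validation set: 200 PM images and 800 non-PM images.
table = reconstruct_two_by_two(sen=0.95, spe=0.97, n_diseased=200, n_healthy=800)
indices = diagnostic_indices(table)
```

When only accuracy or AUC is reported without group sizes, the four cells cannot be recovered in this way, which corresponds to the situation in which a study was retained for the literature review only.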

We defined PM as eyes with maculopathy based on fundus images not less than category two or with “plus” features according to Meta-analysis for Pathologic Myopia (META-PM) study classification [3, 10]. Another system considering the three most crucial myopic lesions was the atrophy, traction, and neovascularization (ATN) grading system based on OCT images (Supplementary Appendix 2) [11]. In particular, as choroidal neovascularization (CNV) is a leading cause of vision impairment of PM and should receive timely referrals, we further evaluated the performance of the included algorithms in the detection of CNV in PM eyes (if available).

Statistical analysis

We used the RevMan 5.3 platform (Cochrane Collaboration, Denmark) to conduct quality assessment for all included studies. Next, Stata version 17.0 MP (StataCorp) was used to perform all analyses, and a 2-tailed P < 0.05 was considered statistically significant. We applied random-effects models to combine the included studies. Pooled quantitative analysis was performed for the indicators of diagnostic performance, including SEN, SPE, PLR, NLR and DOR, with results shown in forest plots with 95% confidence intervals (CIs). The I² statistic was used to assess heterogeneity among studies (25–49%: low heterogeneity; 50–74%: moderate heterogeneity; and 75% or more: high heterogeneity). To account for threshold effects, we adopted a hierarchical summary receiver operating characteristic (HSROC) model to assess the relationship between SEN and SPE and plotted the summary receiver operating characteristic (SROC) curves with 95% confidence and prediction regions.
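To make the random-effects pooling and the I² statistic concrete, the following Python sketch pools sensitivities on the logit scale with the DerSimonian-Laird estimator. This is a simplified univariate illustration under stated assumptions, not the HSROC model fitted in Stata for this review: the choice of estimator, the 0.5 continuity correction, the function names and the four example datasets are all hypothetical.

```python
import math


def logit_and_variance(events, total):
    """Logit-transformed proportion and its approximate variance (0.5 added to each cell)."""
    a = events + 0.5
    b = total - events + 0.5
    return math.log(a / b), 1 / a + 1 / b


def random_effects_pool(estimates, variances):
    """DerSimonian-Laird pooling; returns the pooled estimate, its standard error and I² (%)."""
    w = [1 / v for v in variances]
    fixed = sum(wi * y for wi, y in zip(w, estimates)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, estimates))  # Cochran's Q
    df = len(estimates) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                                  # between-study variance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * y for wi, y in zip(w_star, estimates)) / sum(w_star)
    return pooled, math.sqrt(1 / sum(w_star)), i2


def inv_logit(x):
    return 1 / (1 + math.exp(-x))


# Hypothetical (TP, TP + FN) pairs from four validation datasets.
studies = [(190, 200), (88, 100), (470, 480), (140, 160)]
ys, vs = zip(*(logit_and_variance(tp, n) for tp, n in studies))
pooled_logit, se, i2 = random_effects_pool(ys, vs)
pooled_sen = inv_logit(pooled_logit)
ci = (inv_logit(pooled_logit - 1.96 * se), inv_logit(pooled_logit + 1.96 * se))
```

Pooling SEN and SPE separately in this way ignores their correlation across thresholds, which is the reason a hierarchical model is preferred for the summary estimates.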

Meta-regression was performed to investigate the reasons for the heterogeneity among studies. For subgroup analysis, the following covariates were considered: research regions (developing countries and developed countries); types of validation datasets (internal and external validation datasets); imaging modalities (fundus and OCT images); types of datasets (public and hospital datasets); and total number of images (<5000 and ≥5000). The direction of the subgroup effects was not specified a priori because the hypothesized direction was unclear. Furthermore, we conducted a sensitivity analysis to estimate the robustness and reliability of our analysis and assessed publication bias with Deeks' funnel plot. The sensitivity analysis repeated the primary meta-analysis after excluding studies with a high or unclear risk of bias.
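The funnel plot asymmetry test attributed to Deeks regresses the log diagnostic odds ratio on the inverse square root of the effective sample size (ESS), weighted by ESS. The Python sketch below shows that regression under stated assumptions: the function name, the 0.5 continuity correction and the five example 2-by-2 tables are hypothetical, and the P value (from a t distribution with k − 2 degrees of freedom) is left to a statistics package.

```python
import math


def deeks_regression(tables):
    """Weighted regression of ln(DOR) on 1/sqrt(ESS); tables are (TP, FP, FN, TN) tuples."""
    xs, ys, ws = [], [], []
    for tp, fp, fn, tn in tables:
        # A 0.5 continuity correction keeps ln(DOR) finite when a cell is zero.
        a, b, c, d = (v + 0.5 for v in (tp, fp, fn, tn))
        n_dis, n_non = tp + fn, fp + tn
        ess = 4 * n_dis * n_non / (n_dis + n_non)  # effective sample size
        xs.append(1 / math.sqrt(ess))
        ys.append(math.log(a * d / (b * c)))
        ws.append(ess)
    sw = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / sw
    y_bar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum(w * (y - intercept - slope * x) ** 2 for w, x, y in zip(ws, xs, ys))
    se_slope = math.sqrt(rss / (len(xs) - 2) / sxx)
    return slope, intercept, slope / se_slope  # last value is the t statistic of the slope


# Hypothetical 2-by-2 tables (TP, FP, FN, TN) from five validation datasets.
tables = [(190, 24, 10, 776), (88, 9, 12, 291), (470, 15, 10, 985),
          (140, 30, 20, 570), (45, 6, 5, 144)]
slope, intercept, t_stat = deeks_regression(tables)
```

A slope that differs significantly from zero suggests small-study effects, which are commonly interpreted as possible publication bias.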

Results

Search results and study characteristics

Initially, our literature search identified 1036 studies, and 587 studies were screened after the removal of duplicated records. Figure 1 shows the flowchart of the literature eligibility process. Finally, 22 studies were included for systematic review [12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33], and 14 of them were included for quantitative meta-analysis [12, 13, 16, 19,20,21,22,23, 26, 27, 29,30,31, 33].

Fig. 1: PRISMA 2020 flow diagram of study selection.

Flow diagram showing the study selection process and reasons for exclusion.

The characteristics of all eligible studies are summarized in Table 1. In total, 348,861 fundus images and 22,560 OCT images were used for training, testing and validation. Across the included studies, SEN and SPE for PM detection ranged from 80.0% to 98.7% and from 79.5% to 100.0%, respectively. Two categories (PM and non-PM) were output as the primary outcome in 14 studies (63.6%); 5 categories (META-PM) of PM were output in 4 studies (18.2%); and 3 categories (ATN) of PM were output in 1 study (4.5%). In addition, 6 studies (27.3%) identified specific PM-related lesions (CNV, myopic traction maculopathy, retinal detachment, etc.).

Table 1 Characteristics of the included studies for review.

Most studies (n = 20, 90.9%) applied convolutional neural networks (CNNs) to develop algorithms, of which 12 used ResNet. One study used a support vector machine (SVM) and 1 study used AdaBoost. Sixteen studies (72.7%) obtained images from hospitals and 6 studies (27.3%) from public databases, of which the PathologicAL Myopia (PALM) database was the most frequently adopted (n = 4, 18.2%).

Risk of bias assessment and publication bias

We assessed the quality of all included studies using the QUADAS-2 tool, and the results are presented in Supplementary Appendix 3. Seven studies (31.8%) were graded as having a low risk of bias in all 4 domains [16, 18, 20, 26,27,28, 33]. Twelve studies (54.5%) had at least one domain with applicability concerns [14, 15, 17, 19, 21, 22, 24, 25, 29,30,31,32]. For patient selection, 12 studies (54.5%) were graded as having an unclear risk of bias because of the lack of a clear description of public datasets, and 12 studies (54.5%) had unclear applicability concerns because information on dataset composition was unavailable. For the index test, most studies (n = 16, 72.7%) had a low risk of bias and low applicability concerns, and only 6 studies (27.3%) were graded as having an unclear risk of bias because of possible data overlap among datasets. For the reference standard, the risk of bias and applicability concerns were low in all included studies. Finally, for the flow and timing domain, 8 studies (36.4%) had an unclear risk of bias because the construction procedure of the public datasets was unclear. No publication bias was detected by Deeks' funnel plot asymmetry test (P = 0.10), as shown in Supplementary Appendix 4. Furthermore, the credibility of the effect modification analyses in our meta-analysis was assessed with the Instrument for assessing the Credibility of Effect Modification Analyses (ICEMAN) and rated as low, as shown in Supplementary Appendix 5.

Meta-analysis for the performance of AI in PM and PM-CNV detection

For the detection of PM, the forest plots of SEN, SPE and 95% CIs for the included studies are shown in Fig. 2A, B [13, 16, 19, 20, 22, 23, 27, 29,30,31]. Using the HSROC model, we obtained the SROC curve with a 95% confidence region and prediction region (Fig. 2C). The summary AUC was 0.99 (95% CI: 0.97 to 0.99), and the pooled SEN, SPE, PLR, NLR, and DOR were 0.95 (95% CI: 0.92 to 0.96), 0.97 (95% CI: 0.94 to 0.98), 28.1 (95% CI: 15.8 to 50.2), 0.06 (95% CI: 0.04 to 0.08), and 495 (95% CI: 243 to 1008), respectively. For the detection of PM-CNV, the forest plots for the included studies and the SROC curve plot are shown in Fig. 3 [12, 13, 21, 26, 33]. The summary AUC was 0.99 (95% CI: 0.97 to 0.99), and the pooled SEN, SPE, PLR, NLR, and DOR were 0.94 (95% CI: 0.90 to 0.97), 0.96 (95% CI: 0.94 to 0.98), 25.9 (95% CI: 16.1 to 41.7), 0.06 (95% CI: 0.03 to 0.10), and 435 (95% CI: 220 to 860), respectively.

Fig. 2: Forest plots and summary receiver operating characteristic (SROC) curves for the performance of artificial intelligence for the detection of pathologic myopia.

A The pooled sensitivity was 0.95 (95% CI: 0.92 to 0.96). B The pooled specificity was 0.97 (95% CI: 0.94 to 0.98). C The pooled area under the summary receiver operating characteristic (SROC) curve was 0.99 (95% CI: 0.97 to 0.99).

Fig. 3: Forest plots and summary receiver operating characteristics (SROC) curve for the performance of artificial intelligence for the detection of pathologic myopia-related choroidal neovascularization.

A The pooled sensitivity was 0.94 (95% CI: 0.90 to 0.97). B The pooled specificity was 0.96 (95% CI: 0.94 to 0.98). C The pooled area under the summary receiver operating characteristic (SROC) curve was 0.99 (95% CI: 0.97 to 0.99).

Heterogeneity analysis and meta-regression analysis

Since substantial heterogeneity (I² > 50%) was found in our forest plots when assessing the SEN and SPE for the detection of PM, we performed meta-regression to explore the potential reasons for heterogeneity. In this analysis, the DOR was not significantly associated with any of the following factors: research regions (P = 0.15); types of validation datasets (P = 0.23); imaging modalities (P = 0.78); types of datasets (P = 0.36); and total number of images (P = 0.07).

Subgroup analysis

The results of subgroup analysis are summarized in Table 2. We found that imaging modality and data source made no significant contribution to the diagnostic performance. For different types of validation datasets, performance was better in internal datasets (SEN = 0.95, 95% CI: 0.94–0.96; SPE = 0.97, 95% CI: 0.96–0.99; AUC = 0.99, 95% CI: 0.97–1.00) than in external datasets (SEN = 0.93, 95% CI: 0.92–0.95; SPE = 0.96, 95% CI: 0.94–0.97; AUC = 0.99, 95% CI: 0.98–0.99). For research regions, performance was better in developed countries (SEN = 0.96, 95% CI: 0.93–0.98; SPE = 0.98, 95% CI: 0.97–0.99; AUC = 0.99, 95% CI: 0.97–0.99) than in developing countries (SEN = 0.94, 95% CI: 0.90–0.95; SPE = 0.96, 95% CI: 0.93–0.98; AUC = 0.98, 95% CI: 0.97–0.99). For the total number of images, performance was better in datasets with ≥5000 images (SEN = 0.96, 95% CI: 0.95–0.98; SPE = 0.97, 95% CI: 0.96–0.99; AUC = 0.99, 95% CI: 0.97–0.99) than in those with <5000 images (SEN = 0.93, 95% CI: 0.91–0.95; SPE = 0.96, 95% CI: 0.94–0.98; AUC = 0.98, 95% CI: 0.98–0.99).

Table 2 Subgroup analysis for the performance of AI algorithms for the detection of PM.

Sensitivity analysis

The sensitivity analysis repeated the primary meta-analysis after excluding 5 studies without sufficient information about the division of datasets or in-depth details of clinical data sources [19, 22, 29,30,31]. The pooled SEN was then 0.94 (95% CI: 0.90 to 0.97), and the pooled SPE was 0.96 (95% CI: 0.95 to 0.98) for the detection of PM. These results were similar to our main findings; hence, there was no evidence that our main outcome was influenced by which studies were included.

Discussion

We compared and analysed the results and characteristics of published studies and addressed the lack of a meta-analysis on the application of AI in PM. Our review suggests that AI technology has the potential to benefit the detection and management of PM patients in real-world settings, as it does for other eye diseases. CNV is estimated to occur in approximately 5–11% of eyes with high myopia, and early detection of and intervention for high-risk lesions in PM patients are necessary to prevent progression [34]. In our review, AI models based on either fundus or OCT images achieved acceptable accuracy in the detection of CNV. According to the published literature, junior to senior retinal specialists can achieve an accuracy of 0.999 for the detection of PM [35]. Despite its relatively lower accuracy compared with traditional clinical examinations, AI can maximize detection rates using a convenient method. Apart from CNV, several published algorithms can also identify complications in PM patients, for example, through the extraction and segmentation of peripapillary atrophy, automatic quantitative analysis of fundus tessellation, and automatic segmentation and measurement of the choroid layer [24, 36, 37]. These advances can help efficiently quantify large amounts of data and assist in detecting subtle differences that are difficult for ophthalmologists to identify.

In the subgroup meta-analysis, there was no significant relationship between diagnostic accuracy and imaging modality or data source, whereas there were significant relationships between diagnostic performance and the scale of the databases, the type of validation set and the countries where the studies were conducted. Compared with fundus images, OCT can help detect more characteristics, such as macular schisis and dome-shaped macula. With more studies included in the future, it would be more meaningful to compare the performance of AI based on fundus and OCT imaging in detail. We also believe that diagnostic performance would be further improved by combining AI technology with advanced imaging modalities such as ultra-widefield fundus images or swept-source OCT angiography (OCTA).

In addition, AI algorithms demonstrated better performance in internal validation datasets than in external datasets. Such a lack of reliability suggests that it is necessary to improve generalization and robustness under different environments through a variety of methods, such as training and testing the model widely in different populations or on different devices [38]. Another point of interest is that studies in developed countries showed higher accuracy than those in developing countries. This might be related to insufficient capacity to conduct high-quality studies in low- and middle-income countries (LMICs). However, the Southeast Asia, South Asia and East Asia regions bear the greatest potential burden, as a proportion of the economy, associated with visual impairment resulting from uncorrected myopia and MMD globally [39]. More importantly, it might be difficult for healthcare systems in these countries to cope with a relatively greater burden, especially during the COVID-19 pandemic period. At the same time, detailed health economic evaluation of the application of AI-assisted models in real-world settings is required to identify the priorities and strategies for implementation.

More studies with high quality are necessary to enhance reliability to unleash greater potential in real-world settings. First, we can incorporate the data from multimodality images into future AI systems to build a screening system that can detect more dimensional characteristics in PM patients. Next, the establishment of longitudinal medical records for patients can help explore morphological characteristic parameters closely related to the progression of PM. Predicting potential risk of developing PM from school-aged myopia can provide evidence for precise individualized interventions. Additionally, the algorithms developed with the integration of more information, such as genomic readouts and metabolomics from patients, will increase the diagnostic or predictive power.

Our meta-analysis and review have several limitations. First, our study only confirmed the diagnostic power of AI in the detection of PM, and it is still unknown whether AI algorithms perform well overall in grading PM according to different classification systems. Second, there was high heterogeneity among the included studies due to the varying study designs, imaging modalities, algorithm characteristics and threshold effects. Third, the definition of pathologic myopia remains controversial: the META-PM classification is based on fundus images only, and the ATN classification combines fundus images and OCT, whereas both ignore the existence of posterior staphyloma. Fourth, some included studies were published in AI or computer science journals, and few clinical details were reported. Thus, there were unknown risks of bias in the selection of patients and the patient flow. Moreover, it has been reported that the QUADAS-2 tool might underestimate the risk of bias of the included studies [9]. Fifth, some included studies used the same public database (PALM database), so there were overlapping data in our pooled meta-analysis, which may have influenced the pooled estimates. Sixth, as in many AI-based studies, especially those using large image databases, the sampling mechanisms were unclear, and many diagnostic studies were case-control studies, meaning that diseased and non-diseased subjects were recruited based on different criteria. Last, current DL algorithms lack interpretability for their detection outcomes, which is called the “black box phenomenon”. Improved interpretability will help ophthalmologists identify probable structural features related to better diagnostic performance.

In conclusion, our review demonstrated the excellent performance of current AI algorithms in detecting PM patients based on fundus and OCT images, and AI-assisted automated screening systems are promising for ameliorating increasing demands in clinical settings. To the best of our knowledge, this was the first published meta-analysis for the assessment of AI algorithms applied in PM and PM-related CNV quantitatively. Nevertheless, to provide substantial benefits in regular clinical practice under different conditions, we still need to conduct continuous innovative research with newly developed algorithms and larger-scale databases.

Summary

What was known before

  • Current applications of artificial intelligence (AI) in ophthalmic diseases have covered a variety of aspects with good performance, including the diagnosis and classification of pathologic myopia (PM).

  • There are still relatively large variations among studies in development procedures, databases, sample sources, and many aspects of methodology.

What this study adds

  • Our study demonstrated the excellent performance of current AI algorithms in detecting PM patients based on fundus and OCT images, and this was the first published meta-analysis for the assessment of AI algorithms applied in PM quantitatively.

  • For the detection of PM, the summary area under the receiver operating characteristic curve (AUC) was 0.99 (95% confidence interval (CI): 0.97 to 0.99), and the pooled sensitivity and specificity were 0.95 (95% CI: 0.92 to 0.96) and 0.97 (95% CI: 0.94 to 0.98), respectively.

  • It provides crucial evidence for the application of AI-assisted automated screening systems to ameliorate increasing demands in the healthcare system.