Introduction

Myopia is the most common refractive error and a leading cause of reversible visual impairment and blindness worldwide [1, 2]. The “myopia boom” has raised significant international concern in the twenty-first century. The prevalence of myopia keeps increasing in recent years, especially in East Asia where 80–90% of 18-year olds are myopic, and 10–20% are highly myopic [3, 4]. Based on current trends, it is estimated that by 2050, there will be 4758 million people with myopia and 938 million people with high myopia globally [4]. Prevention of myopia onset and progression is critical, since myopia poses an ever-present threat to the quality of life, and high myopia can be further complicated by a number of vision-compromising diseases, including myopic maculopathy and glaucoma [5, 6]. The good news is that various interventions have been proposed to control myopia progression effectively [7,8,9]. Spectacles are the most widely used method for the correction of myopia [10]. Many newly introduced contact lenses [11] and spectacles [10, 12, 13] also appear to be very effective in slowing progression. Other commonly used effective myopia control methods include low-dose atropine eye drops and orthokeratology [14, 15]. Unlike increased outdoor time which should be encouraged for everyone, the above-mentioned medical interventions differ in expenses, effects and adverse effects, and clinical decision of treatment options should be based on an individual basis [13, 16]. Accurate myopia prediction could help identification of high-risk children for more timely and effective intervention, slowing the pace of myopia onset or progression and leading to improved visual outcome and quality of life [17].

Our understanding of myopia aetiology has been advanced with growing evidence reporting risk factors for myopia onset or progression, including age, gender, parental myopia, susceptible genes, and outdoor activities [7, 18,19,20]. On this basis, researchers had developed various types of prediction models to foreshadow the risk of myopia or high myopia in different populations [21,22,23,24]. Despite the growing interests in this area, no model can be considered for wide application in clinical practice at present. Existing models differ in definitions of myopia, subsets of predictors and statistical prediction methods, and with varying prediction performance [21, 25,26,27]. This study aims to provide an evidence-based update on existing myopia prediction models and discuss future challenges.

Literature search

We conducted a systematic search of all published articles related to myopia prediction model published between January 1, 1990, and February 1, 2021, by searching the online databases, including PubMed, Embase, and Google Scholar. The search terms contained (“myopia” or “high myopia” or “spherical equivalent” or “spherical equivalence” or “refractive error” or “refraction”) and (“predict” or “prediction” or “predictor” or “predictive”). Detailed search terms could be found in the supplementary file. Published studies were included if they were prospective observational studies conducted on humans and reported the use of a certain method to predict the future myopia risk, including myopia onset, myopia progression, and specific spherical equivalence (SE). Only full-text studies published in English were included. Unpublished studies and meeting abstracts were not included due to uncertainly of methodological quality. Studies evaluating refraction prediction after treatments, including orthokeratology, atropine eyedrop and cataract surgery, were further excluded. A total of 3581 articles were identified in the initial search, and titles of the articles were screened by XH and YC, independently. After excluding duplicates papers and those that did not meet the inclusion criteria, a total of 25 full-text articles were subsequently screened by XH and YC. After full-text review, an additional six articles were excluded: one that constructed a risk score system for myopia symptom based on cross-sectional data [28], six estimated the current myopia status or SE instead of future myopia risk [24, 29,30,31,32,33], and one that used a myopia growth chart to estimate myopia progression but no statistical prediction model was established [23]. In a nutshell, this review was based on 17 core papers that utilized different data types for predicting future myopia risk (Table 1).

Table 1 A summary of prediction models for myopia.

Myopia prediction workflow

To develop a myopia prediction model, there are five things to consider: (1) outcome definition; (2) data acquisition; (3) predictor selection, (4) model development, and (5) model evaluation (Fig. 1).

Fig. 1: The workflow for myopia prediction.
figure 1

These five steps indicate the general workflow to develop a myopia prediction model.

The first step in myopia prediction is to properly define the outcome, which could be the incidence of myopia/high myopia onset or progression over a certain period of time or the specific SE value. In previous studies, the most commonly used definition was SE ≤ −0.75D in the right eye for myopia, and ≤−6D in the right eye for high myopia [22, 34, 35]. Future studies should follow the definition of myopia (SE ≤ −0.50D when ocular accommodation is relaxed) and high myopia (SE ≤ −6D when ocular accommodation is relaxed) in the recent international Myopia Institute report [36].

Data acquisition is the most crucial step as myopia prediction is essentially a data-driven problem. It is also a hypothesis-driven problem and researchers need to have a comprehensive understanding of existing literature on myopia prediction models to decide which data to include. Proper selection and definition of potential predictive factors before study initiation is the cornerstone for successful and effective data acquisition. Multiple data types could be used for myopia prediction, including socioeconomic factors, ocular biometry, lifestyle factors, genetic and imaging data. The vast majority of previous studies collected data for myopia prediction from large cohort studies [20, 22, 37], due to the accessibility of diverse structured data from long-term follow-up. Other available data sources include clinical data (e.g., electronic medical records) and publicly available datasets.

Many statistical methods could be used to select candidate predictors from the specific dataset. Most previous studies leveraged linear regression or logistic regression model (depending on whether the outcome is numeric type or categorical type) for correlation analysis, and chose factors with statistically significant associations with outcome (P < 0.05) as predictors for further analysis [27, 38]. Other studies also employed tree-based machine learning (ML) models to cope with more complex datasets with numerous factors, such as random forest and gradient boosting regression tree, which can provide the importance of each input factor [21, 26].

The fourth step is to develop a myopia prediction model based on the optimal subset of factors. In some cases, the regression model used for predictor selection can be directly applied as the prediction model or with mild adjustments. Other statistical models such as the discrete-time survival model and generalized estimating equations were also employed per the type of the expected outcome [39, 40]. Recently, with the increasing availability of medical data and computational power, growing attentions have been paid to ML and deep learning algorithms when facing high-dimensional and large-scale datasets [26, 41]. The model development is generally conducted in a ‘learning’ fashion, i.e., the model is driven by existing data to best fit the ground truth of the given outcome.

The final step is to evaluate the best-fit model. Receiver-operating characteristic (ROC) and area under the ROC curve (AUC) are the most commonly used metrics for assessing the discriminative ability of models in predicting the future presence of myopia [42]. Other metrics include sensitivity, specificity, accuracy, mean square/absolute error, R square and so on. Some studies also imposed resampling processes (such as bootstrapping) or cross-validation methods to gain higher statistical significance [43]. An efficient prediction model is expected to be able to accurately estimate myopia risk for not only the existing data but also unseen data, thus additional evaluations in external independent datasets are generally preferred to demonstrate the generalization performance.

Data-driven myopia prediction

When predicting myopia, it should be understood that data plays a crucial role in analyzing risk factors, identifying myopia incidents and modelling prediction problems. Driven by the increasingly available data sources, predicting myopia seems to be much more feasible than before. In the literature, myopia prediction can be generally classified into four categories from a data-driven perspective, including prediction based on baseline refraction or biometric data, prediction based on lifestyle data, prediction based on genetic data and prediction based on data integration, as shown in Fig. 2.

Fig. 2: A taxonomy of myopia prediction methods from a data-driven perspective.
figure 2

Myopia prediction can be generally classified into four categories from a data-driven perspective, and the contents shown in the colored blocks indicate related publications in the literature.

Prediction based on baseline refraction or biometric data

Baseline refraction and ocular biometry have been consistently reported as risk factors for myopia onset and progression [44]. These factors are relatively easy to collect and have been widely used in previous studies for myopia prediction.

The prediction of myopia based on baseline refraction or biometric data can be traced back to 1999, when Zadnik et al. tried to predict the onset of juvenile myopia based on 554 children who were not myopic at baseline from the Orinda Longitudinal Study of Myopia (OLSM) [34]. Myopia was defined as at least −0.75D in the right eye measured by cycloplegic autorefraction. They found that the best single predictor of future myopia onset was the spherical refractive error at baseline. The AUC for the mean sphere was 0.880, with a sensitivity of 86.7% and a specificity of 73.3%. Further combining of corneal power, Gullstrand lens power, and AL in the logistic model resulted in minor improvements (AUC = 0.893). Later in 2015, Zadnik performed a more comprehensive prediction model for myopia onset under the same definition based on the Collaborative Longitudinal Evaluation of Ethnicity and Refractive Error (CLEERE) Study [37]. Thirteen risk factors from 4512 ethnically diverse, nonmyopic school-aged children were assessed. Eight were associated with the onset of myopia, including SE at baseline, parental myopia, AL, corneal power, crystalline lens power, and the ratio of accommodative convergence to accommodation (AC/A ratio), horizontal/vertical astigmatism magnitude, and visual acuity (VA). Multiple prediction models were constructed based on different combinations of these eight factors, and backward stepwise selection and tenfold cross-validation were used for model comparison. These models achieved an AUC of 0.87–0.93 in the prediction of myopia onset. SE was also found to be the single best predictive factor, given that the AUC only decreased by 0.01 or 0.02 when the number of predictors reduced from all eight to SE only. Findings by Ma et al. [45] also supported that single baseline SE could provide effective prediction for future myopia risk. In this study of 1856 students from Shanghai, China, the AUC of baseline AL, AL/CR and SE to predict 4-year incident myopia (cycloplegic SE ≤ −0.5D) was 0.585, 0.740, and 0.839, respectively. Combining baseline SE, AL/CR, age, gender and parental myopia resulted in an AUC increment of only 0.022 compared with using baseline SE alone.

In addition, Jones et al. [46] used SE of first-grade students and parental myopia to predict myopia onset (cycloplegic SE < −0.75D) between the second and eighth grades based on the CLEERE Study. A total of 1854 nonmyopic first graders were included, and the sensitivity and specificity were 62.5% and 81.9%, respectively. They did not perform the ROC curve to evaluate the model performance. Zhang et al. [47] built a 3-year myopia onset prediction model using baseline data from 236 children in Xiamen, China, and further validated its performance in 1979 ethnically Chinese children from Singapore where the myopia prevalence was significantly higher. The model including gender, height, VA, AL, anterior chamber depth (ACD), lens thickness, vitreous chamber depth, and CC achieved an AUC of 0.974 in Xiamen and 0.815 in Singapore. It is noteworthy that this study conducted external validation for the myopia prediction model and proved that models based on one population could be potentially used for other populations with greater myopia prevalence. Matsumura et al. [48] found that year 1 myopia progression is the best predictor for subsequent 2-year myopia progression based on schoolchildren aged 7–9 years from the Singapore Cohort Of the Risk factors for Myopia (SCORM). Reported AUC for year 1 myopia progression, baseline SE, and age of myopia onset were 0.77, 0.70 and 0.66, respectively.

Centile curve was first used as a prediction tool in ophthalmic research by Chen et al. [49] where the outcome of interest was the onset of high myopia. This research group generated reference centile curves based on cycloplegic refraction data of 4218 children aged 5–15 years from the Guangzhou Refractive Error Study in Children (RESC) study and 354 first-born twins from the Guangzhou Twin Eye Study (GTES) [49]. The curves were used to predict the risk of developing high myopia (cycloplegic SE ≤ −6D) using the follow-up data in GTES. They found that the 5th centile showed the most effective diagnostic value with a sensitivity of 92.9%, a specificity of 97.9% and a positive predictive value of 65.0%. This method was adopted by another study in 2019. Diez et al. built percentile curves of AL from 12,554 children aged 6–15 years in Wuhan China, and used it to predict the likelihood of suffering high myopia (cycloplegic SE ≤ −5D) in 3 years in an independent subset of 226 children [25]. Reported AUC in this study ranged from 0.781 to 0.876 for children of different ages and genders.

The studies mentioned above suggested that baseline SE was the single best predictor for myopia onset, and preceding SE change was the best predictor for future myopia progression. In addition to statistical models, the centile curve is another more straightforward method for risk estimation on high myopia development. One might reckon that myopia or high myopia, as disease modalities, are all defined by SE. It would not be too surprising to identify SE collected some years ago as the best predictor, just as adult height could be accurately predicted based on height during childhood [50]. The challenge is how early and how accurate this prediction can be performed. Nevertheless, most previous studies included school-age children and lacked external validation, whether these findings could be applied to younger and older subjects or populations with different ethnicities needs further investigation.

Prediction based on lifestyle data

Lifestyle data, including more time spent on near work, less outdoor activity, and attending cram schools, were found to be risk factors for myopia [51,52,53]. Effect of adding lifestyle data on myopia prediction ability had also been evaluated in several previous studies. Given that lifestyle factors are modifiable, a better understanding of its role in myopia prediction would be very meaningful.

The benefit of adding lifestyle data into myopia prediction model was suggested in two studies. Jones et al. [35] reported that outdoor activities and parental myopia were important predictors of myopia based on the OLSM study (AUC = 0.75), which significantly improved the performance of traditional prediction model which included sphere, AL and corneal power (to AUC = 0.90) [35]. In Tideman et al. [54] included children aged 6–9 years from the Generation R study in Rotterdam, and environmental factors were found to be associated significantly with both increase in AL and incident myopia. They predicted myopia (cycloplegic SE ≤ −0.5D) based on seven parameters, including parental myopia, one or more books read per week, time spent reading, no participation in sports, non-European ethnicity, less time spent outdoors, and baseline AL/CR ratio, with an overall prediction accuracy of 0.78 (AUC) [54]. In addition, William et al. [55] analyzed 1991 twin participants from the longitudinal UK-based Twins Early Development Study [55]. They found that age, gender, maternal education, fertility treatment, summer birth and hours spent playing computer games were crucial predictors of myopia which in total could explain 4.4% of the variance in SE, with an AUC of 0.68.

However, several other studies failed to confirm the effect of lifestyle factors in myopia prediction. Studies by French et al. [20] reported that baseline SER (AUC = 0.84), time spent outdoors (AUC = 0.64), near work (AUC = 0.61), parental myopia (AUC = 0.65), and ethnicity (AUC = 0.67) were all significant predictors of incident myopia (cycloplegic SE ≤ −0.50D). Adding lifestyle factors to the basic model consisting of only baseline SE could only slightly improve the prediction power in the younger cohort (aged 6 at baseline), and had little effect in the older cohort (aged 12 at baseline). Similarly, using data from the SCORM study, Chua et al. [38] found that age of myopia onset alone could effectively predict high myopia (AUC = 0.85), while the addition of other factors including school and books per week only marginally improved the prediction power (to AUC = 0.87).

The limited effect of adding lifestyle data on myopia prediction model could be due to that many environmental effects had already been reflected in baseline SE or age at myopia onset. On the other hand, our lack of understanding of how lifestyle factors cause or exacerbate myopia imposes a limit on the performance of prediction models. Hence, the role of lifestyle factors in myopia prediction still warrants further study.

Prediction based on genetic data

During the last decades, genetic investigation had largely improved our understanding of the molecular mechanisms underlying myopia and impaired vision [19, 56]. The Consortium of Refractive Error and Myopia (CREAM) and 23andMe provide the largest combined genome-wide association studies for refractive error [57, 58]. However, these genes could only explain less than 10% of the inheritance variance of myopia. A recent study by Hysi et al. further combined the data from UK Biobank and the Genetic Epidemiology Research on Adult Health and Ageing, and identified 336 novel genetic loci associated with refractive error, which increased the explanation ability of myopia inheritance to 18.4% [33]. Despite the large quantity of genes identified, the role of genetic data in myopia prediction had been less investigated.

Several studies had used genetic data to estimate the current SE in adults [24, 33]. Ghorbani Mojarrad et al. [24] conducted a meta-analysis of three GWAS including a total of 711,984 individuals . A polygenic risk score (PRS) derived from GWAS data for refractive error was evaluated in 1516 adults aged 24–51 years from the Avon Longitudinal Study of Parents and Children (ALSPAC) mothers cohort. The PRS had an AUC of 0.67 for myopia (SE ≤ −0.75D), 0.75 for moderate myopia (SE ≤ −3D), and 0.73 for high myopia (SE ≤ −5D). In addition, a recently published study by Hysi et al. [33] performed a large meta-analysis of GWAS which involved 542,934 European participants. They reported that a combination of age, gender and 890 significant SNPs achieved an AUC of 0.67, 0.74 and 0.75 for myopia definition of ≤−0.75D, ≤−3.00D, and ≤−5D respectively.

Mojarrad et al. [27] did a retrospective analysis of non-cycloplegic autorefraction data of children aged 7 and 15 years from the ALSPAC birth cohort study. Genetic variants associated with refractive error from CREAM and 23andMe were used to calculate a genetic risk score (GRS). Three prediction models were built, with model A including age, gender, number of myopia parents (NMP), and model B, including age, gender, GRS, and model C including age, gender, NMP and GRS. Compared to model A and B, the combined model C could better estimate the current refractive error at both aged 7 years (R2 = 3.0% and 1.1% vs. 3.7%) and 15 years (R2 = 4.8% and 2.6% vs. 7.0%). In predicting incident myopia, inclusion of GRS in the Cox proportional hazard model also improved the model fit compared to using NMP alone (p < 0.0001).

Chen et al. [22] assessed the effect of adding genetic information to predict future myopia risk, based on cycloplegia data from 1063 first-born twins from the GTES. GRS was calculated based on the CREAM study, and five models were constructed and compared. Model 1 included age, age square, and gender; model 2 included age, age square, gender, parental SE, and outdoor and near work time; model 3 included age, age square, gender, GRS, and outdoor and near work time; model 4 included age, age square, gender, and parental SE; and model 5 included age, age square, gender, and GRS. Results showed that model 1 had achieved a good performance in predicting high myopia (SE ≤ −6D) at the age of 18 (AUC = 0.95), whereas adding GRS data did not significantly improve the model performance. This has been attributed to the low penetrance and small effect size of included SNPs in this study, and that the SNPs were taken [59] from GWAS studies based on Caucasian instead of Asian populations. This study further suggested that adding more follow-up visits data into the model could enhance the prediction performance, and for participants with baseline data only, the age of 13 appeared to be the earliest age threshold for high myopia prediction.

Though the rapidly increasing myopia prevalence in recent years has mainly been attributed to environmental factors, both nature (genetics and heredity) and nurture (environment and lifestyle) play a part in the myopia aetiology [10]. The critical advantage of genetic data over refraction in myopia prediction is that the genotype is fixed at conception, and genetic data could be combined with age, gender and other risk factors (e.g., parental SE) for myopia risk prediction at very young ages. Existing studies suggested a very limited added value of genetic data in myopia prediction, this could be due to the current limited understanding of myopia-related SNPs and gene–environmental interaction, or environment solely plays a dominant role. The genetic background for high myopia and extremely high myopia is stronger, and theoretically might benefit more from genetic risk prediction. Guggenheim et al. has provided a detailed review on the genetic prediction of myopia, they suggested that genetic prediction had the potential to outperform existing prediction models based on baseline SE or biometric data [18, 59, 60]. With the identification of more susceptible genes with larger effects from large-scale genetic studies in the future, more studies are necessary to enhance our understanding and ability of myopia prediction based on genetic data.

Prediction based on integrated data

With the wide application of electronic medical records and increasing affordability and convenience of clinical examinations, large-scale real-world medical dataset are accumulating in most hospitals. These datasets provide an extraordinary source for comprehensive clinical analysis and validation. Still, the challenge remains in how to effectively remove the “noises” and extract the “signals” from the integrated data.

Lin et al. [26] predicted myopia among Chinese school-aged children using refraction data from electronic medical records of 129,242 individuals from eight ophthalmic centres. Age, SE and annual progression rate were used to develop an ML algorithm to predict SE and high myopia onset (SE ≤ −6D) in the future 10 years. Their algorithm accurately predicted the presence of high myopia in internal validation of data from the Zhongshan Ophthalmic Center (AUC: 0.903-0.986 for 3 years, 0.875–0.901 for 5 years, and 0.852–0.888 for 8 years), external validation of data from the remaining seven centres (AUC: 0.874–0.976 for 3 years, 0.847–0.921 for 5 years, and 0.802–0.886 for 8 years), and multi-resource testing of data from two population-based cohorts (AUC: 0.752–0.869 for 4 years). The algorithm also achieved clinically acceptable prediction of the actual refraction values at future time points (MAE: 0.253–0.395 for 3 years, 0.394–0.496 for 5 years, and 0.503–0.799 for 8 years). Yang et al. [21] built a myopia prediction model based on 3112 primary school students from Henan, China through ML methods. A feature selection method was first used to construct a predictor subset from the 200 factors in the dataset for model training. The prediction model was built based on the SVM model, and achieved an AUC of 0.87 and an accuracy of 0.79 in predicting whether a sixth-grader would become myopic based on data from grades 1–5. They have also adopted a data transformation technique, which improved the accuracy to 0.93 and AUC to 0.98 in the final model.

These two studies suggested the potential of using a big integrated dataset in myopia prediction. With the emerging new dataset and advancing technique, integrated real-world big datasets are tended to be a rising topic in future medical research.

Challenges and future directions

Successful myopia prediction has an important reference value for effective myopia control in real-world clinical practice. While outdoor activity is encouraged for all children regardless of potential myopia risk, more frequent follow-ups and invasive interventions (e.g., Orthokeratology and atropine) should be reserved for those with increased myopia risk [7, 61]. Data regarding the effect of these inventions on children with different levels of future myopia risk is needed to guide better clinical management. It should also be noted that there is considerable heterogeneity in myopia pathogenesis, and currently there is no reported prediction model specifically for more rarer forms of myopia (e.g., associated with keratoconus or extreme lenticular myopia) which may progress very differently. Recognizing them is also one of the first steps towards better prediction and personalized medicine in relation to myopia. Looking upon existing myopia prediction methods, several challenges remain.

Predicting myopia incident and progression is ultimately a data-driven problem, and accurately forecasting future incidents relies on the availability of a representative, authentic and comprehensive dataset. Hence, how to establish large datasets with abundant and reliable predictors is challenging. The quality of ground truth determines the performance of prediction. Therefore, standardized methods should be used for data collection and myopia definition. Also, the representativeness of the population should be considered, and multi-ethnic population-based studies were preferable for model training and validation. On the other hand, the availability of multiple data sources is important for model construction and comparison. In addition to the above-mentioned data types, imaging data have been increasingly acknowledged as a valuable data source for medical analysis, especially with the rapid development of artificial intelligence [62]. In the last 4 years, multiple studies have proved the ability to successfully estimate current SE based on various sources of images, including fundus images, ocular appearance images, and photorefraction images [29, 30, 32]. Their findings demonstrate the potential of imaging data for myopia risk prediction which could be considered as a future direction. Moreover, most existing studies can only predict risk in 2–4 years, longer-term follow-up datasets are highly desirable for improving the ability of myopia prediction in the more distant future.

With available datasets, the next challenge is how to construct the prediction models. Previous studies mostly applied traditional regression or survival analysis, future research should explore new statistical methods and models for making better use of new data sources and complex datasets to improve the prediction performance. Considering the increasing availability and accessibility of medical data in the big data era, integrating multi-modality data across multiple domains for myopia prediction appeals to the practitioner in this field. By leveraging appropriate ML methods based on an adequate dataset for myopia prediction, identification of a better prediction model is possible.

How to validate or evaluate the model performance is another challenge to consider. Successful validation in independent datasets with stable performance, preferably in multi-ethnic populations, is warranted before the real-world application of the prediction model in clinical practice. In order to directly compare the performance of myopia prediction models from different studies, a large publicly available dataset should be established for external validation in the future. Furthermore, even after achieving an accurate myopia prediction model, how to properly deploy this tool in the clinical practice and how to protect patient information privacy present future challenges.

Conclusion

In summary, we provided a comprehensive review and research outlook of myopia prediction methods, datasets and performance, which could serve as a useful reference and valuable guideline for future research. We summarized the research methodology as an integrated process of outcome definition, data acquisition, predictor selection, model development and model evaluation. A thorough literature review was conducted which collectively suggested that age-specific SE was the currently known strongest predictor for myopia prediction, while the additive effect of data including lifestyle, genetic and imaging data was inconclusive. Many challenges existed in this emerging field of myopia prediction. With the development of new analytic methods and accumulating real-world medical datasets, future studies hold promise in better prediction ability of myopia onset and progression, leading us one step closer to the ultimate goal of identifying high-risk populations for timely targeted intervention in clinical practice. We want to emphasize that while more research and better prediction models are always helpful and needed, the major task and challenge to fight the current myopia epidemic are successful implementation of currently available effective myopia prevention strategies (e.g., increased outdoor time) and timely diagnosis and treatment for myopia individuals to minimize the risk of progression.