The usefulness of the Deep Learning method of variational autoencoder to reduce measurement noise in glaucomatous visual fields

The aim of the study was to investigate the usefulness of processing visual field (VF) using a variational autoencoder (VAE). The training data consisted of 82,433 VFs from 16,836 eyes. Testing dataset 1 consisted of test-retest VFs from 104 eyes with open angle glaucoma. Testing dataset 2 was series of 10 VFs from 638 eyes with open angle glaucoma. A VAE model to reconstruct VF was developed using the training dataset. VFs in the testing dataset 1 were then reconstructed using the trained VAE and the mean total deviation (mTD) was calculated (mTDVAE). In testing dataset 2, the mTD value of the tenth VF was predicted using shorter series of VFs. A similar calculation was carried out using a weighted linear regression where the weights were equal to the absolute difference between mTD and mTDVAE. In testing dataset 1, there was a significant relationship between the difference between mTD and mTDVAE from the first VF and the difference between mTD in the first and second VFs. In testing dataset 2, mean squared prediction errors with the weighted mTD trend analysis were significantly smaller than those form the unweighted mTD trend analysis.

Training dataset. All VF data recorded at the University of Tokyo Hospital between 2002 and 2018 was included in the training dataset ('Tokyo dataset'). The data consisted of 82,433 VFs from 16,836 eyes of 9,139 subjects. All the VFs were measured using the HFA (24-2 or 30-2 Swedish Interactive Threshold Algorithm, SITA, standard program). Reliability criteria applied were: fixation losses (FL) less than 33%, false-positive (FP) responses less than 33% and false-negative (FN) rate less than 33%.
Testing dataset 1 (analysis with test-retest dataset). This dataset included VF measured using the HFA (24-2 or 30-2 SITA standard program) twice within three months from one hundred and four eyes of 104 open angle glaucoma patients, prospectively recruited at the glaucoma clinic in the University of Tokyo Hospital.
All patients enrolled in the study fulfilled the following criteria, similarly to our previous study: 10 (1) no disease other than glaucoma that can cause VF damage; (2) at least two VF experience prior to the inclusion of this study; (3) glaucomatous VF defects defined as three or more contiguous total deviation points at p < 0.05, or two or more contiguous points at p < 0.01, otherwise a 10 dB difference across the nasal horizontal midline at two or more adjacent points, or MD worse than −5 dB 25 ; (4) visual acuities at least 6/6. Only reliable VFs were used in the analysis, defined as: fixation losses less than 33%, false-positive responses less than 33% and false-negative rate less than 33%.
Testing dataset 2 (trend analysis testing). This dataset consisted of 638 eyes of 417 patients with primary open glaucoma with ten VF records excluding an initial VF, with no ocular comorbidities other than glaucoma, that may affect the VF. The inclusion and exclusion criteria of this dataset were described elsewhere 26 . That is, all VFs were recorded using the HFA (SITA standard 24-2 or 30-2 test pattern with a Goldmann size III target), derived from ten institutes in Japan. Two test points correspond to the blind spot were excluded from the analyses. When a VF was measured using the 30-2 test pattern, only the 52 test points overlapping with the 24-2 test pattern were used to derive the mean total deviation value (mTD). Oher inclusion, exclusion and reliability criteria were identical to those in testing dataset 1.
Pre-processing visual fields with a variational autoencoder. The structure of the VAE model is shown in Fig. 1. This was built using the training dataset. The encoder is a 1-layer neural network consisting of 52 units (for each of the 52 TD values). This encoder is connected to 2 hidden layers consisting of 38 and 26 units, and is then represented by the mean and standard deviation of an eight-dimensional Gaussian probability density in the latent space. The decoder reconstructs the 52 TD (TD VAE ) values through a further 2 hidden layers and 1 output layer, which represents the reconstructed VF. This VAE model was optimized by maximizing the sum of the negative reconstruction loss, which is derived from the difference between the input VFs and reconstructed VFs and the Kullback-Leibler divergence between the distributions. mTD VAE was calculated as the mean of 52 the TD VAE values. In the VAE calculation, TD values were scaled to between 0 and 1 (before encoder), and then re-scaled back (after decoder) to the original values. structed using the trained VAE. Then the difference between mTD values in the first VF (the 'test' VF) and the VAE-reconstructed first VF (the 'test' VF) was calculated. In addition, the difference between mTD values in the first VF (the 'test' VF) and in the second VF (the 'retest' VF) was calculated. The relationship between these two difference values was investigated. mTD trend analysis. Using testing dataset 2, progression measured over all ten VFs (VF1-10) was regarded as a surrogate for true progression. The consistency of mTD trend analysis results was evaluated using the following measures, following our previous reports 23,24 : (1) Proportion of both progressing (PBP) was calculated as a surrogate measure for true-positive rate; i.e., where progression was significant in VF1-10 and also in shorter VFs (from VF1-9 to VF1-3). (2) Proportion of both not progressing (PBNP) was calculated as a surrogate measure for true negative rate; where progression in the complete series of VFs (VF1-10) was deemed "not significant", and progression also "not significant" in shorter subsets of VFs (from VF1-9 to VF1-5). (3) Proportion of inconsistent progression (PIP) was calculated as a surrogate measure for the false positive rate; classification based on the shorter series of VFs (from VF1-9 to VF1-5) was judged to be "significant" but progression in the complete series of VFs (VF1-10) was "not significant. " Further, mTD trend analysis was carried out using mTD values from the 1st to the 3rd VFs (VF1-3) of each patient, and the mTD values of the 10th VF test were predicted. The same procedure was carried out using the mTD values in longer series: VF1-4, VF1-5, VF1-6, VF1-7, VF1-8 and VF1-9, and the mTD values of 10th VFs were predicted every time.
In OLSLR, the regression line is decided so that the sum of squared residuals in the regression becomes minimum. In contrast, in a weighted linear regression analysis, the regression line is decided so that the sum of squared weighted residuals becomes minimum. To investigate the usefulness of the VAE for mTD trend analyses, the absolute difference between mTD and mTD VAE values were calculated for each VF in testing dataset 2 and a weighted mTD trend analysis (mTD VAE trend analysis) was performed using the difference as a weight in the regression (calculated as 1/absolute difference between mTD and mTD VAE values). Then, similarly to the standard mTD trend analysis, PBP, PBNP, PIP values and also prediction accuracy were calculated using the mTD VAE trend analysis and compared to those with the unweighted mTD trend analysis.
binomial PLR. The detailed calculation of the binomial PLR method is described in our previous report 23,24 . As detailed in our previous reports 23,24 , the assumption in PLR is that VF damage progresses linearly over time, similarly to the MD trend analysis [27][28][29] , where the null hypothesis was that the slope of VF progression was equal to 0. With this null hypothesis, slope p-values of coefficient from the linear regression can vary between the values of 0 and 1, where the numbers of test points with a p-value less than an arbitrary value would follow the binomial distribution. When this null hypothesis was rejected, a slope coefficient of zero is considered unlikely a result of random chance. In the current study, following our previous reports [22][23][24] , the significance of the entire VF progression was assessed using the four cut-off p values of 0.025, 0.05, 0.075, and 0.1. To represent these four p-values, the median p value was used 30,31 . A VF sequence was regarded as "significant" when the p-value calculated with the binomial PLR was <0.025; otherwise, it was "not significant. " Using this approach, PBP, PBNP, PIP, and the time to first detect a significant progression were calculated, similarly to the MD trend analysis.
Using the PBP, PBNP and PIP summary measurements, the accuracy of the weighted binomial PLR (binomial PLR VAE ) was compared where the weight values were calculated as (1/absolute difference between TD and TD VAE values). In addition, the number of VFs required to detect significant progression for the first time was calculated for each method. The sensitivity of each method to detect progression was assessed using Kaplan-Meier survival analysis and compared using the logrank test.

Results
Testing dataset 1 (analysis with test-retest dataset). Demographic summary data of testing dataset 1 is shown in Table 1. The mTD value in the second VF was significantly related both with the mTD value in the first VF (R = 0.83, p < 0.001, linear model) and mTD VAE derived from the 1st VFs (R = 0.84, p < 0.001). There was a significant positive relationship between the difference between mTD values in the first VF and the mTD VAE values derived from the first VF and the difference between mTD values in the first VF and mTD values in the second  Testing dataset 2 (mTD trend analysis). Demographic summary data of testing dataset 2 is shown in Table 2. Baseline mTD, follow-up period between VF1 and VF10, and mTD progression rate were − 6.9 ± 6.3 [Mean ± Standard Deviation] dB, 5.4 ± 1.1 years, and − 0.26 ± 0.46 dB/year, respectively. There was a significant relationship between the mTD and mTD VAE (p < 0.001, linear mixed model where random effects were subject and number of VF).The PBP values with the standard unweighted mTD trend analysis and the weighted mTD VAE trend analysis are presented in Fig. 3. PBP values were 0.33, 0.41, 0.55, 0.75, 0.78, 0.87, and 0.90 from VF1-3 to VF1-9 with the mTD trend analysis, respectively, whereas they were 0.41, 0.67, 0.68, 0.72, 0.75, 0.71, and 0.76, respectively, with the mTD VAE trend analysis. There was no significant difference in the PBP values of the two methods (P = 0.14, paired Wilcoxon test).
Binomial PLR. Figures 7,8,and 9 show the comparisons of PBP, PBNP, and PIP values between the unweighted binomial PLR and weighted binomial PLR VAE . The PBP values with binomial PLR ranged from 0.07 with VF1-3 to 0.81 with VF1-9, whereas those with binomial PLR VAE were between 0.21 with VF1-3 and 0.81 with VF1-9. The PBNP values with binomial PLR ranged from 0.89 with VF1-7 to 0.95 with VF1-3, whereas those with binomial PLR VAE were between 0.84 with VF1-3 and 0.92 with VF1-8 and VF1-9, respectively. The PIP values with binomial PLR ranged from 0.088 with VF1-9 to 0.51 with VF1-3, whereas those with binomial PLR VAE were between 0.090 with VF1-9 and 0.44 with VF1-3, respectively. There was not a significant difference in the values of PBNP and PIP (p = 0.078 and 0.078, paired Wilcoxon test), whereas the values of PBP with binomial PLR VAE were significantly higher than those with binomial PLR (p = 0.016, paired Wilcoxon test). Kaplan-Meier survival analysis and the logrank test indicated that the binomial PLR VAE detected significantly more progressing eyes than the binomial PLR, (P < 0.0001) (Fig. 10). The time to classification of progression with each method was: 6.8 ± 2.9 (mean ± SD) years with mTD trend analysis, 4.3 ± 1.6 years with binomial PLR VAE , and 4.7 ± 1.5 years with binomial PLR.

Discussion
In the current study, a VAE model was developed using 82,433 VFs from 16,836 eyes of 9,139 subjects. The usefulness of this method to improve the reproducibility of mTD measurements, and accuracy of trend analyses was investigated. VF reproducibility, in the form of test-retest mTD, was better using the VAE-derived measurement (mTD VAE ). The accuracy of mTD trend analysis and binomial PLR was enhanced using the mTD VAE ; further, the sensitivity of binomial PLR was improved using mTD VAE .
VF data are inherently associated with measurement noise. In the current study, it was suggested that it is beneficial to consider mTD VAE in addition to mTD itself to predict the mTD value in the retest VF. In particular, when mTD VAE took a larger value than the value of mTD in the first VF, mTD in the second VF tended to be larger than www.nature.com/scientificreports www.nature.com/scientificreports/ that in the first VF (suggesting the measurement may be under-estimated). Conversely when mTD VAE derived from the first VF took a smaller value than the value of mTD, mTD value in the second VF tended to be smaller than that in the first VF (suggesting the measurement may be over-estimated).
Visual field measurement noise has a considerable effect on the accuracy of trend analyses 6 . In the current study, the average mTD progression rate was −0. 26 33 . As shown by the analysis of the test-retest data in the current study, mTD VAE was related to mTD in the second VF after an adjustment for mTD in  the first VF. This implies that the accuracy of mTD trend analyses may be improved by considering the difference between mTD VAE and mTD. Indeed, the current results suggested that applying a weighted linear regression using these differences as weights yielded more accurate predictions compared to the unweighted approach (see Fig. 6).
We previously reported that applying least absolute shrinkage and selection operator (LASSO) in linear regression resulted in much more accurate prediction error than the conventional mTD trend analysis 34 . Similarly, we also reported that Variational Bayesian Linear Regression, in which the mTD trend analysis was optimized by considering the temporal and spatial VF defect patterns, enabled more accurate predictions of VF progression compared to the conventional mTD trend analysis 35,36 . The magnitude of increase in prediction accuracy with the mTD VAE trend analysis is smaller than that observed in these previously reported models. However, accuracy may be further improved by combining the current approach with these other regression models.
There have been previous studies which suggested the usefulness of non-linear regression, instead of liner regression, in VF trend analysis 17,[37][38][39][40] . However we previously investigated the usefulness of the application of such non-linear regression approaches (exponential, quadratic, and logistic regressions, as well as robust regression models) in VF trend analysis 41 . The experiment setting was very similar to that in the current study: future VF was predicted using prior (shorter) VF sequences. As a result, it was suggested no improvement of  the prediction error was obtained by any method, compared to the conventional ordinary least squares linear regression. In addition, there is another non-negligible drawback of the application of the non-linear regression model at the clinical settings; significance of the obtained non-linear curve cannot be calculated. Thus, although it may be of interest to further investigating the usefulness of applying the current approach to such non-linear regression models, the clinical usefulness would be limited.
There was not a significant difference between PBP (Fig. 3) and PIP (Fig. 4) values for the standard mTD trend analysis and the proposed mTD VAE trend analysis. This suggests that these methods have similar sensitivity and false positive rates when diagnosing progression. On the other hand, PBNP value with the mTD VAE trend analysis, however, was significantly higher than that with the conventional mTD trend analysis (Fig. 5), suggesting the new approach has better specificity. We previously reported that applying the binomial test to PLR resulted in improved PBP and PIP values compared to standard mTD trend analysis. We further investigated whether using a weighted PLR (with weights equal to the differences between TD and TD VAE values at each test point) is beneficial in binomial PLR. Sensitivity (Fig. 7) and the false positive rate (Fig. 8) were not significantly different between the two methods, however significantly higher PBNP values were obtained with binomial PLR VAE compared to binomial PLR (Fig. 9). Furthermore, the sensitivity to detect progression was significantly better with binomial PLR VAE than with binomial PLR (Fig. 10).
In the current study, there was not a significant positive relationship between the difference between mTD and the mTD VAE values, and FL, FP and FN. FL, FP and FN are the indices currently used to assess the reliability of measured VF. More specifically, FL, FP, FN is thought to indicates test reliability and vision fixation, "trigger-happy" patients, and inattention during an examination 25,[42][43][44][45][46] . While some past studies have reported on the usefulness of these indices 47,48 , more recent studies have suggested their limitations; for instance, FLs can also result from the mislocalization of the blind spot 49 and fixational instability can be found even in well trained observers 43,50 . A high FN rate is reported to be associated with the amount of field loss as well as threshold reproducibility 4 . The VF noise estimated by the difference between mTD and the mTD VAE values cannot be explained by these VF reliability indices, but we speculate that this is because of these limitations of these reliability measures. Another possible approach would be further investigating this issue using a microperimetry with retinal tracking, such as MP-3 (Nidel Co.Ltd., Aichi, Japan), because more accurate assessment of VF can be conducted preventing the effect of eye movement (mis-location) 51 .
One of the limitations of the current study is a lack of results from the HFA 10-2 test. Recent studies have revealed that it is recommended to measure the HFA 10-2 VF in addition to the HFA 24-2 52-57 . In addition, damage to this area of the VF is more directly associated with patients' vision related to the quality of life 58,59 A future study should be attempted shedding light on the usefulness of VAE in the HFA 10-2 test. In addition, various spatial filter methods, such as 60 , have been reported which are other possible approach to reduce VF noise. It would be of interest to investigate the usefulness of them compared to VAE in a future study. In addition, generative adversarial network (GAN) 61 is further another possible deep learning approach to reduce noise in VF. In general, GAN generates images which look more natural by human beings compared to VAE, however VF is not a material to be recognized the shape by human beings, so this merit may and may not be observed in VF. It would be of interest to compare the usefulness of VAE and GAN in a future study.
In conclusion, we developed a method to reconstruct the VF measurement using a deep learning method. The approach appears to be useful to predict MD value in the retested VF and also to improve the reliability of MD trend analyses and also binomial PLR.