Density-based classification in diabetic retinopathy through thickness of retinal layers from optical coherence tomography

Diabetic retinopathy (DR) is a severe retinal disorder that can lead to vision loss, however, its underlying mechanism has not been fully understood. Previous studies have taken advantage of Optical Coherence Tomography (OCT) and shown that the thickness of individual retinal layers are affected in patients with DR. However, most studies analyzed the thickness by calculating summary statistics from retinal thickness maps of the macula region. This study aims to apply a density function-based statistical framework to the thickness data obtained through OCT, and to compare the predictive power of various retinal layers to assess the severity of DR. We used a prototype data set of 107 subjects which are comprised of 38 non-proliferative DR (NPDR), 28 without DR (NoDR), and 41 controls. Based on the thickness profiles, we constructed novel features which capture the variation in the distribution of the pixel-wise retinal layer thicknesses from OCT. We quantified the predictive power of each of the retinal layers to distinguish between all three pairwise comparisons of the severity in DR (NoDR vs NPDR, controls vs NPDR, and controls vs NoDR). When applied to this preliminary DR data set, our density-based method demonstrated better predictive results compared with simple summary statistics. Furthermore, our results indicate considerable differences in retinal layer structuring based on the severity of DR. We found that: (a) the outer plexiform layer is the most discriminative layer for classifying NoDR vs NPDR; (b) the outer plexiform, inner nuclear and ganglion cell layers are the strongest biomarkers for discriminating controls from NPDR; and (c) the inner nuclear layer distinguishes best between controls and NoDR.

Step 1. Extraction of pixel intensities from OCT images to construct the PDFs.
Step 2. Transformation of the PDFs to allow for the comparison and modeling (with PDFs as data objects) using a comprehensive Riemannian-geometric framework.
Step 3. Classification of the subjects using principal component scores from PCA on the space of PDFs/ transformations. This paper is organized as follows: first, we describe the data acquisition and preprocessing steps. Then we present the results for the pairwise comparisons between NPDR, NoDR and controls, and evaluate the utility of seven retinal layers and also the overall retinal thickness. We compare these predictive results with those obtained through summary statistics. Then, we include some points of discussion based on the results of our analysis. Finally, we describe in detail (a) the Reimannian-geometric framework for the space of density functions and their transformations, (b) classification model by utilizing predictors generated from the PDFs, and (c) hypothesis tests for testing differences between PDFs of two groups of subjects.

Data
OCT scans of 107 subjects from three studies were obtained at the University of Michigan W. K. Kellogg Eye Center (Table 1). Among these subjects, 41 were healthy controls, 38 were NPDR subjects and 28 were NoDR subjects. The control group consists of subjects (age ≥ 18) who did not have diabetes. The NoDR group consists of subjects (age ≥ 18) who have been diagnosed with diabetes but no diabetic retinopathy. The NPDR group consists of patients with mild (ETDRS DR grade [20][21][22][23][24][25][26][27][28][29][30][31][32][33][34][35] or moderate (ETDRS DR grade 43-53) NPDR. Subjects with other systemic or ocular diseases that could affect vision were excluded. The OCT data sets were obtained with spectral-domain optical coherence tomography (Spectralis HR + OCT; Heidelberg Engineering, Inc., Heidelberg, Germany). Each subject had only one eye selected to be included in this study (left or right) 10 . The thickness maps of the retina (full thickness) and seven individual retinal layers (NFL: Nerve Fiber Layer; GCL: Ganglion Cell Layer; IPL: Inner Plexiform Layer; INL: Inner Nuclear Layer; OPL: Outer Plexiform Layer; ONL: Outer Nuclear Layer; RPE: Retinal Pigment Epithelium) were generated from OCT scans using the built-in segmentation algorithm of Spectralis OCT Module with default parameters. Thickness maps of each layer and the whole retina were exported manually as TIF images. Detailed information of this data set can be found in Joltikov et al. 10 . Vol.:(0123456789)

Scientific RepoRtS
| (2020) 10:15937 | https://doi.org/10.1038/s41598-020-72813-x www.nature.com/scientificreports/ ETDRS rings of 1, 3, and 6 mm were placed on each thickness map to guide the analysis. The location of the ETDRS rings was carefully adjusted and reviewed by XDC. All left eye images were flipped horizontally before downstream analysis to make sure the thickness in each region in the ETDRS ring is comparable across images. Thickness matrices were obtained by matching pixels in the exported thickness heatmap to the closest value of the image's color bar. Only pixels that fall inside the ETDRS ring were utilized for the analysis. Extracted thickness values are measured in units of µ m. Missing pixel values (e.g. areas where the thickness could not be measured or areas that are inside the ETDRS ring but outside of the image) are excluded from the construction of PDFs and other summary statistics.

Results
Our data consists of thickness of seven retinal layers as well as the overall retinal thickness. In total, 107 subjects (38 NPDR, 28 NoDR, and 41 controls) were represented in the study. Classification models were built as described in the Methods section, for the pairwise comparisons of these categories. To do this, PDFs corresponding to each

Raw Thickness Map
Thickness Matrix

Obtain OCT Scans
Overall Thickness

Subjects Subjects
Logistic Regression

Logistic Regression
Step 0: Data acquisition   www.nature.com/scientificreports/ of these subjects for all retinal layers were constructed separately. In Figs. 2 and 3, we showed the PDFs for each of the three patient categories corresponding to the complete thickness map and the seven retinal layers. In each of these figures, along with the subject-wise PDFs, we also plotted the Karcher mean (solid line), which is the average PDF of the subjects in each category (details in Methods section). Each of these layers were considered separately to assess their utility in the pairwise classification problem. For each retinal layer and each pairwise comparison, we obtained the principal component scores X derived from the space of PDFs as described in the Methods section. Classification results from our density-based method for all three pairwise comparisons across seven retinal layers as well as overall thickness (denoted as ALL) are presented in Table 2. A leave-one-out cross-validation approach is employed for prediction. We compute the area under the receiver operating characteristic curve (AUC) and its 98.3% confidence interval (CI) which is Bonferroni corrected for multiple comparisons (to obtain a 95% overall confidence level). These are Wald-type CIs and are computed using the DeLong's variance estimator 11 using the pROC 12 package in R. We also report the sensitivity, specificity and the Brier score 13 .
In the comparison between NoDR and NPDR, the OPL was identified as the best discriminator, with an AUC of 0.679 with a 98.3% CI as 0.513-0.844. For controls vs NPDR, the GCL, INL and OPL were identified as statistically significant discriminators, with AUCs and the 98.3% CIs of 0.703 (0.560-0.847), 0.799 (0.682-0.916), and 0.802 (0.680-0.924), respectively. The INL shows higher utility in distinguishing between controls and NoDR with an AUC of 0.703 and 98.3% CI as (0.543-0.863). The 98.3% CIs for AUC for all the other layers and pairwise comparisons included 0.5. Prediction results derived from simple summary statistics (five-number summary: mean, minimum, maximum, first and third quartiles) are presented in Table 3. For the comparison between   Tables 2 and 3 indicate that the predictive models using the density-based features outperform the models with summary statistics in terms of the AUC and CI for the layers with reasonable discriminative power (lower bound for CI> 0.5).
For the comparison between NoDR vs NPDR, NFL performs best with summary statistics as predictors whereas OPL is most predictive with the density-based features. Note that although their AUCs are the highest (compared to other layers) for NoDR vs NPDR, the lower bounds of the associated CIs are close to 0.5. Hence, these results need to be interpreted with caution unlike other comparisons (e.g. INL in controls vs NPDR), where the indication of performance is stronger both in terms of AUCs and their CIs. Additionally, the evident differences (wiggly peaks) between average PDFs of NoDR and NPDR in OPL (see Fig. 3), are better captured by density-based features compared to summary statistics that fail to capture such nuances in the distribution of pixel intensity values. Similar arguments hold for the comparison between controls vs NPDR in GCL (see Fig. 2). However, the average PDFs of NFL in NoDR and NPDR (see Fig. 3) do not exhibit any evident visual differences. In this case, the PDF-based approach might not be adding additional insight compared to the summary statistics. All other discriminative layers show concordant performance in terms of AUCs with the density-based features performing better for the best predictive layers. Therefore, the ability of our functional data approach to capture www.nature.com/scientificreports/ small-scale changes in the PDFs of such high-resolution OCT images makes it a useful complement to existing approaches using summary statistics. In Fig. 4, we show the pairwise comparison of Karcher means between groups for those layers which show better predictive performance in Table 2. This visual representation of the mean PDFs shows the differences in thickness profiles between the two groups based on sample averages of the PDFs. For example, the INL is most predictive for the comparison between NoDR and controls with an AUC of 0.703 and a 98.3% CI as (0.543-0.863). The plot in the top-left panel in Fig. 4 displays a higher peak for controls at 40 µ m thickness; however, NoDR has slightly higher right tail. Specifically, from the mean PDFs we see that 29% of the ETDRS grid region has more than 50µm thickness in controls compared to 39% in NoDR. These percentages were computed using the quantiles of the Karcher means. The top-left panel in Fig. 4 indicates that the INL has slightly higher thickness in the NoDR compared to the controls. To further investigate this, we perform permutation-based hypothesis tests using the sample PDFs for each group of subjects. We test the hypothesis that the average PDFs of the groups are similar to each other or not. The details of the permutation test are given in the Methods section. In Fig. 4 and Table 4 we report the p-value from this test to quantify the statistical differences between the two groups of PDFs. The p-values are Bonferroni adjusted for multiple comparisons. For the INL in the comparison between controls and NoDR, the p-value for the permutation test is 0.0166 indicating statistically significant differences between the average PDFs of the two groups.
We perform a similar analysis for the other two pairwise comparisons. For the comparison between controls and NPDR, the GCL, INL and OPL show better predictive performance with the AUC and the 98.3% CI as 0.703 (0.560-0.847), 0.799 (0.682-0.916) and 0.802 (0.680-0.924), respectively. The p-values for the permutation tests for these three layers are 0.3453, 0.0001 and 0.0003, respectively. This indicates significant differences between the average PDFs of the thickness profiles of OPL and INL. However, the test does not reveal statistically significant differences in GCL between the controls and NPDR, which is consistent from the similar average PDFs for controls and NPDR in GCL (bottom-left panel in Figure 4), and its lower AUC compared to INL or OPL. From the mean PDFs for INL we see that 29% of the ETDRS grid region has a thickness of more than 50 µ m in controls compared to 40% in NPDR. Similarly, a thickness of more than 50 µ m is observed in 30% of the ETDRS grid region of OPL in controls compared to 38% in NPDR. Specifically, from Fig. 4 we see that the thickness of the INL and OPL is slightly higher in the NPDR compared to the controls. This is also indicated due to the higher peak for controls near 40 µ m thickness and a heavier right tail for NPDR. For the comparison between NoDR and NPDR, the OPL shows reasonable predictive performance with the AUC and the 98.3% CI as 0.679 (0.513-0.844). The p-value for the permutation test is 0.0174, which identifies significant differences between the Table 2. Results of the predictive performance using density-based features as predictors for pairwise comparisons among controls, NoDR and NPDR based on seven retinal layers and overall retinal thickness. ALL stands for the overall thickness. The layers which are statistically significant in demonstrating discriminatory power for each pairwise comparison (lower bound for CI > 0.5) are shown in bold font.    Specifically, 30% of the ETDRS grid region in OPL has a thickness of more than 50µm in controls compared to 38% in NPDR. The p-values for all the other layers and pairwise comparisons are provided in Table 4. Note that these p-values are Bonferroni corrected for the three pairwise comparisons. We further investigate these differences in thickness of the layers through the variability in the subjects within each group. Figure 5 shows the path sampled with − 2, − 1, 0, + 1, + 2 standard deviations around the mean along the first principal component direction transformed back to the space of PDFs. In the comparison between controls and NoDR for the INL, the sample path of PDFs in the first row of Fig. 5 indicates that these PDFs have slightly higher peaks at lower thickness values in controls compared to NoDR. Similar results can be observed for (a) the comparison between NoDR and NPDR in the OPL (second row in Fig. 5), and (b) the comparison between controls and NPDR in the INL and OPL (fourth and fifth rows in Fig. 5). However, the differences in GCL between the controls and NPDR (third row in Fig. 5) are not apparent, which is in agreement with our earlier result in terms of the smaller AUC for GCL. From these results, we see that we achieve better classification of subjects when each of the retinal layers are considered separately instead of solely looking at the complete retinal thickness. Each of these retinal layers are either thickened or thinned for a group of subjects and some of them offer better distinction ability for specific pairwise comparison between groups.

Discussion
We evaluated the utility of seven retinal layers and the overall thickness for the pairwise comparison between different severity levels of DR. Raw pixel intensities were transformed into density functions under a Riemanniangeometric framework and then principal component analysis was applied to generate sample specific features. Our density-based method was applied to a prototype data set and showed considerable improvement in predictive power when using individual layers as predictors instead of the total retinal thickness. According to our observations, OPL demonstrates the best performance in classifying NPDR versus NoDR. Similarly, for the classification of controls versus NPDR we see that OPL, INL and GCL provide the best performance. INL is able to distinguish between NoDR and controls. These findings are consistent with results from previous studies that demonstrated the disruption of inner retinal layers early in the course of diabetic retinopathy [14][15][16][17] . Hence, our analysis has identified potential biomarkers in distinguishing between each category of the severity of diabetic retinopathy and should be further investigated.
The postmortem human retina from donors with diabetes is found to have increased number of cell deaths in retinal neural cells even in areas away from the microvascular lesions 14 . Apoptosis of retinal neural cells was observed in diabetic rats induced by streptozotocin 14 . Consequently, their retina had 10% loss of surviving ganglion cells and significant thinning of IPL and INL after 7.5 months of streptozotocin. In a separate experiment, streptozotocin diabetic rats killed after 12 months also had thinner GCL and INL, though changes in INL were more remarkable 15 . Interestingly, several studies have reported retinal NFL defects as an early sign of DR [18][19][20] . Significant loss in retinal NFL was observed in patients with type 1 diabetes without retinopathy using scanning laser polarimetry 19 . Other studies 18,20 also showed that the retinal NFL becomes thinner and the nerve fiber defects in the retina increase as retinopathy progresses. However, our analysis suggests that changes in the INL associated with diabetes have higher differentiating capabilities than the early loss of NFL. In our study, we showed that the change in INL was the best indicator to differentiate NoDR from controls. This is validated from our results due to the discriminative capability and the p-values corresponding to INL in the comparison between controls and NPDR. Also, disorganization of retinal inner layers (DRILs) is associated with inner retinal thinning in patients with NoDR 10,21 . When comparing between NPDR and NoDR or controls, inner nuclear, outer plexiform, and ganglion cell layers were better predictors. Hence, future studies should investigate the mechanisms of these pathological changes, which may provide information about the progression of this disease and identify new therapeutic target for future treatments.
Our method has multiple advantages when compared to models which use summary statistics from the images as predictors. Our approach is based on a principled way of analyzing the heterogeneity in the thickness values from the high-resolution OCT images. We use the entire distribution of the pixel-wise thickness and capture the complete information presented. In addition, our method is invariant to scaling and shifting of thickness values and can be applied to thickness measured by different OCT instruments. Lastly, using PDFbased features (i.e. the principal component scores) instead of scalar summaries (i.e. the average thickness) to represent each sample allows the construction of more sophisticated machine learning models. By constructing densities which best model the heterogeneity in our sample data, we create more illustrative models of the sample distribution, which enable richer contextual inferences, such as our developments on layer variation mentioned above. Exploring the space of densities provides better insight into the heterogeneity of the thickness distributions. Consequently, variability in these density functions is effectively captured through the principal  www.nature.com/scientificreports/ components although the mean densities of certain groups visually appear to be similar. As part of our future works we plan to further investigate these findings in terms of the utility of the layers with larger sample sizes to establish comparison with controls matched for specific clinical characteristics. Additionally, a larger sample size for each pairwise comparison with a reasonable way to split the data into training and validation sets that captures the complete underlying variation would facilitate meaningful validation.
With the recent advancements in the field of data analysis and specifically its applications to imaging data, many machine learning approaches have been deployed to learn from the data. However, such algorithms usually require huge amounts of data and computational tools to efficiently train the model. In this specific application to diabetic retinopathy, obtaining large amounts of OCT data is challenging due to the associated costs. In such situations, making meaningful inferences from the data available is only feasible by employing model-based analytic approaches. With 38 NPDR, 28 NoDR and 41 controls, the sample size is a limitation of our OCT data to perform pairwise comparisons. However, our density-based modeling approach provides an efficient and innovative alternative by maximizing the utility of the information in the data. Further investigation involving more samples will facilitate comparison with matched controls.
In summary, this study showed that, the density of thickness values measured by the OCT instrument is able to capture the information that cannot be captured by the descriptive statistics. It can serve as an important biomarker of NPDR to aid the research and diagnosis of DR. Our proposed proof-of-concept study could be applied to larger cohorts of OCT data and other ophthalmology imaging data to understand the retinal structuring in DR which could arise due to subtle and small-scale changes in the thickness of the various layers.

Space of probability densities and their transformations. The PDFs computed based on the pixel
values inside the ETDRS grid, lie on a non-linear functional space. We exploit the differential geometry of this space, however, we restrict ourselves to the case of univariate densities on [0, 1]. For this purpose, we normalize the PDFs so that they belong to the Banach manifold dx . This FR metric has nice mathematical properties such as being invariant to reparameterizations of densities 25 . However, one of the drawbacks of this metric is the challenge it presents in computing the geodesic paths and distances. This challenge arises as the metric changes from point to point on F.
Square-root transformations of the PDFs. Since our goal is to build classification models, we want to work with data objects in a suitable representation of the space F , where the geometry is not as complex as in the Banach manifold F . Specifically, we work with transformations on the PDFs such that the complex non-linear space changes to a much simpler space where the computation is feasible. One such convenient choice of representation for PDFs, is the square-root transformation (SRT), h = + f 26 , since it allows to measure the distance between any two points in F as a standard L 2 Reimannian metric 27 . We omit the '+' sign hereafter for notational convenience. The inverse mapping is unique and is simply given by f = h 2 , and hence the space of the SRTs corresponding to F is given by Here H represents the positive orthant of the unit Hilbert sphere 28 . The tangent space at any point h ∈ H is defined as The FR metric can now be defined using the geometry of the space of SRTs. The geodesic distance between two densities f 1 , f 2 ∈ F , represented by their SRTs h 1 , h 2 ∈ H , is defined as the shortest arc connecting them on H , that is, dx . This is also the standard L 2 distance between h 1 , h 2 ∈ H . Geodesic distance between two PDFs can now be computed in an efficient manner as the L 2 Riemannian geometry of the unit sphere is well known.
Karcher mean for a sample of PDFs. The geometry of H can also be used to define an average (mean) PDF, which is a representative PDF of the pixel intensity values for a sample of subjects. This average PDF allows us to visualize and summarize the PDFs from the sample. Let f i denote the PDF corresponding to the pixel values inside the ETDRS grid for subject i for all i = 1, . . . , n and h 1 , . . . , h n be their corresponding SRTs. A generalized version of the mean on a metric space that can be used to compute the average density is called the Karcher mean 29 . Specifically, as the unique inverse transformation 27 of the SRT is given by f = h 2 , the sample average of PDFs f 1 , . . . , f n can be computed as f =h 2 , where h is the sample average on the space of SRTs. The sample Karcher mean h on H is the minimizer of the Karcher variance ρ(h) = n i=1 d(h, h i ) 2 L 2 , that is, h = argmin h∈H ρ(h) . Gradient-based iterative approaches are utilized to compute the Karcher mean on 30,31 . Note that the Karcher mean of the sample SRTs is an intrinsic average that is computed directly on (or equivalently F ), hence we have a mean which is an actual PDF.
The computations require important tools from differential geometry called the exponential and inverseexponential maps. For h ∈ H and δh ∈ T h (H) , the exponential map at h, exp : T h (H) → H is defined as exp h (δh) = cos(||δh||)h + sin(||δh||)δh/||δh|| , where ||δh|| 2 = Here the orthogonal matrix U contains the principal components or principal directions of variability, and the diagonal matrix contains the singular values. The number of subjects is usually smaller than the dimensionality of each tangent vector, i.e., n ≪ m . Note that the first r columns of U (denoted as Ũ ∈ R m×r ) span the r-dimensional principal subspace. The choice of r could be made based on the cumulative amount of variance explained by the first few principal components. We can express the data using coordinates in this subspace via principal coefficients computed as X = VŨ , where V ⊤ = [v 1 v 2 . . . v n ] ∈ R m×n . These principal coefficients X act as Euclidean coordinates corresponding to densities and can be used as predictors for downstream analysis.
Note that for our pairwise analysis the tangent space is considered at the Karcher mean of PDFs corresponding to only the subjects from the two categories in the pair being evaluated. The number of columns to include in X is determined as the number of principal components required to explain 99.99% of the total variation. Classification model. Our main goal is to build classification models to discriminate between any two categories of subjects. To address this we can consider the principal coefficients X as our predictors. These principal coefficients are predictors derived from the PDFs corresponding to the image of a retinal layer. Let us consider the variable y i ∈ {0, 1} as the response indicating the class membership of the subject i for all i = 1, . . . , n . For each of subject i, using the retinal layer map we can construct a PDF f i based on the pixel values in the ETDRS grid. Using the approach discussed earlier in this section, we can construct the corresponding principal component scores X ∈ R n×r which can further be used as covariates in the Euclidean space. Once we obtain the covariates X and the response y = (y 1 , . . . , y n ) , we can use standard classification algorithms to build discriminative models.
In this paper we considered logistic regression which is a generalized linear model used to model a binary categorical variable using numerical and/or categorical predictors. We assume a binomial distribution produced the outcome variable and we therefore want to model p i = P(y i = 1) , the probability that a subject belongs to the category 1 for a given set of predictors. More specifically, in the logistic regression we model the log-odds as a linear combination of the predictors as log p i 1−p i = x ⊤ i β , where x i is the ith row in X and β ∈ R r . Standard estimation approaches can be used to estimate the coefficients β . Once we obtain the estimated coefficients β , we can use them to predict the probability of class membership for a new subject. That is, for a new subject, we can estimate p new = 1/ 1 + e −x ⊤ newβ . We considered logistic regression, but other classification algorithms can also be used to build classification models. Hypothesis testing. We build a permutation-based hypothesis test to further investigate differences in the PDFs corresponding to the subjects for a given layer. That is, we want to investigate the hypothesis that the average PDFs for the groups (based on the binary response) are similar to each other or not. The similarity between these average PDFs is quantified by the geodesic distance between them. Using similar notation as before, let y i ∈ {0, 1} and f i denote the binary response and the PDF for subject i, respectively. Let f 0 and f 1 denote the Karcher mean of the PDFs for the subjects corresponding to y i = 0 and y i = 1 , respectively. We define d 0 = d(h 0 ,h 1 ) as the distance between f 0 and f 1 , where h 0 and h 1 are the SRTs corresponding to f 0 and f 1 . This value of d 0 serves as our test statistic for the hypothesis test.
We create the null distribution corresponding to the test statistic by randomly permuting the response labels y i between the subjects i = 1, . . . , n . Let y σ = (y σ (1) , . . . , y σ (n) ) denote a random permutation of the response labels y 1 , . . . , y n . Using the permuted response labels y σ and the original PDFs f 1 , . . . , f n we compute the group average PDFs f 0 σ and f 1 σ , and the distance between these average PDFs as d σ . We repeat this process m times by considering the permutations σ 1 , . . . , σ m and obtain the distance between the group average PDFs for each permutation as d σ 1 , . . . , d σ m , which serve as the null distribution for our test statistic d 0 . The p-value for this permutation-based hypothesis test is computed as m k=1 I(d 0 > d σ k )/m , where I(d 0 > d σ k ) = 1 if d 0 > d σ k , and 0 otherwise. Scientific RepoRtS | (2020) 10:15937 | https://doi.org/10.1038/s41598-020-72813-x www.nature.com/scientificreports/ Approval, accordance and informed consent. This study involving human subjects was approved by University of Michigan Institutional Review Board and conducted in accordance to the Declaration of Helsinki. All patients signed an informed consent form prior to enrollment.

Data availability
The data that support the findings of this study are available from University of Michigan Medical School but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with appropriate permissions of University of Michigan Medical School.