Comparing automated and manual assessments of tear break-up time using different non-invasive devices and a fluorescein procedure

The aim was to evaluate the agreement and repeatability of an automated topography-based method for non-invasive break-up time (NIBUT) analysis in comparison with two other NIBUT procedures, with the fluorescein procedure (fBUT), and with the manual assessment on the same device. In the first experiment, a semi-randomised crossover study was performed on forty-three participants (23.1 ± 2.1 years). NIBUT measurements were collected in a randomised order, in both eyes of participants, with EasyTear View+ (Easytear, Rovereto), Polaris, and Sirius+ (CSO, Firenze). Then fBUT was collected. The overall measurement procedure was repeated in a further session (retest) on the same day. In a second experiment, a retrospective randomised crossover study was performed on eighty-five NIBUT videos previously recorded by the Sirius+. Two observers manually assessed the videos, and the manual NIBUTs were compared with the automatic ones. In the first experiment, ANOVA showed a significant difference between the four measures in both eyes (p < 0.001). Significant differences were found in the paired comparisons between each NIBUT procedure and fBUT (Wilcoxon; p < 0.05). In the left eye, Sirius+ was in agreement only with Polaris. Correlations between all NIBUT procedures were statistically significant in both eyes. All procedures showed very good test-retest reliability. In the second experiment, a significant correlation between automated and manual NIBUT was found, together with a statistically significant but clinically negligible difference between the two measurements (0.3 s). The investigated NIBUT devices perform differently from each other (and from fBUT), so they cannot be considered interchangeable. The automated measure of NIBUT with Sirius+ has a negligible clinical difference compared to manual assessment on the same device.

The tear film is a thin structure (about 2.0-5.5 µm thick 1,2), extremely sophisticated in functioning and composition, with a crucial role in maintaining ocular surface physiology. The assessment of the tear film is of paramount importance in diagnosing dry eye disease (DED) 3. One aspect of the tear film which is crucial to investigate for DED diagnosis is its stability [3][4][5]. Many factors determine the stability of the tear film, such as a sufficient and balanced production of the main components, which have to be spread efficiently on the ocular surface by the blinking system 5. According to the three-layered model of the tear film, stability is maintained by the prevention of evaporation by the outer lipid layer, the increase of volume and lubricity by the aqueous layer, and the reduction of hydrophobicity of the corneal epithelium by the inner mucin layer.
The lack of stability can be measured by the tear break-up time (TBUT), defined as the interval of time that elapses between the end of a complete blink and the appearance of the first break in the tear film 3,4. The first TBUT procedure, also known as fluorescein BUT (fBUT), was introduced by Norn in 1969 6, who proposed to instill sodium fluorescein dye in the tears to detect breaks by using a biomicroscope and cobalt blue light. The "magic" number of 10 s would indicate the cut-off between normal and abnormal tear film 6. Notwithstanding the clinical fortune of fBUT, which became the most common test for tear film assessment [7][8][9], it has been largely recognised for its poor reliability 10,11 (mainly linked to fluorescein invasiveness) 12. Variations in the fBUT procedure have been proposed to improve reliability, such as a reduction and control of the amount of sodium fluorescein used [13][14][15], or performing multiple measures 16 on different occasions 17, etc. However, the best way to measure the stability of the tear film should be to use a non-invasive approach 3,18 that should avoid altering the tear

Participants
To evaluate the sample size needed for the study, an a priori analysis was performed with the G*Power software (version 3.1.9.4) on preliminary NIBUT and fBUT data measured with the same instruments and procedures used in this study and obtained at the Research Centre (hereinafter referred to as Lab) where the experiment was carried out. From the distribution data (mean and SD) and the correlation between them, an effect size of 0.40 was worked out. Considering the need to verify the difference between the means of two repeated tests (NIBUT vs fBUT), the analysis type was set to matched-pairs t-test (two-sided). Fixing the α error and 1-β (power) at 0.05 and 0.80 respectively, the resulting sample size was N = 41.
Thus, forty-three participants (age: 23.1 ± 2.1 years; range 18.1-29.3 years; sixteen males and twenty-seven females) were enrolled in the study on a voluntary basis. The inclusion criteria are reported in Table 1. Any dry eye symptoms were monitored with the Ocular Surface Disease Index (OSDI) questionnaire (average score: 10.7 ± 10.0; range: 0.0-39.6).
All participants gave written informed consent, and all procedures conformed to the Declaration of Helsinki and were approved by the Board of Optics and Optometry of the University of Milano-Bicocca (February 11th, 2019).

Instruments
Three different devices were used to collect NIBUT data. Two devices, the EasyTear View+ (Easytear, Rovereto, Italy) and the Polaris (CSO, Florence, Italy), have a similar structure, with a cylindrical internal light source and a diffuser that project diffuse cold light (white LED). Inserting specific grids inside the cylindrical light source allows concentric rings to be projected onto the tear film (Fig. 1), so that irregularities of the reflected image can be detected. Both instruments were mounted on a digital slit lamp (HR Elite, CSO, Florence, Italy) that allows video recording. The third device, the Sirius+ (CSO, Florence, Italy), is a Placido disc topographer integrated with a Scheimpflug tomographer (Fig. 1). The algorithm integrated in the dedicated software (Phoenix v.4.0, CSO, Florence, Italy) splits the Placido disc's ring projection into a pre-set number of circular sectors (tiles) with the same area. For each sector, the algorithm keeps track of the changes (disruption of the projected ring) in the sector's structure over time. Only changes that persist until the end of the recording are considered break-ups, whereas a change that is restored to its original shape by the end of the recording is considered a false positive due to possible artifacts (e.g. small elements moving in the tear film layer). Disruptions of the projected ring that are visible from the beginning of the recording, such as eyelash shadows, are excluded from the processing. The algorithm can provide either the first break-up regardless of the sector or the break-up map, in which the first break-ups are displayed topographically for each sector.
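The persistence rule described above can be illustrated with a minimal Python sketch. This is not the Phoenix implementation: the input format (one boolean "disrupted" flag per sector and per video frame), the frame rate, and the function names are assumptions introduced only for illustration.

```python
from typing import Dict, List, Optional


def sector_break_up_time(disrupted: List[bool], fps: float) -> Optional[float]:
    """Return the break-up time (s) of one sector, or None if there is none.

    `disrupted[i]` is True when the ring image of this sector is disrupted
    in frame i (hypothetical input format, not the device's data model).
    """
    if not disrupted or not disrupted[-1]:
        # Nothing persists until the end of the recording: any earlier
        # change was restored and is treated as a false positive.
        return None
    # Walk backwards to the frame where the final, persistent run starts.
    start = len(disrupted) - 1
    while start > 0 and disrupted[start - 1]:
        start -= 1
    if start == 0:
        # Disrupted from the first frame (e.g. eyelash shadow): excluded.
        return None
    return start / fps


def first_break_up(sectors: Dict[str, List[bool]], fps: float) -> Optional[float]:
    """Earliest valid break-up time over all sectors, or None."""
    times = [sector_break_up_time(frames, fps) for frames in sectors.values()]
    valid = [t for t in times if t is not None]
    return min(valid) if valid else None


F, T = False, True
example = {
    "transient": [F, F, T, T, F, F, F, F],   # restored: false positive
    "persistent": [F, F, F, F, T, T, T, T],  # breaks at frame 4
    "eyelash": [T, T, T, T, T, T, T, T],     # present from the start
    "late": [F, T, F, F, F, F, T, T],        # breaks at frame 6
}
print(first_break_up(example, fps=2.0))  # 2.0
```

The per-sector values returned by `sector_break_up_time` correspond to the idea of a break-up map, while `first_break_up` corresponds to the first break-up regardless of the sector.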

Procedure
All measurements were performed in the same Lab following the procedure reported in Fig. 2. The same researcher performed all NIBUT measurements by employing the three devices in a randomised order, with an interval of at least 10 min between the different procedures to wash out any potential tear film destabilization due to the previous measurements 35,36. For each instrument, three consecutive NIBUT measurements were taken for each eye.
EasyTear View+ and Polaris measurements were video recorded with the digital slit lamp. As for the Sirius+, the standard length of the NIBUT video recording by the software was extended to 50 s to allow the detection of long break-up times. After the NIBUT measurements, the same researcher performed a standard fBUT three times in a row for each eye. fBUT was always carried out at the end due to its invasiveness compared to the NIBUT measurements. The fBUT was performed with fluorescein sodium strips (I-DEW FLO, Endot, UK) used according to the Pult & Riede-Pult procedure 14, with a slit lamp (HR Elite, CSO, Florence, Italy) and cobalt blue and yellow filters. The fBUT was video recorded with the digital slit lamp. Subjects, as for the non-invasive devices, were asked to blink twice and then to try to avoid blinking for as long as possible. Test-retest reliability was evaluated by performing the same series of measurements (according to the order randomly selected for each specific subject) on the same day, at least 2 h after the first set of measurements.

Data analysis
All the following data analyses were carried out for the right and left eye separately 37. None of the data sets were normally distributed (Shapiro-Wilk test; p < 0.005), thus non-parametric statistics were used. The agreement among the four BUT assessment procedures was investigated by Friedman's test, then a matched comparison (Wilcoxon signed-rank test) was performed between each pair of measurements. Bonferroni adjustment was used to correct for multiple comparisons in the post-hoc analyses. The Spearman coefficient of correlation was also calculated for each pair of measurements.
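As a small worked example of the Bonferroni adjustment, the sketch below counts the pairwise comparisons among the four procedures and divides the nominal alpha by that count. The helper names are hypothetical and the p-values in the usage example are invented; the resulting threshold of about 0.008 matches the value quoted in the caption of Table 2.

```python
from itertools import combinations
from typing import Dict, Tuple

PROCEDURES = ["EasyTear View+", "Polaris", "Sirius+", "fBUT"]


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance threshold after Bonferroni adjustment."""
    return alpha / n_comparisons


def significant_pairs(
    p_values: Dict[Tuple[str, str], float], alpha: float = 0.05
) -> Dict[Tuple[str, str], bool]:
    """Flag each paired comparison whose p-value is below the adjusted alpha."""
    threshold = bonferroni_alpha(alpha, len(p_values))
    return {pair: p < threshold for pair, p in p_values.items()}


# Four procedures give six unordered pairs, so alpha becomes 0.05 / 6.
pairs = list(combinations(PROCEDURES, 2))
print(len(pairs), round(bonferroni_alpha(0.05, len(pairs)), 3))  # 6 0.008

# Invented p-values, for illustration only (not the study's results).
demo = {("A", "B"): 0.001, ("A", "C"): 0.030}
print(significant_pairs(demo))
```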
Intra-observer repeatability was evaluated with the coefficients of precision (CP), repeatability (CR) and variation (CV). CP was calculated as $1.96 \times s_w$ (where $s_w$ is the within-subjects standard deviation for repeated measures). CR was calculated as $1.96 \times \sqrt{2 s_w^2}$, that is, the value below which the difference between two measurements is expected to lie with 95% probability 38. CV was calculated as $s_w$ divided by the overall sample mean.
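The three coefficients can be computed as in the following sketch, which reads the CR expression as 1.96 times the square root of 2·s_w² (equivalently 1.96·√2·s_w). It assumes the same number of repeated measures for every eye, so that s_w² can be taken as the mean of the within-subject variances; the function names and the demo values are illustrative and are not the study data.

```python
from math import sqrt
from statistics import mean, variance
from typing import Dict, Sequence


def within_subject_sd(repeats: Sequence[Sequence[float]]) -> float:
    """s_w: square root of the mean within-subject variance.

    Assumes every subject has the same number (>= 2) of repeated measures.
    """
    return sqrt(mean(variance(r) for r in repeats))


def repeatability_coefficients(repeats: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Coefficients of precision, repeatability and variation."""
    s_w = within_subject_sd(repeats)
    overall_mean = mean(x for r in repeats for x in r)
    return {
        "CP": 1.96 * s_w,            # precision
        "CR": 1.96 * sqrt(2) * s_w,  # repeatability: 1.96 * sqrt(2 * s_w**2)
        "CV": s_w / overall_mean,    # variation (as a fraction of the mean)
    }


# Three consecutive break-up times (s) for three hypothetical eyes.
demo = [[10.0, 12.0, 14.0], [8.0, 8.0, 8.0], [20.0, 22.0, 24.0]]
for name, value in repeatability_coefficients(demo).items():
    print(name, round(value, 3))
```

With these definitions CR is always √2 times CP, so the two coefficients carry the same information on different scales.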
Test-retest reliability was evaluated for each procedure (mean of the three measures at test and mean of the three measures at retest) by the Intraclass Correlation Coefficient (ICC) based on mean measurement, absolute agreement, and a two-way mixed effects model 39. The 95% confidence interval was calculated. Reliability is considered slight, fair, moderate, substantial and excellent if the ICC lies between 0.01 and 0.20, 0.21 and 0.40, 0.41 and 0.60, 0.61 and 0.80, and above 0.80, respectively 40. A comparison between test and retest was also performed by the matched-pairs Wilcoxon test. The statistical analyses were performed with SPSS version 2.8 (IBM SPSS Statistics, USA).
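For readers who want to reproduce this kind of analysis outside SPSS, the sketch below computes an average-measures, absolute-agreement ICC from the two-way ANOVA mean squares (the form often written ICC(A,k)) and maps the value onto the reliability bands listed above. It is an illustration under assumptions, not the SPSS routine: confidence intervals are omitted, the band boundaries are treated as continuous, and the label for values at or below zero is a placeholder that is not defined in the text.

```python
from typing import Sequence


def icc_a_k(scores: Sequence[Sequence[float]]) -> float:
    """Average-measures, absolute-agreement ICC from two-way mean squares.

    `scores[i][j]` is the value of subject i in session j (e.g. test, retest).
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)


def reliability_label(icc: float) -> str:
    """Map an ICC onto the bands used in the text (boundaries made continuous)."""
    if icc > 0.80:
        return "excellent"
    if icc > 0.60:
        return "substantial"
    if icc > 0.40:
        return "moderate"
    if icc > 0.20:
        return "fair"
    if icc > 0.0:
        return "slight"
    return "unclassified"  # assumption: not defined in the text


# Hypothetical test/retest means (s) for three eyes, retest 1 s longer.
demo = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
print(round(icc_a_k(demo), 2), reliability_label(0.75))  # 0.8 substantial
```

In the demo the ranking of the eyes is perfectly preserved, yet the ICC is below 1 because absolute agreement penalises the systematic 1 s shift between sessions.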

Sample
The present part of the study did not require direct enrollment of participants and raised no ethical issues; therefore, the effect size of the experiment was determined using a post hoc procedure with the G*Power software (version 3.1.9.4) for a comparison between the means of two distributions by Wilcoxon test. From the distribution data (mean and SD) of automatic and manual NIBUT (first and overall measures) and the correlation between them, with a sample size of N = 85, the effect size was worked out. Fixing the α error at 0.05, the power (1-β) was 0.97 and 0.60 for the difference between the mean of the automatic measure and, respectively, the first manual NIBUTs (both observers) and the overall mean of all manual NIBUTs (both observers).
Thus, eighty-five videos of the NIBUT procedure previously performed with Sirius+ (CSO, Florence, Italy) were selected according to the following criteria:
- No blinking during the length of the recording
- The first break-up, detected by automatic assessment, should occur before 17 s (this limits the study to durations compatible with tear film instability, for which information about the difference between manual and automatic assessment is more useful)
- No areas grossly out of focus
- No missing fixation (due to movements of the eye or head)
- No gross irregularities of the tear film (e.g., mucus, air bubbles, etc.).
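The selection criteria above amount to a simple filter over the recorded videos. A minimal sketch follows; the `Recording` fields are hypothetical annotations invented for illustration, and only the 17 s limit comes from the text.

```python
from dataclasses import dataclass
from typing import List

MAX_FIRST_BREAK_UP_S = 17.0  # limit stated in the selection criteria


@dataclass
class Recording:
    """Hypothetical per-video annotations (not the device's data model)."""
    video_id: str
    blink_during_recording: bool
    automatic_first_break_up_s: float
    grossly_out_of_focus: bool
    fixation_lost: bool
    gross_tear_film_irregularity: bool


def is_eligible(rec: Recording) -> bool:
    """True when the recording satisfies all five selection criteria."""
    return (
        not rec.blink_during_recording
        and rec.automatic_first_break_up_s < MAX_FIRST_BREAK_UP_S
        and not rec.grossly_out_of_focus
        and not rec.fixation_lost
        and not rec.gross_tear_film_irregularity
    )


def select_videos(recordings: List[Recording]) -> List[str]:
    """Identifiers of the recordings that pass the filter."""
    return [rec.video_id for rec in recordings if is_eligible(rec)]


demo = [
    Recording("v01", False, 6.4, False, False, False),   # eligible
    Recording("v02", True, 5.0, False, False, False),    # blink
    Recording("v03", False, 21.3, False, False, False),  # break-up too late
]
print(select_videos(demo))  # ['v01']
```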

Procedure
A flow diagram of the study design is represented in Fig. 3. Two observers with different clinical experience were chosen to evaluate the videos and to investigate a possible influence of experience on the manual (subjective) assessment of NIBUT. Observer 1 was a researcher and an eye care practitioner with more than 20 years of clinical experience. Observer 2 was a recently graduated optometrist with less than one year of experience in clinical practice. The two observers assessed each single video (played in freeware software on the same laptop) in random order, measuring the NIBUT three times in a row (first session). Before proceeding with the evaluation of the videos, common instructions on what should be identified as 'break-up' were provided to both observers. They were required to play the video and stop it as soon as the first break-up (discontinuity or break in the image of the rings) appeared; the break-up time was recorded, and the video was rewound to the beginning to perform the other two measures. Observers repeated the assessment after 15 days (second session). The 85 videos were provided in random order (different from the one used in the first session) and without any information about the measures determined during the first session.
The same 85 videos were analysed by the automatic algorithm; the two observers were masked to the instrument results.

Data analysis
None of the data (first break-up time) used to assess the agreement between manual and automatic NIBUT measured by Sirius+ were normally distributed (Shapiro-Wilk test; p < 0.005).
Comparison between automatic (first break-up time) and manual measurements was performed by Wilcoxon test and Spearman correlation on the first manual measurement (mean of the first measure at the test session by the two observers), on the overall manual measurement (mean of all manual measures in both sessions) and on the mean manual measurement separately for each observer. The same statistical tests were used to compare the manual NIBUT of the two observers. Bonferroni adjustment was used to correct for multiple comparisons in the post-hoc analyses. The Spearman coefficient of correlation was also calculated for each pair of measurements.
Friedman ANOVA for repeated measures was used to evaluate differences in the three NIBUT assessments performed by the two observers in the two sessions. Intra-operator repeatability was calculated for each of the two observers using the same coefficients previously described in the data analysis of the first experiment.
Test-retest reliability (between the two sessions) was evaluated for each observer (mean of the three measures at test and mean of the three measures at retest) by ICC 39, as described in the data analysis of the first experiment, and by matched-pairs Wilcoxon test.
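The three manual summary measures defined above (first manual measurement, overall manual measurement, and per-observer mean) can be derived for a single video as in the following sketch. The nested dictionary layout, the session keys and the demo values are assumptions made for illustration.

```python
from statistics import mean
from typing import Dict, List

# measures[observer][session] holds the three consecutive manual NIBUTs (s).
Measures = Dict[str, Dict[str, List[float]]]


def summarise_manual(measures: Measures, first_session: str = "session1") -> Dict[str, object]:
    """Summaries of the manual NIBUT of one video, as defined in the text."""
    # First manual measurement: mean over observers of the first measure
    # taken in the first (test) session.
    first_manual = mean(obs[first_session][0] for obs in measures.values())
    # Overall manual measurement: mean of all measures, both sessions.
    overall = mean(
        x for obs in measures.values() for session in obs.values() for x in session
    )
    # Mean manual measurement, separately for each observer.
    per_observer = {
        name: mean(x for session in obs.values() for x in session)
        for name, obs in measures.items()
    }
    return {"first": first_manual, "overall": overall, "per_observer": per_observer}


demo = {
    "obs1": {"session1": [8.0, 7.5, 7.0], "session2": [7.0, 6.5, 6.0]},
    "obs2": {"session1": [7.0, 6.5, 6.0], "session2": [6.0, 6.0, 5.5]},
}
summary = summarise_manual(demo)
print(summary["first"], round(summary["overall"], 2))  # 7.5 6.58
```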

First experiment: agreement and repeatability of different BUT measurement procedures
In the right eye, BUT (average of test and retest ± SD) was 12.0 ± 7.6, 12.8 ± 6.8, 14.8 ± 8.0, and 8.7 ± 5.2 s with the EasyTear View+, Polaris, Sirius+, and fluorescein-based procedure, respectively (Fig. 4a). In the left eye, BUT was 12.0 ± 8.2, 14.1 ± 9.8, 15.6 ± 7.8, and 8.6 ± 5.0 s with the EasyTear View+, Polaris, Sirius+, and fluorescein-based procedure, respectively (Fig. 4b). Friedman's analysis of variance showed a significant difference between the four measures in both eyes (p < 0.001). Post-hoc testing among the four procedures is reported in Table 2 along with correlations. All paired comparisons with fBUT showed a significant difference for both eyes. Conversely, all paired comparisons between NIBUT procedures in the right eye were not significantly different, whereas in the left eye the comparisons between EasyTear View+ and the other two NIBUT procedures (Polaris and Sirius+) were significant, but the comparison between Polaris and Sirius+ was not. All correlations among procedures were significant (p < 0.001).
To investigate the relationship between invasive and non-invasive procedure, fBUT values were reported as a function of the three NIBUTs (Fig. 5).
Intra-observer repeatability for the four instruments, in the two sessions, was rather poor, as shown by the high values of CP, CR, and CV reported in Table 3.
The test-retest results are shown in Table 4, which reports the descriptive statistics of BUT, the ICC, and the p-values of the paired comparison. ICC was substantial (between 0.61 and 0.80) for the EasyTear View+ measures in both eyes, for the Polaris in the left eye, and for the Sirius+ and the fBUT in the right eye. For the Sirius+ and the fBUT in the left eye the ICC was moderate, and for the Polaris in the right eye it was fair 40. No test-retest difference was found for any procedure. Moreover, Bland-Altman plots of the test-retest measurements indicate a good agreement between the first and second measurement without any proportional bias (see Supplementary Figs. S1 and S2 online): all correlations (Spearman Rho) between the mean of test and retest and the retest-test difference were not significant.
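The proportional-bias check described above correlates the mean of test and retest with the retest-test difference. A self-contained sketch using only the Python standard library is given below; the helper names and the demo values are illustrative, the significance test of the correlation is not included, and the conventional 1.96 SD limits of agreement are an assumption about how the Bland-Altman plots were drawn.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for idx in order[i:j + 1]:
            ranks[idx] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation (undefined when either input is constant)."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bland_altman(test: Sequence[float], retest: Sequence[float]) -> Dict[str, object]:
    """Mean bias, 95% limits of agreement and proportional-bias correlation."""
    diffs = [r - t for t, r in zip(test, retest)]
    means = [(t + r) / 2 for t, r in zip(test, retest)]
    bias, sd = mean(diffs), stdev(diffs)
    return {
        "bias": bias,
        "limits": (bias - 1.96 * sd, bias + 1.96 * sd),
        "rho": spearman_rho(means, diffs),
    }


# Invented values (s): longer break-up times shorten more at retest.
test = [5.0, 8.0, 12.0, 15.0, 20.0]
retest = [5.5, 7.5, 11.0, 13.0, 17.0]
result = bland_altman(test, retest)
print(round(result["bias"], 2), round(result["rho"], 2))  # -1.2 -1.0
```

A correlation close to zero would indicate no proportional bias, whereas the strongly negative value in this invented example would indicate that longer break-up times are followed by disproportionately shorter retest values.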

Second experiment: agreement between manual and automatic NIBUT measured by Sirius+
The automatic NIBUT was (mean ± SD) 6.6 ± 3.6 s (range 1.2-16.9 s). The manual NIBUT was (mean ± SD) 7.7 ± 3.8 s (range 2.0-20.7 s) and 6.9 ± 3.5 s (range 2.0-18.1 s) for the first measurement (only first session) and the overall measurement, respectively. A statistically significant difference was found between the automatic NIBUT and both the first manual and the average manual measurement (Wilcoxon test; p < 0.001). Figure 6 shows the scatterplot between the automatic NIBUT and the two manual NIBUTs (first and overall average). The Pearson correlation coefficient calculated between automatic and manual NIBUTs was 0.89 (p < 0.001) and 0.90 (p < 0.001) for the first and the overall average manual measurements, respectively. NIBUT data obtained by the two observers in the two sessions are reported in Table 5, along with paired comparisons between the two observers for each measure, and paired comparisons between each manual NIBUT obtained by each observer and the automatic NIBUT. All NIBUTs were significantly different between the two observers, but all were significantly correlated (all Pearson correlations were higher than 0.85; p < 0.001). Friedman ANOVA for repeated measures showed a reduction in manual NIBUT across the 3 consecutive measurements both for observer 1 (p = 0.03) and observer 2 (p < 0.001) in the first session, as well as in the second session (p < 0.001 for both observers). All manual NIBUTs measured by observer 1 (except the second and third measures in the second session) were significantly longer than the automatic NIBUT (by between 0.3 and 1.6 s), whereas for observer 2 the difference was significant only for the first NIBUT in the first session (longer time), and for the second and third NIBUT and the average NIBUT in the second session (shorter time). However, all manual NIBUTs obtained by the two observers and the automatic NIBUT were strongly correlated (all Spearman Rho higher than 0.83; p < 0.001). Table 6 shows the statistical coefficients of intra-operator repeatability (among the three measures performed in a row in each session), separately for the two observers in the two sessions. The coefficients show good intra-operator repeatability in both observers.
Table 2. Paired comparisons (Wilcoxon test) and correlation (Spearman Rho) among the four procedures in the two eyes. *Significant comparisons (after Bonferroni correction for multiple comparisons, alpha was lowered to 0.008) and significant correlations are reported in bold.
Table 3. Coefficient of precision (CP), coefficient of repeatability (CR) and coefficient of variation (CV) for the measures with the four instruments/procedures in the first session (test) and in the second session (retest).

Procedure
Table 7 reports the descriptive statistics of manual NIBUTs obtained by the two observers and their average at test and retest, the ICC between test and retest measures, and the p values of the paired comparison between test and retest (Wilcoxon test). ICC was excellent (higher than 0.80) 40 for both observers. However, NIBUTs at retest were significantly shorter than at test for both observers (p < 0.001). Finally, Bland-Altman plots of the test-retest measurements (see Supplementary Fig. S3 online) show a proportional bias for observer 1 (Spearman Rho = −0.29; p = 0.008), indicating that the longer the NIBUT, the shorter the retest compared to test. No proportional bias was found for observer 2 (Spearman Rho = −0.04; p = 0.74).

Discussion
Two different experiments were carried out to evaluate the NIBUT assessment of Sirius+, a recently developed Placido-based topographer integrated with a Scheimpflug tomographer. Even though its clinical application has already been reported in the literature [41][42][43][44], no data about its level of agreement with other devices/procedures or about its repeatability are available. To clarify the discussion of the results obtained in the two experiments, the outcomes have been divided into specific paragraphs.

Agreement between NIBUT procedures and fBUT
The first part of the study showed that NIBUT was longer than fBUT, independently of the device employed, and this result is in agreement with the literature 12,15,[45][46][47]. However, elsewhere in the literature automatic NIBUT was also found to be shorter than fBUT 27,48. It has been proposed that the shorter fBUT might be induced by the instillation of fluorescein, which would reduce the stability of the tear film 15,19. When the amount of instilled fluorescein is reduced, the difference between NIBUT and fBUT decreases 15. However, it has also been found that increasing the delivered volume of fluorescein solution by the glass rod technique (micropipette) lengthened fBUT 47,49. In a recent paper, NIBUT measurements were carried out with Sirius+ with and without fluorescein; fluorescein caused a prolongation of the NIBUT, labelled as "de-naturation" of the tear film 50.
In the present study, a caveat regarding the difference between fBUT and NIBUTs might be the fact that the sequence of the measurements was not fully randomised: due to its invasiveness, fBUT was always carried out at the end. Despite washout intervals, this practice may have contributed to decreased tear film stability, increasing the difference between fBUT and NIBUTs.
Furthermore, another source of shorter times with fBUT might be the different area covered by fBUT and NIBUT assessments. In many participants, the shadow of the lashes on the superior area of the Placido rings (Fig. 1) made the measurement impossible in this area for both the manual and the automated assessment of NIBUT procedures. Moreover, the Placido rings were reflected only in a reduced area of the cornea (Fig. 1). This made the area covered by the fBUT procedure larger than that covered by the NIBUT procedures, so with the fBUT procedure it was possible to detect breaks in zones not covered by NIBUT procedures.

Agreement between NIBUT procedures
Looking at the NIBUT procedures, the first thing to highlight is that the subjective assessments of NIBUT with the EasyTear View+ and the Polaris are extremely close to the findings of Bandlitz et al. 30 (12.2 ± 6.6 s and 12.0 ± 6.4 s, respectively), who collected data with the same paradigm (two sessions on the same day) on individuals of very similar age (24.2 ± 3.6 years vs 23.1 ± 2.1 years in this work). The present study showed no difference between Polaris and Sirius+ in both eyes and between EasyTear View+ and Sirius+ in the right eye. However, a few comparisons displayed a statistical difference (see Table 2). This result is not easy to interpret. Considering that the three NIBUT procedures are non-invasive and based on a "concentric ring grid", the results might be expected to be similar, as reported for four NIBUT devices (EasyTear View+, Keratograph 5M, Polaris, and Tearscope Plus) 30. However, other studies evidenced a poor agreement between different NIBUT procedures 51,52. Furthermore, NIBUT values in the healthy population, measured by grids or Placido discs, have shown extreme variability, ranging between 10 and 50 s 19,22,23. Therefore, it should be considered that many factors could induce variability, such as the different age and ethnicity of the subjects assessed, the various sizes, brightness and coverage (e.g., due to corneal curvature) of Placido discs 21, and the fact that some instruments still require a manual (subjective) judgment 46. Earlier studies comparing automatic and manual NIBUTs consistently found differences 28,31, but variations in instrument features rather than the detection method (automatic vs manual) may have contributed to these differences. For example, the comparison between automated software to obtain a NIBUT with a topographer (Keratograph) and a manual NIBUT performed with the Keeler Tearscope showed a shorter time with the former 28. Also, Markoulli et al. found that the NIBUT of healthy people was significantly greater with the Tearscope-Plus (15.9 ± 10.7 s) than the NIBUT obtained with the Oculus Keratograph 5M (8.2 ± 3.5 s) 31. These results might be because these releases of the software were extremely sensitive to minimal changes in the projected rings (deformation). As for the difference between the two eyes (no difference in the right eye among the 3 procedures, and differences limited to EasyTear View+ and the two other NIBUTs in the left eye) the only difference in the procedure

Figure 1. Example on the same subject of the grids reflected by the tear film using EasyTear View+ (left), Polaris (centre) and Sirius+ (right).

Figure 2. Flow diagram of the study design. After enrollment, participants underwent the first set of NIBUT measurements (test); the same series of measurements was retaken (retest) on the same day, at least 2 h after the first set of measurements.

Figure 3. Flow diagram of the study design adopted in the second experiment. After the selection of the videos, the two observers assessed the videos separately in random order three times (first session). The assessment was repeated (retest) after 15 days.

Figure 4. Box and whisker plot of the BUT distribution with the four procedures in the right eye (a) and left eye (b). A significant difference was found among the four measures in both eyes (Friedman ANOVA; p < 0.001).

Figure 5. Scatterplot of the BUTs between fBUT and the three NIBUTs in the right eye (a) and left eye (b).

Figure 6. Scatterplot between the automatic NIBUT and the two manual NIBUTs analysed: first measure (grey circles and continuous grey regression line) and overall average (dotted circles and dotted black regression line).

Table 1. Inclusion criteria for subjects enrolled in the study.
- [...] lens wearers
- Absence of any known ocular pathology and not being subjected to refractive surgery or ocular drug treatment
- Absence of any known general pathology
- Not taking any ocular or systemic medication known to affect the ocular surface
- Not being in state of pregnancy
- Able and willing to adhere to any study instructions and complete all specified evaluation
- Read, indicate understanding of, and sign informed consent

Table 5. Descriptive statistics (Mean ± SD and range) of NIBUT (s) manually measured by the two observers (Obs 1 and Obs 2) in the two sessions (N = 85). Paired comparisons between observers for each manual measure (Wilcoxon test in fifth row) and correlation (sixth row), as well as paired comparisons between automatic and each manual measure obtained by the two observers (Wilcoxon test, tenth and eleventh row) and correlation (Spearman Rho; twelfth and thirteenth row) are also reported. *Significant comparisons (after Bonferroni correction for multiple comparisons, alpha was lowered to 0.017) and significant correlations are reported in bold.

Table 6. Coefficient of precision (CP), coefficient of repeatability (CR) and coefficient of variation (CV) for the manual measures of NIBUT performed by observer 1 and observer 2 in the first session (test) and in the second session (retest).