Evaluation of the precision of contrast sensitivity function assessment on a tablet device

The contrast sensitivity function (CSF) relates the visibility of a spatial pattern to both its size and contrast, and is therefore a more comprehensive assessment of visual function than acuity, which only determines the smallest resolvable pattern size. Because of the additional dimension of contrast, estimating the CSF can be more time-consuming. Here, we compare two methods for rapid assessment of the CSF that were implemented on a tablet device. For a single-trial assessment, we asked 63 myopes and 38 emmetropes to tap the peak of a “sweep grating” on the tablet’s touch screen. For a more precise assessment, subjects performed 50 trials of the quick CSF method in a 10-AFC letter recognition task. Tests were performed with and without optical correction, and in monocular and binocular conditions; one condition was measured twice to assess repeatability. Results show that both methods are highly correlated; using both common and novel measures for test-retest repeatability, however, the quick CSF delivers more precision with testing times of under three minutes. Further analyses show how a population prior can improve convergence rate of the quick CSF, and how the multi-dimensional output of the quick CSF can provide greater precision than scalar outcome measures.

exceed the dynamic range of consumer electronics displays, potentially leading to ceiling effects in vision tests 17,18; often, cathode-ray tubes and custom electronic circuits are still used in the laboratory to avoid such effects 19.
However, advances in computational power and display technology have opened avenues for novel methods of visual assessment to overcome these issues. For example, the quick CSF method 20,21 uses a computationally expensive algorithm and optimizes stimulus selection by computing the expected information gain over a very large set of possible stimuli and the probability distribution of possible CSFs, given the history of previous trials. Specifically, the human contrast sensitivity function can be described by four parameters, namely peak sensitivity, peak frequency, bandwidth, and low-spatial-frequency truncation 22,23. The quick CSF method exploits this observation and computes the probability distribution over possible CSFs using Bayes' theorem, P(M|D) ∝ P(D|M)P(M), with M a four-dimensional tuple describing a CSF and D the set of trials. The range of different CSFs, each described by a different M, spans the entire gamut from very low to excellent vision, and initially (before the first trial, i.e. before anything is known about the subject), the algorithm assumes the same probability for each of these CSFs; alternatively, other sources of information, such as knowledge about the distribution of CSFs in a particular subject population, can be used to model a more informative initial probability distribution, or prior. For each trial, the algorithm then chooses the combination of stimulus frequency and contrast that maximizes the expected information gain about this CSF distribution. For example, the very first trial should not be trivially easy (as in acuity charts that start with the largest letter), because the outcome (a correct response) is so likely for almost every subject (and almost every possible CSF) that it provides little information about this particular subject and the relative likelihood of different CSFs.
Instead, the algorithm typically chooses a medium-sized, medium-contrast stimulus for the first trial (stimulus selection is stochastic in order to prevent getting stuck in local optima), and the probability distribution over all CSFs is updated according to the subject's response. For example, an incorrect response to such a medium-difficulty stimulus would make CSFs that represent excellent vision less likely, and thus the stimulus for the subsequent trial should not be difficult. Over the time course of the test, the quick CSF repeats these steps to home in on stimuli that are near the subject's threshold and that provide the most information about the probability space of different CSFs. This means that a high resolution in the stimulus space (many different contrast levels and spatial frequencies) is useful because it allows the algorithm to adapt more finely to the current subject; however, not all possible stimuli need to be presented during the experiment. Finally, a sample is taken from the probability distribution of CSF models M and the median sensitivity over this sample can be computed for any spatial frequency based on the model parameters.
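The update-and-select loop described above can be sketched in a few dozen lines. The following is a toy reimplementation, not the actual quick CSF code: the model grid is deliberately coarse, the Weibull-style psychometric function and its parameters (guess rate, lapse rate, slope) are illustrative, and stimulus selection is deterministic (argmax) rather than stochastic as in the real method.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(7)

def csf_log_sens(f, gain, f0, bw, delta):
    """Truncated log-parabola CSF: log10 sensitivity at frequency f (cpd).
    Four parameters: peak gain, peak frequency, bandwidth (octaves),
    and low-SF truncation (log10 units below the peak)."""
    logs = np.log10(gain) - np.log10(2) * ((np.log10(f) - np.log10(f0)) / (bw / 2)) ** 2
    plateau = np.log10(gain) - delta
    return np.where((f < f0) & (logs < plateau), plateau, logs)

# Coarse model grid (the real quick CSF uses a much finer discretization)
gain_grid = np.logspace(1.0, 2.6, 6)   # peak sensitivity
f0_grid = np.logspace(-0.3, 1.0, 6)    # peak frequency, cpd
models = list(product(gain_grid, f0_grid, [2.0, 3.0, 4.0], [0.2, 0.5, 1.0]))

# Candidate stimuli: (spatial frequency, contrast) pairs
stimuli = list(product(np.logspace(np.log10(0.64), np.log10(41.0), 12),
                       np.logspace(np.log10(0.002), 0.0, 12)))

def p_correct(f, c, model, guess=0.1, lapse=0.04, slope=2.5):
    """Illustrative Weibull psychometric function for a 10-AFC task."""
    thr = 10.0 ** -csf_log_sens(f, *model)
    p = 1 - (1 - guess) * np.exp(-(c / thr) ** slope)
    return (1 - lapse) * p + lapse * guess

# Likelihood of a correct response for every (stimulus, model) pair
L = np.array([[p_correct(f, c, m) for m in models] for f, c in stimuli])

def entropy(p):
    p = p[p > 1e-12]
    return -np.sum(p * np.log2(p))

posterior = np.full(len(models), 1.0 / len(models))  # uniform prior
truth = (150.0, 2.5, 3.0, 0.5)                       # simulated observer

for trial in range(25):
    p_yes = L @ posterior                            # predicted P(correct) per stimulus
    info_gain = np.empty(len(stimuli))
    for s in range(len(stimuli)):
        post_yes = posterior * L[s] / p_yes[s]       # Bayes update if correct
        post_no = posterior * (1 - L[s]) / (1 - p_yes[s])
        expected_h = p_yes[s] * entropy(post_yes) + (1 - p_yes[s]) * entropy(post_no)
        info_gain[s] = entropy(posterior) - expected_h
    s = int(np.argmax(info_gain))                    # real method picks stochastically
    f, c = stimuli[s]
    correct = rng.random() < p_correct(f, c, truth)  # simulate the observer
    posterior = posterior * (L[s] if correct else 1 - L[s])
    posterior /= posterior.sum()
```

After the loop, the posterior is concentrated on models resembling the simulated observer's CSF, and its entropy is far below the initial uniform value; summary statistics such as median sensitivity per frequency can then be sampled from it.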
We further extended our tablet-based implementation 24 to use bandpass-filtered Sloan letters instead of gratings. The increase in the number of possible responses (ten instead of two) reduces the guessing rate and thus increases statistical efficiency 21. In the present paper, we evaluate the quick CSF method in a population of healthy observers, and we further investigate the effect of choosing different priors on its robustness. Implemented on an iPad 24, our system allowed rapid testing (median testing time less than three minutes, including entering subject details) and yet precise assessment of the whole CSF.
In order to establish a baseline for how much information can be captured ultra-rapidly, in a single trial, we also collected a single-trial assessment of the peak of the CSF: using the iPad's touch screen, observers were asked to indicate the point of highest sensitivity on a "sweep grating" picture that varies spatial frequency and contrast along the two image dimensions. We analyze the relationship between quick CSF results on the one hand and single-trial assessment of contrast sensitivity on the other hand, and also compare the test-retest variability of both assessments, using a method from information retrieval that is less prone to artefact than commonly used test-retest measures.
A necessary requirement for a precise test is that two measurements of the same true value return similar results; in other words, a precise test should be highly repeatable. However, this requirement is not sufficient, as repeatable tests are not necessarily precise 25. Commonly, test-retest variability is evaluated by the Bland-Altman Coefficient of Repeatability 26; despite its shortcomings 27, the intra-class correlation coefficient is also often reported.
We instead propose to assess test-retest variability based on concepts from information retrieval, namely by a measure we call Fractional Rank Precision (FRP) 28. Intuitively speaking, we want to identify the test-retest pair of measurements for a subject, given only the test measurement for this subject and the set of retests for all subjects. If a subject's retest score is the same as their test score, and none of the other subjects have the same retest score, this subject's retest is uniquely identified by the test, and we assign a precision of 1.0. Conversely, if a hypothetical retest just flipped the sign of the test, identification would be very poor, and we assign a precision of 0.0 (or, for a finite number N of subjects, 1/N). Generally, each subject is assigned the fractional precision 1 − (rank − 1)/N of their individual retest value when all subjects' retest values are sorted by their similarity (e.g. Euclidean distance) to the individual test. We repeat this for each subject, and FRP then is the average of all subjects' precisions. If test and retest scores are distributed randomly, the expected FRP will be 0.5. Notably, FRP also takes into account that identical test outcomes for different subjects, e.g. due to large step sizes, reduce the ability to identify the test-retest pair. For example, in typical healthy, best-corrected cohorts, almost all subjects have one of very few different logMAR scores at or near 20/20. In the extreme, a very coarsely resolved test, such as light perception in a sighted cohort (all retest scores with the same distance to each test score), would yield an FRP of 0.5, i.e. chance performance.
As an FRP example, imagine three subjects who have test scores of 1.0, 1.5, and 1.2, and retest scores of 1.0, 1.3, and 1.4, respectively. The absolute differences of the three retest scores to the test score (1.0) of the first subject are (0.0, 0.3, 0.4) and thus the first subject's retest is closest to its test score, i.e. has rank 1. For the test score of the second subject (1.5), the absolute differences are (0.5, 0.2, 0.1) and thus the second subject's retest has rank 2. For the third subject (test score 1.2), the absolute differences are (0.2, 0.1, 0.2), and its retest has rank 2.5 (because first and third subjects' tests tie for rank 2). The Fractional Rank Precision therefore is 1/3 × (1 + 2/3 + 1/2) = 0.72. This formulation has the benefit that it expresses test-retest variability in terms of inter-subject variability, without resorting to absolute values that may make it difficult to compare the repeatability of different tests with different absolute score ranges. A further benefit is the penalty for test scores that are quantized at the cost of precision 25 : a test that returns only one score may have perfect repeatability, but fails to discriminate between different subjects (and, by extension, changes due to disease progression or treatment effects).
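The three-subject example above can be reproduced with a short function; this is a minimal sketch of scalar FRP with averaged ranks for ties, not the authors' implementation:

```python
import numpy as np

def fractional_rank_precision(test, retest):
    """FRP for scalar scores: for each subject, rank all retests by
    absolute distance to that subject's test score; ties get averaged
    ranks. Each subject's precision is 1 - (rank - 1)/N."""
    test = np.asarray(test, dtype=float)
    retest = np.asarray(retest, dtype=float)
    n = len(test)
    precisions = []
    for i in range(n):
        d = np.abs(retest - test[i])
        # average rank of subject i's own retest among all retests
        rank = np.sum(d < d[i]) + (np.sum(np.isclose(d, d[i])) + 1) / 2
        precisions.append(1 - (rank - 1) / n)
    return float(np.mean(precisions))

# the worked example from the text: FRP = 1/3 * (1 + 2/3 + 1/2) ≈ 0.72
frp = fractional_rank_precision([1.0, 1.5, 1.2], [1.0, 1.3, 1.4])
```

Note how the tie between the first and third subjects' distances (both 0.2) to the third test score yields the averaged rank 2.5, exactly as in the text.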

Results
Single-trial assessment. The two panels in Fig. 1 show the relationship of peak frequency and sensitivity as obtained by single-trial assessment on the one hand and the AULCSF of the quick CSF (after rescoring with the population prior) on the other hand. Solid lines indicate univariate linear regression and shaded regions indicate 95% confidence intervals. Both dimensions of the single-trial assessment are correlated with AULCSF (p ≪ 0.01; R² = 0.34 for peak sensitivity and R² = 0.40 for peak frequency); multivariate regression explains more of the variance (p ≪ 0.01, R² = 0.54) with no statistically significant interaction between peak sensitivity and peak frequency (p = 0.59).
Bland-Altman Coefficients of Repeatability were 0.497 and 0.476 log10 units for peak sensitivity and frequency, respectively. Fractional Rank Precision for both features was 0.715 and 0.681, respectively; combining the two features and performing a nearest-neighbour search improved FRP to 0.769.

Test times.
Test times for the quick CSF are shown in Fig. 2. These test times were computed by comparing the time stamps of condition onsets; notably, they therefore include the time between conditions during which the tablet changed back and forth between subject and experimenter, as well as data entry of subject and condition details, re-adjusting the viewing distance, and recording both the single-trial assessment and the quick CSF assessment (50 trials). The minimum condition interval measured in this way was 129 s; median, mean, and standard deviation were 175, 190, and 52 s, respectively, computed over all subjects. Figure 2 shows test times as a function of the temporal order of conditions; test times become shorter with practice.
Effect of population prior. Figure 3 shows the effect of choosing a different prior over the CSF search space on the Fractional Rank Precision of the summary statistic AULCSF, plotted as a function of trial number. Both the uniform and the population prior lead to very similar FRP results over the first few trials, as only very little information about the true CSF is available. For the range between 10-30 trials, however, the analysis that was initialized with the population prior shows better convergence, as physiologically implausible parameter combinations are already suppressed. For longer test sessions, the difference between FRP for the initial priors diminishes as both approaches converge (after 50 trials, FRP of 0.867 and 0.864, respectively).

Scalar test-retest variability. Sensitivity at 41 cpd had the lowest (best) CoR (0.17) of all features, but also the lowest (worst) FRP (0.6), whereas AULCSF had the best FRP, but a medium CoR (0.87 and 0.24, respectively). Along the horizontal axis, the Bland-Altman plots show that AULCSF has a much wider range than sensitivity at 41 cpd. The latter suffers from floor effects, and thus has little predictive power despite a seemingly better repeatability as measured by CoR. Figure 6 plots the absolute test-retest difference as a function of visual performance, using the AULCSF metric. Linear regression shows no significant effect of visual function (p = 0.46, R² = 0.01); in other words, subjects with poor vision exhibit similar test-retest variability as subjects with excellent vision.

Multi-dimensional test-retest variability. Fractional Rank Precision values for different features of the quick CSF are shown in Fig. 7, with dashed lines indicating chance level (0.5) and FRP for single-trial assessment (0.769). Among scalar features, the AULCSF yields higher FRP than CSF acuity (after 50 trials, FRP of 0.867 and 0.840, respectively) as it summarizes over a large range of spatial frequencies. Bland-Altman Coefficients of Repeatability were 0.238 and 0.2 log10 units, respectively. Sensitivity at 1.5 cpd, in contrast, performed little better than single-trial assessment. Multi-dimensional features, however, identify test-retest pairs with greater precision. The combination of CSF acuity and the sensitivity near the presumed peak of the CSF (1.5 cpd) already yields an FRP of 0.886; adding the summary statistic AULCSF and an additional mid-frequency feature (6 cpd) results in the highest FRP of 0.902. Notably, FRP seems not to have reached ceiling after 50 trials for the multi-dimensional features, i.e. even more precision may be obtained by running the quick CSF method for more trials.

Discussion
Contrast sensitivity has been recognized as a more informative outcome measure than visual acuity, the currently prevalent measure of visual function 4,5,29-32. However, precise assessment of the two-dimensional CSF has so far been too time-consuming to be employed in regular clinical care; paper charts, which are limited in stimulus resolution and scoring complexity, use shortcuts that limit precision 17,33-35.
A multitude of studies have investigated the repeatability of established contrast sensitivity charts as a proxy for precision. Absolute numbers are not necessarily comparable across studies because of different scoring rules 15, but more importantly also because of an often observed greater test-retest variability in cohorts with lower visual performance 36-38; for our test, we did not observe such heteroscedasticity (Fig. 6). Published CoR ranges for the Pelli-Robson chart include 0.13-0.21 15 and 0.14-0.41 37,38; for sweep-grating measures, reported values are around 0.2 log10 units for peak sensitivity and 0.1 log10 units for peak spatial frequency based on a finger trace of a sweep grating (measured for one observer) 43, and a range of 0.26-0.44 log10 sensitivity for individual spatial frequencies between 1.5 and 18 cpd 44. The latter numbers are roughly comparable to the CoRs for individual frequencies observed in the present manuscript (0.17-0.39 log10 units). However, we demonstrated that the widely used Bland-Altman Coefficient of Repeatability is susceptible to scaling artefacts, and thus proposed to use Fractional Rank Precision instead to compare the precision of different tests. Unlike CoR, FRP can also directly be used to compare tests that have outputs of different dimensionality. As a caveat, however, we note that FRP depends on the variability of the underlying test population, and thus FRP results cannot be compared directly across studies.

In our present study, we used two different measures of contrast sensitivity that have very short testing times, and evaluated their validity in a cohort of healthy observers with varying refractive error. The first measure required subjects to tap a tablet device only once, and therefore probably constitutes the lower limit of testing time that is practically achievable.
Recently, other groups have employed a similar technique to assess contrast sensitivity 43-46; these authors asked their subjects to trace out the boundary between visible and invisible patterns over a range of spatial frequencies and thus collected potentially more information than just the peak of the CSF; our single-tap measure was substantially more variable than Mulligan's finger-tracing 43. However, the precision of sweeping hand or finger motions is still limited and, more importantly, the subjective decision of where to place the perceived boundary is likely vulnerable to criterion shifts; for these reasons, forced-choice methods are the preferred standard in psychophysical testing 47.
Despite these fundamental problems, single-trial assessment correlated reasonably well with quick CSF assessment, with an R² > 0.5 and a test-retest precision that corresponds to about 10-15 trials of the quick CSF. However, the Bland-Altman CoR was more than twice as high for peak-tapping as for the quick CSF after 50 trials, and the quick CSF in principle could be run for more trials to achieve even greater precision. Furthermore, it should be noted that the position of the sweep grating was not randomly shifted before each test session, so that subjects might have remembered (and reproduced) the location they had tapped on the screen in previous sessions 43. Ultimately, single-trial assessment might thus serve as an ultra-rapid screening tool, but lacks the precision to track subtle changes in vision due to disease progression or treatment effects.
One potential use of such an ultra-rapid screening tool might be the initialization of the quick CSF method with a subject-specific prior. We demonstrated (in Fig. 3) that the use of a population prior over the CSF parameter space already led to faster convergence than a uniform prior; furthermore, it is important to note that we used the population prior only during a re-scoring run of the algorithm. In this case, stimulus selection was fixed and determined by the output of the quick CSF method during actual data collection, so that the rate of convergence for a more informative prior might even be underestimated. However, as discussed above, robust test-retest repeatability might come at the expense of precision to detect change, and further experiments using well-controlled changes to visual function are therefore needed to determine the optimal shape of a prior that avoids overfitting.
Kim et al. 48 recently developed a Hierarchical Adaptive Design Optimization (HADO) procedure that achieves greater accuracy and efficiency in adaptive information gain, by exploiting two complementary schemes of inference (with past and future data). HADO extends the standalone quick CSF to a framework that models a higher-level structure across the population, which can be used as an informative prior for each new assessment. In turn, the parameter estimates from each individual enable the update of the higher-level structure. The judicious application of informative priors used by HADO improves the quick CSF efficiency by ≈ 30%. In future research, we will apply the HADO procedure as a more mathematically rigorous method to utilize population statistics to improve the efficiency and precision of quick CSF.
The quick CSF method estimates sensitivities for a wide range of spatial frequencies. The AULCSF already yields a useful summary statistic 49-54, but by definition cannot exploit the full benefit of assessing the whole CSF, for example to disambiguate whether a disease might shift the CSF downwards (lower sensitivities) or to the left (lower peak frequency). Going beyond scalar descriptors, the combination of acuity and peak sensitivity has therefore been proposed to describe the CSF 55. However, the frequency for which sensitivity is highest cannot be known a priori, and the peak therefore must be approximated by a low-SF sensitivity. Large-scale modelling has also shown that two parameters are not enough to accurately model the CSF 23. Such a two-point measurement must assume a fixed shape of the CSF (e.g. ref. 56), which may be a reasonable first-order approximation, but likely does not hold across changes in optical correction, disease status, eccentricity 57, age 22,58, or illumination 44,59, nor for monocular vs. binocular testing and during development 53,60.
We performed FRP analysis on higher-dimensional combinations of CSF features that might capture CSF variation more comprehensively. To this end, we used a simple nearest-neighbour approach with unit-normalized feature scales, and indeed FRP scores improved over those for scalar features such as AULCSF or CSF acuity; we note that more sophisticated Machine Learning techniques that differentially weight the individual CSF features might improve FRP scores even further. In our present analysis, best results were achieved with a combination of low-, medium-, and high-SF sensitivities (1.5 and 6 cpd and CSF acuity) and the summary statistic AULCSF; notably, this four-dimensional combination also performed better than the combination of (near-peak) sensitivity at 1.5 cpd and CSF acuity that would correspond to a two-point measurement.
In summary, we here showed that rapid assessment of the whole contrast sensitivity function on a portable device is feasible, and that the quick CSF method delivers highly precise results in testing times of less than three minutes. This efficiency can be further improved by an informed population prior or even further by the use of an ultra-rapid, single-trial prior assessment. Ultimately, the real value of a vision test is shown by its ability to detect small changes in visual function due to disease progression or treatment, and further studies are needed to evaluate the sensitivity of the quick CSF.

Methods
Data collection. Experimental data were collected from 101 subjects (aged 14-75 years with a mean of 22) who were recruited among students and faculty of the Institute for Psychology and Behavior, Jilin University of Finance and Economics, Changchun, Jilin Province, China. Subjects gave informed consent, and the experimental design followed the principles of the Declaration of Helsinki and was approved by the local Ethics Committee (protocol IPBWH1101). Based on self-reported optical refraction, 63 subjects were myopes (mean optical correction −3.9 D, s.d. 1.6 D) and 38 subjects were emmetropes (no optical correction).
Subjects were tested in all applicable combinations of {right eye, left eye, both eyes} and {without correction, with correction}. To assess repeatability of our device, one randomly chosen condition was then repeated at the end of the session. While repeat measures taken over a longer time interval may be closer to clinical practice, intra-session repeatability measurement excludes subject variability over longer time scales and is thus a common method to assess the measurement itself 34,38,40,43,61,62 .
Overall, 593 data sets were collected; these are publicly available at http://www.michaeldorr.de/quickcsf. During the experiment, subjects were seated and held an iPad 4 in portrait orientation at a viewing distance of 60 cm. Mean screen luminance was set to 185 cd/m². Prior to each tested condition, the single-trial assessment was recorded; to this end, subjects were shown a sweep grating as in Fig. 8 and asked to touch the peak of the perceived "mountain" that is created by the transition from visible to invisible contrast. The grating contrast varied vertically from 0.2% at the top of the screen to 100% at the bottom, and spatial frequency varied horizontally from 0.29 cpd at the left side of the screen to 19.2 cpd at the right.
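A sweep grating of this kind can be generated as below. This is a sketch under stated assumptions: the pixel dimensions, the 12-degree display width, and the log-linear spacing of contrast and frequency are our illustrative choices, not the exact parameters of the study app; only the frequency and contrast ranges are taken from the text.

```python
import numpy as np

def sweep_grating(width=768, height=1024, f_range=(0.29, 19.2),
                  c_range=(0.002, 1.0), deg_width=12.0):
    """Campbell-Robson-style sweep grating as a float image in [0, 1].
    Frequency increases log-linearly left to right, contrast increases
    log-linearly top (0.2%) to bottom (100%)."""
    freq = np.logspace(np.log10(f_range[0]), np.log10(f_range[1]), width)   # cpd
    contrast = np.logspace(np.log10(c_range[0]), np.log10(c_range[1]), height)
    # integrate the local frequency to get a continuous phase across columns
    deg_per_px = deg_width / width
    phase = 2 * np.pi * np.cumsum(freq) * deg_per_px
    grating = np.sin(phase)[np.newaxis, :] * contrast[:, np.newaxis]
    return 0.5 + 0.5 * grating   # modulate around mid-grey

img = sweep_grating()
```

The visible region of such an image forms the "mountain" whose peak subjects were asked to tap.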
For each of 50 trials, one of 11520 possible stimuli (10 bandpass-filtered Sloan letters, peak frequency 4.5 cycles per letter; 24 spatial frequencies from 0.64 to 41 cycles per degree; 48 contrast levels from 0.2 to 100%) was selected by the quick CSF method. The stimulus was then briefly presented (500 ms), followed by an array of letters from which subjects had to choose their response on the touch screen.
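The 11520-stimulus grid can be enumerated directly from the counts and ranges given above; note that log-spacing of the frequency and contrast levels and the particular Sloan letter set shown here are our assumptions (the text states only the ranges and counts).

```python
import numpy as np
from itertools import product

letters = list("CDHKNORSVZ")  # the standard 10-letter Sloan set (assumed here)
freqs = np.logspace(np.log10(0.64), np.log10(41.0), 24)    # cpd, log-spaced
contrasts = np.logspace(np.log10(0.002), np.log10(1.0), 48)  # 0.2% .. 100%

# full stimulus space the adaptive algorithm selects from
stimuli = list(product(letters, freqs, contrasts))
```

A fine stimulus grid is useful because, as noted in the Introduction, the algorithm adapts to each subject within this space even though only 50 of the 11520 stimuli are ever shown.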
Statistical analysis. Selection of prior for the quick CSF method. The quick CSF uses Bayesian inference to update its belief (i.e. the best estimate of the true CSF) after each trial. Before the first trial, however, no data have been collected and the initial belief has to be described by a prior probability distribution over the parameter
space. Such a prior could be derived from previous knowledge, e.g. the statistics of visual function in the general population or an individual's earlier test results. In our experiment, however, we approached data collection in the most principled way and chose a uniform distribution over the parameter space. While this makes the fewest assumptions, a drawback of this approach is that physiologically implausible or even impossible parameter combinations, such as a very high peak frequency together with a very low bandwidth, are initially as likely as more plausible combinations, and thus convergence may be slower than theoretically possible. In order to investigate this effect after data collection had concluded, we re-scored all subjects' responses with an additional run of the algorithm per test session. For this analysis, the posterior over the whole test set (593 data sets) was used as the prior, and we report results both for this 'population' prior and the original 'uniform' prior.
Test-retest variability of scalar features. From the final estimates of the CSFs, we computed threshold sensitivity for each of the 24 spatial frequencies that corresponded to our stimulus set (note that not all stimuli were necessarily shown to each observer). For a broad summary statistic, we also computed the area under the log CSF (AULCSF) in the spatial frequency range from 1.5 to 18 cpd. For these 25 features, we computed Coefficients of Repeatability (1.96 times the standard deviation of test-retest differences) and the Fractional Rank Precision metric (FRP): for each subject's test feature f_subject, we sorted all N observers' retest values f_n by their distance to f_subject, and assigned a rank r(observer_n) accordingly. The fractional rank for one subject then was (N − (r(subject) − 1))/N, and overall FRP was the average of the fractional ranks over all observers (each observer being the "subject" once). For increased robustness, FRP analysis was run in both directions and results were averaged, i.e. all retests were ranked by their distance to each test once, and all tests were ranked by their distance to each retest once.
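The two scalar repeatability measures can be sketched as follows. The CoR definition matches the one stated above; for the AULCSF, integration over log10 spatial frequency is our assumption, as the text does not name the integration variable explicitly.

```python
import numpy as np

def coefficient_of_repeatability(test, retest):
    """Bland-Altman CoR: 1.96 times the SD of test-retest differences."""
    diffs = np.asarray(test, dtype=float) - np.asarray(retest, dtype=float)
    return 1.96 * np.std(diffs, ddof=1)

def aulcsf(freqs, log_sens, lo=1.5, hi=18.0):
    """Area under the log CSF between 1.5 and 18 cpd, integrated over
    log10 spatial frequency (trapezoid rule on an interpolated grid)."""
    freqs = np.asarray(freqs, dtype=float)       # ascending, in cpd
    log_sens = np.asarray(log_sens, dtype=float)  # log10 sensitivity
    grid = np.logspace(np.log10(lo), np.log10(hi), 200)
    x = np.log10(grid)
    y = np.interp(x, np.log10(freqs), log_sens)
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))

# a flat CSF at log10 sensitivity 2.0 has AULCSF = 2 * log10(18/1.5)
area = aulcsf([0.64, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0], [2.0] * 7)
```

For a flat log CSF, the area reduces to height times the width of the integration window in log10 units, which is a convenient sanity check.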
Test-retest variability of multi-dimensional features. By definition, the CSF is a multi-dimensional function. While the AULCSF can already provide a very useful summary statistic in many cases, higher-dimensional descriptions of the whole CSF may more precisely capture changes in the CSF and thus track an individual's performance. We therefore performed FRP analysis on multi-dimensional feature vectors and performed a nearest-neighbour search to identify possible test-retest pairs.
In order to keep the number of possible dimensions small, we did not use all 25 features from above. Instead, we computed the following features on the CSF: i) AULCSF; ii) CSF acuity, the spatial frequency for which contrast threshold reaches 100%; iii) sensitivities for six spatial frequencies as mandated by the FDA 63 , namely 1.0, 1.5, 3, 6, 12, and 18 cpd. In order to reduce the impact of different numerical ranges, these features were standardized to zero mean and unit standard deviation, and we averaged the results of running our analysis in both directions again. For results shown in Fig. 7, we present a selection of individual features and their combinations.
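The multi-dimensional analysis described above can be sketched as below: features are standardized and test-retest pairs are ranked by Euclidean (nearest-neighbour) distance. For brevity this sketch runs one direction only and standardizes over the pooled test and retest values; these details are our simplifications, and the paper additionally averages both ranking directions.

```python
import numpy as np

def multidim_frp(test_feats, retest_feats):
    """FRP for multi-dimensional feature vectors (one direction only).
    Assumes each feature column has non-zero variance."""
    X = np.asarray(test_feats, dtype=float)    # shape (N subjects, d features)
    Y = np.asarray(retest_feats, dtype=float)
    n = len(X)
    # standardize each feature to comparable numerical ranges
    pooled = np.vstack([X, Y])
    mu, sd = pooled.mean(axis=0), pooled.std(axis=0)
    X, Y = (X - mu) / sd, (Y - mu) / sd
    precisions = []
    for i in range(n):
        d = np.linalg.norm(Y - X[i], axis=1)   # nearest-neighbour distances
        rank = np.sum(d < d[i]) + (np.sum(np.isclose(d, d[i])) + 1) / 2
        precisions.append(1 - (rank - 1) / n)
    return float(np.mean(precisions))

# perfectly repeatable, well-separated subjects yield FRP = 1.0
frp = multidim_frp([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                   [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

More sophisticated weighting of the individual CSF features, as mentioned in the Discussion, would replace the plain Euclidean distance here.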