Introduction

Instruments to record measurements (eg intraocular pressure, corneal thickness, axial length) should only be used if we know they are reliable. In addition, instruments developed for newly quantifiable measurements (eg posterior capsular opacification,1, 2, 3 ocular blood flow4, 5, 6) must also be shown to be reliable before they can be applied in either clinical or research settings.7 Reliability means that the measurements that the instrument records are reproducible at different time intervals (test–retest reliability) and that the observers making the measurements produce repeatable results, both for the same observer over a period of time (intraobserver reliability) and between different observers on the same subject (interobserver reliability).8, 9, 10, 11, 12, 13, 14, 15 In addition, reliability is used in the context of assessing agreement between one method of measurement and another (method comparison or parallel reliability). Thus, reliability (as well as sensitivity and specificity) is a prerequisite to using any instrument of measurement and forms a major component of ophthalmic research.16

However, techniques of data analysis employed in studies assessing reliability in the ophthalmic literature vary tremendously17, 18, 19, 20, 21, 22 and studies often use techniques that are inappropriate for the task they are set.17, 23, 24, 25, 26, 27, 28 In this paper, we review current statistical techniques employed in reliability/agreement studies and provide a framework to help the ophthalmologist decide on the most appropriate statistical method.

Continuous vs categorical data

Statistical techniques for agreement studies depend on whether data are continuous (derived from a possible range of values or an underlying continuum) or categorical.

Analyses of continuous data in agreement studies

Correlation

Correlation is a very commonly used technique for assessing the level of agreement, but it is inappropriate because it measures association, not agreement. A highly significant and large value of the correlation coefficient (r) can coexist with gross bias.17, 23, 25, 26, 29, 30, 31 For example, when comparing the performance of two observers, observer A may consistently overestimate the result compared with observer B (Figure 1). A highly significant value of r would still be achieved, and this could be misinterpreted as revealing good agreement between the two observers. Inspection of Figure 1 reveals a fixed systematic bias between observers A and B, that is, observer A consistently records a higher reading by a fixed amount that does not vary with the size of the reading measured.

Figure 1

Scatterplot diagram of the results of axial length measurements using B-scan ultrasound from observers A and B. The dotted line represents the ordinary least squares (OLS) regression line (r=0.98, P<0.0001). The solid line represents the line of equality (y=x).
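
To make the distinction between association and agreement concrete, the short sketch below (a Python illustration assuming numpy and scipy are available) uses invented axial length values mirroring the scenario of Figure 1: the Pearson r is essentially perfect even though one observer is consistently biased. The readings and the 0.3 mm offset are purely illustrative assumptions, not data from the figure.

```python
# Illustrative sketch only: invented axial length readings where observer A
# reads a fixed 0.3 mm higher than observer B.
import numpy as np
from scipy.stats import pearsonr

observer_b = np.array([21.8, 22.4, 23.1, 23.9, 24.6, 25.3])  # axial length (mm)
observer_a = observer_b + 0.3                                 # fixed systematic bias

r, p = pearsonr(observer_a, observer_b)
print(f"Pearson r = {r:.3f}, P = {p:.2g}")                    # r = 1.00: perfect association
print(f"Mean difference (A - B) = {(observer_a - observer_b).mean():.2f} mm")  # yet a constant bias
```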

The intraclass correlation coefficient (ICC) (the ratio of the between-subjects variance to the total variance32) is another correlation statistic often used to assess agreement. The ICC varies from +1 (perfect agreement) to 0 (no agreement). The ICC is designed to assess agreement when there is no intrinsic ordering between the two variables (ie the measurements are interchangeable, such as test–retest reliability using the same method33). However, in method comparison studies there is a very clear ordering of the two variables (the two methods under comparison).

While the ICC is better able to avoid mistaking linear association for agreement, it suffers from being highly dependent on the range of values measured, that is, the greater the variability between subjects, the greater the value of the ICC. Consider a hypothetical group of five subjects who have IOP recorded by two different techniques (Goldmann tonometry vs tonopen) (Table 1). For study 1, the ICC for the two techniques is r=0.167 (P=0.38). When we repeat the study on a different set of subjects (study 2), the ICC for the two techniques becomes r=0.95 (P=0.002). However, despite such extreme differences in the value of the ICC, the actual levels of agreement in studies 1 and 2 look on inspection to be approximately equal (both studies have the same differences recorded). The reason for the disparity in the ICC values is that the range of IOPs in study 1 is much narrower than in study 2.

Table 1 Comparison of two different techniques for measuring IOP (Goldmann tonometry vs tonopen)

However, the ICC may be used to measure agreement,34 particularly when more than two observers or methods are being compared.
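
As a sketch of how the range dependence arises, the hypothetical example below implements a simple one-way random-effects ICC (ICC(1,1)) and applies it to two invented data sets that share exactly the same pairwise differences but differ in their between-subject spread. The formula, function name, and IOP values are illustrative assumptions, not the calculations behind Table 1.

```python
# Minimal sketch of a one-way random-effects ICC, ICC(1,1); invented IOP values (mmHg).
import numpy as np

def icc_oneway(x, y):
    """ICC(1,1) = (MSB - MSW) / (MSB + (k - 1) * MSW) for k measurements per subject."""
    data = np.column_stack([x, y]).astype(float)
    n, k = data.shape
    subject_means = data.mean(axis=1)
    msb = k * ((subject_means - data.mean()) ** 2).sum() / (n - 1)      # between-subject mean square
    msw = ((data - subject_means[:, None]) ** 2).sum() / (n * (k - 1))  # within-subject mean square
    return (msb - msw) / (msb + (k - 1) * msw)

differences = np.array([2, -1, 1, -2, 2])          # identical disagreement in both studies
goldmann_narrow = np.array([14, 15, 16, 15, 14])   # narrow between-subject IOP range
goldmann_wide = np.array([10, 16, 22, 28, 34])     # wide between-subject IOP range

print(f"Narrow range: ICC = {icc_oneway(goldmann_narrow, goldmann_narrow + differences):.2f}")
print(f"Wide range:   ICC = {icc_oneway(goldmann_wide, goldmann_wide + differences):.2f}")
```

Despite identical pairwise differences, the wider between-subject range yields a far higher ICC.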

‘Limits of agreement’ techniques

In 1983, Bland and Altman35 published their seminal article on agreement analysis. The 'limits of agreement' technique has become increasingly popular in agreement studies and has been adopted by many clinical scientists because it is simple to execute and easy to comprehend, relying on simple graphics and elementary statistics. The technique involves first calculating the difference for each pair of values and then plotting these differences against the corresponding mean of each pair. The differences (A–B) should be normally distributed and should be equally scattered across all levels of the corresponding mean.26 This graphical method also reveals extreme outliers affecting the data sample. The upper and lower 'limits of agreement' correspond to the mean difference (A–B)±1.96 standard deviations (SDs) of the differences, and represent the interval within which 95% of differences between measurements/measurers are expected to lie. Whether this interval is narrow enough to demonstrate good agreement is a matter of clinical judgement.

Three hypothetical Bland–Altman plots (Figures 2, 3 and 4) illustrate how bias can be identified by inspection of the plot. In interpreting Bland–Altman plots, it is important to consider whether the variability is comparable over the full range of measurements (Figure 3). Often, the variability increases as the magnitude of the measurement increases.28, 36 If so, the percentage difference may be more nearly constant, and the plot may be redrawn using log-transformed data, or using the ratio or percentage difference rather than the absolute difference between the two variables. If the variability is relatively constant, one then looks for any systematic trend in the mean difference (see Figure 4). The presence of bias may not in itself be a problem, provided it is known and can be adjusted for. An illustration of the advantage of the 'limits of agreement' technique over correlation in assessing agreement is provided by Murray and Miller.37

Figure 2

Hypothetical Bland–Altman plot of IOP recorded by Goldmann tonometry and tonopen. The solid line represents the mean difference (0.2 mmHg), and the dotted lines represent the upper (+2.6 mmHg) and lower (−2.2 mmHg) limits of agreement. This shows a mean difference between both measurements close to zero and no change in the magnitude of difference as the mean IOP increases.

Figure 3

Hypothetical Bland–Altman plot of IOP recorded by Goldmann tonometry and tonopen. The solid line represents the mean difference (−1.8 mmHg), and the dotted lines represent the upper (+6 mmHg) and lower (−9.5 mmHg) limits of agreement. This hypothetical example illustrates good agreement between both methods of measurement for the range of IOP <25 mmHg, but beyond this range the relationship breaks down and the Tonopen measures much higher IOPs than the Goldmann tonometer.

Figure 4

Hypothetical Bland–Altman plot of IOP recorded by Goldmann tonometry and tonopen. The mean difference was −3.5 mmHg; the upper and lower limits of agreement were −1.9 and −5.1 mmHg, respectively. This shows a fixed systematic bias (the Tonopen was consistently recording a higher IOP than the Goldmann tonometer, but the size of the difference did not change with increasing IOP).
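
A minimal sketch of the calculation behind such plots is given below. The IOP pairs are invented for illustration, and matplotlib is assumed to be available for drawing the plot itself.

```python
# Sketch of the 'limits of agreement' calculation with invented IOP pairs (mmHg).
import numpy as np
import matplotlib.pyplot as plt

goldmann = np.array([12, 14, 16, 18, 20, 22, 24, 26], dtype=float)
tonopen = np.array([13, 13, 17, 17, 21, 23, 23, 27], dtype=float)

difference = tonopen - goldmann
average = (tonopen + goldmann) / 2

bias = difference.mean()                 # mean difference
sd = difference.std(ddof=1)              # SD of the differences
lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
print(f"Mean difference {bias:.2f} mmHg; 95% limits of agreement {lower:.2f} to {upper:.2f} mmHg")

plt.scatter(average, difference)
plt.axhline(bias)                        # mean difference
plt.axhline(lower, linestyle="--")       # lower limit of agreement
plt.axhline(upper, linestyle="--")       # upper limit of agreement
plt.xlabel("Mean of Goldmann and tonopen (mmHg)")
plt.ylabel("Tonopen - Goldmann (mmHg)")
plt.show()
```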

When each method is used to take replicate measurements on the same subjects, the 'limits of agreement' method can still be applied by first calculating, for each subject, the mean of the replicate measurements obtained by each method, and then comparing the two methods using these pairs of means.

Cotter et al38 (in assessing test–retest reliability) and Beck et al39 (assessing two methods and their test–retest reliability) used a combination of both the ICC and Bland–Altman plots in their analyses. Their analyses are explicit and comprehensive and allow the reader to accept their conclusions with confidence.

Linear regression techniques

Linear regression techniques can be used to assess agreement, but there are many regression models and it is important to choose the correct one for an agreement study.25, 30 Models such as standardised principal component analysis,30 Deming regression,40 and the nonparametric Passing–Bablok model41, 42, 43 may be used. Ordinary least squares (OLS) regression, however, is inappropriate because it assumes that the x variable is fixed and measured without random error, with all the random error attributed to the y variable; this is rarely the case when examining agreement between measurers or methods.30
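
As a sketch of one such alternative, the function below implements Deming regression under the common simplifying assumption of equal error variances in the two methods (delta = 1); the data, variable names, and the OLS comparison are illustrative assumptions rather than a worked example from the cited models.

```python
# Sketch of Deming regression (errors allowed in both variables), assuming equal
# error variances (delta = 1); data are invented for illustration.
import numpy as np

def deming(x, y, delta=1.0):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    sxx, syy = np.var(x, ddof=1), np.var(y, ddof=1)
    sxy = np.cov(x, y, ddof=1)[0, 1]
    slope = (syy - delta * sxx + np.sqrt((syy - delta * sxx) ** 2 + 4 * delta * sxy ** 2)) / (2 * sxy)
    return slope, y.mean() - slope * x.mean()

method_a = [10.1, 12.3, 14.2, 16.8, 18.9, 21.0]
method_b = [10.6, 11.9, 14.8, 16.4, 19.5, 20.7]

slope, intercept = deming(method_a, method_b)
ols_slope, ols_intercept = np.polyfit(method_a, method_b, 1)   # OLS, for contrast only
print(f"Deming: B = {slope:.3f} A + {intercept:.3f}")
print(f"OLS:    B = {ols_slope:.3f} A + {ols_intercept:.3f}")
```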

Coefficients of repeatability

The repeatability coefficient is a useful statistic when dealing with repeat measurements by the same method (test–retest reliability).26, 44, 45 When there are only two measurements per subject, the repeatability coefficient is 2 × (SD of the differences between the repeated measures). This is the repeatability coefficient adopted by the British Standards Institution (BSI).6 As the mean difference between two measurements using the same method should be zero, we expect 95% of the differences to lie within 2 SDs of zero. Repeatability coefficients can be used in conjunction with other tests of test–retest reliability, for example, ICCs. As the repeatability coefficient is measured in the same units as the variable being measured, it should not strictly be termed a 'coefficient', since coefficients are by definition dimensionless; nevertheless, the term 'coefficient of repeatability' has been adopted to describe this statistic. Ruamviboonsuk et al46 used the coefficient of repeatability to compare the test–retest reliability of two visual acuity tests. However, no mention was made of whether the SD was unrelated to the magnitude of the score (a necessary assumption for using the coefficient of repeatability).
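
A minimal sketch of this calculation, using invented paired test–retest readings, is shown below; checking that the spread of the differences does not grow with the size of the measurement remains the caveat noted above.

```python
# Sketch of the BSI-style coefficient of repeatability: 2 x SD of the test-retest
# differences. Paired readings are invented for illustration.
import numpy as np

test = np.array([0.30, 0.18, 0.48, 0.60, 0.10, 0.78])     # eg logMAR acuity, session 1
retest = np.array([0.32, 0.20, 0.44, 0.62, 0.12, 0.72])   # same subjects, session 2

differences = test - retest
repeatability = 2 * differences.std(ddof=1)
print(f"Mean difference = {differences.mean():.3f}")
print(f"Coefficient of repeatability = {repeatability:.3f}")
# About 95% of repeat differences are expected to fall within +/- this value,
# provided the SD of the differences does not depend on the size of the measurement.
```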

Coefficient of variation

The coefficient of variation provides a relative measure of data dispersion: the within-subject SD divided by the mean, expressed either as a quotient or as a percentage. Because the coefficient of variation is dimensionless, it can be used to assess repeatability between two methods of measurement recorded on different scales. However, to be used correctly, the coefficient of variation should be independent of the mean.47, 48, 49, 50
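
The brief sketch below estimates the within-subject SD from paired repeats as the square root of the sum of squared differences divided by 2n, and expresses it as a percentage of the overall mean; the corneal thickness values are invented and the formula assumes only two repeats per subject.

```python
# Sketch of a within-subject coefficient of variation from two repeats per subject;
# corneal thickness values (microns) are invented for illustration.
import numpy as np

repeat1 = np.array([520, 545, 560, 530, 550], dtype=float)
repeat2 = np.array([525, 540, 565, 535, 545], dtype=float)

d = repeat1 - repeat2
within_subject_sd = np.sqrt((d ** 2).sum() / (2 * len(d)))    # Sw from paired differences
cov = within_subject_sd / np.concatenate([repeat1, repeat2]).mean()
print(f"Within-subject SD = {within_subject_sd:.1f} microns")
print(f"Coefficient of variation = {100 * cov:.2f}%")
```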

Categorical data

The following techniques can be used to compare agreement for categorical data.

A cross tabulation (row × column) table

A cross tabulation with rater 1's category frequencies assigned to the rows and rater 2's to the columns (see Table 2) provides almost all the information relevant to assessing agreement. The diagonal of the table represents the cells in which rater 1 agrees with rater 2. For good agreement, one would expect, on inspection, most observations to lie on this diagonal (see Table 2). The data can then be summarised further by calculating kappa or weighted kappa statistics.

Table 2 Hypothetical r × c array of observed frequencies between two raters on a scale (A, B, C in increasing order)
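
A short sketch of how such a table can be assembled from raw gradings is given below; the three-category scale and the gradings themselves are invented and are not the frequencies shown in Table 2.

```python
# Sketch: building an r x c cross tabulation from two raters' gradings (invented data,
# not the frequencies of Table 2). Rows are rater 1, columns are rater 2.
import numpy as np

categories = ["A", "B", "C"]
rater1 = ["A", "A", "B", "B", "B", "C", "C", "A", "B", "C"]
rater2 = ["A", "B", "B", "B", "C", "C", "C", "A", "B", "B"]

table = np.zeros((len(categories), len(categories)), dtype=int)
for g1, g2 in zip(rater1, rater2):
    table[categories.index(g1), categories.index(g2)] += 1

print(table)                                        # agreement lies on the diagonal
print(f"On the diagonal: {np.trace(table)} of {table.sum()} gradings")
```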

Cohen's Kappa coefficient51, 52

The original purpose of the kappa statistic (the unweighted kappa (K)) is to compare two measurers who use the same nominal scale.53 The kappa statistic gives a value that indicates the amount of agreement present, corrected for that which would have occurred by chance.29, 53, 54, 55 The value of K can range from −1 to +1 (zero indicates agreement no better than that expected by chance). For example, Azuara-Blanco et al56 used the unweighted kappa coefficient (in this case, for binary data) to analyse intra- and interobserver reliability among glaucoma experts in the detection of glaucomatous changes of the optic disk.
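
A minimal sketch of the calculation from a cross tabulation is given below; the counts are invented, and the chance-expected agreement is obtained from the products of the marginal totals in the usual way.

```python
# Sketch: unweighted Cohen's kappa from an invented 3 x 3 table of two raters' gradings.
import numpy as np

table = np.array([[20, 5, 1],
                  [4, 30, 6],
                  [2, 5, 27]], dtype=float)    # rows: rater 1, columns: rater 2

n = table.sum()
p_observed = np.trace(table) / n                               # observed agreement
p_chance = (table.sum(axis=1) @ table.sum(axis=0)) / n ** 2    # agreement expected by chance
kappa = (p_observed - p_chance) / (1 - p_chance)
print(f"Observed agreement = {p_observed:.2f}, chance agreement = {p_chance:.2f}, kappa = {kappa:.2f}")
```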

Weighted kappa statistic (Kw)

This is intended for ordinal categorical data (eg none, mild, moderate, severe). A weighting system is incorporated into the K statistic so that greater degrees of disagreement (eg 'none' paired with 'severe') attract a greater penalty. The commonest choice is a quadratic weighting system, in which the penalty increases with the square of the distance between categories.

For Kw, the value depends only on the cells that do not lie on the diagonal line of agreement (ie the off-diagonal entries); the cells on the diagonal line of agreement are given a disagreement weight of 0. A simple method for interpreting the K or Kw value is the empirical approach proposed by Landis and Koch,57 whereby 0.81≤K≤1.00 represents almost perfect agreement, 0.61≤K≤0.80 represents substantial agreement, and so on down to 0.00≤K≤0.20, which represents only slight agreement.
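
A sketch of the weighted calculation is given below, using quadratic disagreement weights (zero on the diagonal, rising with the square of the category distance); the counts are invented. For raw label lists, scikit-learn's cohen_kappa_score with weights='quadratic' can serve as a cross-check.

```python
# Sketch: weighted kappa with quadratic disagreement weights (0 on the diagonal),
# applied to an invented 3 x 3 table on an ordered scale.
import numpy as np

table = np.array([[20, 5, 1],
                  [4, 30, 6],
                  [2, 5, 27]], dtype=float)    # rows: rater 1, columns: rater 2

k = table.shape[0]
i, j = np.indices((k, k))
weights = (i - j) ** 2 / (k - 1) ** 2          # 0 for agreement, 1 for maximal disagreement

n = table.sum()
expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
kappa_w = 1 - (weights * table).sum() / (weights * expected).sum()
print(f"Quadratically weighted kappa = {kappa_w:.2f}")
```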

Because K and Kw behave like correlation statistics, they depend on the prevalence of the characteristic being studied.54, 58 This makes it difficult to compare two or more kappa values when the true prevalence differs between the groups or characteristics being compared.

Percentage agreement

This is a value that relates the number of measurements that agree to the total number of comparisons (expressed as a percentage). It is a crude assessment that does not tell us a great deal about agreement and does not incorporate any adjustment for agreement by chance.29 Hence, it is of little use in agreement studies, especially as there are superior techniques to compare categorical agreement.
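
The invented two-category example below illustrates the point: when one category dominates, the raw percentage agreement looks impressive even though the chance-corrected kappa is modest.

```python
# Sketch: percentage agreement vs chance-corrected kappa for an invented 2 x 2 table
# in which one category dominates.
import numpy as np

table = np.array([[88, 5],
                  [5, 2]], dtype=float)        # rows: rater 1, columns: rater 2

n = table.sum()
percentage_agreement = 100 * np.trace(table) / n
p_chance = (table.sum(axis=1) @ table.sum(axis=0)) / n ** 2
kappa = (np.trace(table) / n - p_chance) / (1 - p_chance)
print(f"Percentage agreement = {percentage_agreement:.0f}%")   # looks high ...
print(f"Kappa = {kappa:.2f}")                                  # ... but chance-corrected agreement is modest
```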

Summary

We have described simple and appropriate statistical strategies that can be used in the analysis of agreement studies. In addition, some inappropriate approaches have been highlighted, to discourage their future use in ophthalmic journals. A flow chart (Figure 5) is provided below to help choose an appropriate method when analysing agreement for both continuous and categorical data. The reader should note that this serves as a flexible guide rather than a regimented structure. It should also be noted that the technique to be used should, if at all possible, be decided a priori, at the study design stage.

Figure 5

Flow chart to choose an appropriate statistical approach for a particular agreement study.

In a recent review article, Altman emphasised the extent of the misuse of statistics in medical journals.59 It is important for all those practising evidence-based medicine to be familiar with appropriate statistical techniques for agreement analysis, so that they can judge the relative merits of publications claiming to show reliability for a particular test.