An international comparison of retinopathy of prematurity grading performance within the Benefits of Oxygen Saturation Targeting II trials

Purpose To investigate whether the observed international differences in retinopathy of prematurity (ROP) treatment rates within the Benefits of Oxygen Saturation Targeting (BOOST) II trials might have been caused by international variation in ROP disease grading. Methods Groups of BOOST II trial ophthalmologists in UK, Australia, and New Zealand (ANZ), and an international reference group (INT) used a web based system to grade a selection of RetCam images of ROP acquired during the BOOST II UK trial. Rates of decisions to treat, plus disease grading, ROP stage grading, ROP zone grading, inter-observer variation within groups and intra-observer variation within groups were measured. Results Forty-two eye examinations were graded. UK ophthalmologists diagnosed treat-requiring ROP more frequently than ANZ ophthalmologists, 13.9 (3.49) compared to 9.4 (4.46) eye examinations, P=0.038. UK ophthalmologists diagnosed plus disease more frequently than ANZ ophthalmologists, 14.1 (6.23) compared to 8.5 (3.24) eye examinations, P=0.021. ANZ ophthalmologists diagnosed stage 2 ROP more frequently than UK ophthalmologists, 20.2 (5.8) compared to 12.7 (7.1) eye examinations, P=0.026. There were no other significant differences in the grading of ROP stage or zone. Inter-observer variation was higher within the UK group than within the ANZ group. Intra-observer variation was low in both groups. Conclusions We have found evidence of international variation in the diagnosis of treatment-requiring ROP. Improved standardisation of the diagnosis of treatment-requiring ROP is required. Measures might include improved training in the grading of ROP, using an international approach, and further development of ROP image analysis software.


Introduction
Retinopathy of prematurity (ROP) is graded using the International Classification of Retinopathy of Prematurity (ICROP). 1 While standard images are provided in ICROP, examiners must use subjective judgement when describing ROP in an infant. Variation in the rates of severe ROP between clinical centres have been attributed in part to observer bias. 2 A number of studies have demonstrated interobserver variation when grading ROP using retinal images. [3][4][5][6][7][8][9][10] Five international, multicentre randomised controlled trials of oxygen saturation targeting in very premature infants have recently been reported. The trial protocols were prospectively aligned to facilitate meta-analysis, the NeOProM collaboration. 11 The trials were performed in UK, 12,13 Australia,12,13 New Zealand, 12,14 Canada, 15 and USA. 16,17 The Benefits of Oxygen Saturation Targeting (BOOST) II trials performed in UK, Australia, and New Zealand reported outcomes at the time of hospital discharge in 2013. 12 While the participants (premature infants) were broadly similar across countries, a large difference in ROP treatment rates was noted. 12 153/798 (19.2%) of enroled infants were treated in the UK, compared to 75/975 (7.7%) in Australia and 23/306 (7.5%) in New Zealand. 12 All ophthalmologists in the BOOST II trials were instructed to base their decision to treat on the ETROP 18 definition of Type 1 ROP, however subjective interpretation of ROP disease signs may have varied between countries.
Within the BOOST II UK trial, ophthalmologists in 12 of the 34 trial centres used RetCam imaging (Natus Medical, Pleasanton, CA, USA) for ROP screening. 19 Imaging was not performed in the other UK centres. These images gave us the opportunity to compare ROP grading decisions made by ophthalmologists in the UK, Australia, and New Zealand who participated in the BOOST II trials. An international reference group was used as the gold standard. We aimed to determine whether international variation in the interpretation of images and subsequent treatment decisions was present, evidenced in our opportunistic cohort.

Ophthalmologists participating in the BOOST II trials
Within the BOOST II trials, local ophthalmology services for routine ROP screening and treatment were used. In the UK, BOOST II UK trial ophthalmologists were asked to attend a training session on ROP classification, and were provided with printed training materials. In Australia and New Zealand, all BOOST II trial ophthalmologists were asked to self-certify prior to the trials, using a training and assessment website http:// www.boostnz.info/ROP/.

Readers
Nine readers from Australia, two from New Zealand, and seven from UK who participated in the BOOST II trials were used (Supplementary Information). The groups from Australia and New Zealand were combined (ANZ) because the number of readers from New Zealand was low, and because ophthalmologists in Australia and New Zealand have a close working relationship for training and clinical practice. An international reference group of six experienced ophthalmologists with an interest in ROP who had not participated in the trials was used as the 'gold standard' (INT) (Supplementary Information). The international reference readers were from UK (2), USA (2), Canada (1), and Australia (1). The median (range) number of year's experience of the readers in performing clinical ROP screening examinations was 25 (14)(15)(16)(17)(18)(19)(20)(21)(22)(23)(24)(25)(26) for the UK group, 15 (3.5-40) for ANZ, and 21  for the international reference group.

Reading experiments
Each reader logged on to the study website and was given detailed instructions on how to classify the study images. Readers were referred to ICROP, 1 but standard comparison images were not given. To protect patient anonymity, no clinical data were provided. For each eye examination, drop down menus were used to grade ROP, and a decision to 'treat' or 'not treat'. The order of eye examinations was randomised each time a reader logged on. On completion of grading, data were downloaded to an Excel spreadsheet for analysis.

Eye examination images
Images were selected by the lead study ophthalmologist (BWF) for high image quality and readability. An eye 'examination' was a set of one to five images obtained when examining one eye of one infant on one occasion. All selected eye examinations were performed prior to treatment. Forty-two eye examinations obtained from six centres were used (Supplementary Information). In some infants more than one eye examination was used, to ensure a range of ROP disease severity was available for review. When more than one examination was used from the same infant, each examination was performed on a different date. In 31 infants one eye examination was used, in 3 infants two examinations were used, and in one infant five examinations were used. Six of the 42 eye examinations, illustrating a range of ROP severity, were duplicated to allow measurement of intra-observer variation. Each reader assessed 48 eye examinations. Seventeen of the 42 (40.5%) image sets were obtained at the time when a decision to treat was made, or immediately prior to treatment. Thirteen of the 42 (31.0%) image sets were from infants who did not require treatment at the time of imaging, but who were subsequently treated. Twelve of the 42 (28.6%) image sets were from infants who were not treated for ROP at any time.

Infants
RetCam images from 35 infants were used, linked to clinical data from the BOOST II UK trial. Thirty-four infants were white, one was British Pakistani. Seventeen (48.6%) infants were female and 18 were male. The mean gestational age was 25 +2 weeks, range 22 +6 -27 +6 weeks. The mean (SD) birth weight was 785 (170) g, range 366-1115 g. Twenty-three of the 35 (65.7%) infants were treated for ROP at some time in their clinical course.

Statistical analysis
Descriptive statistics were used to summarise data according to type and distribution using counts/ percentages for categorical data, means (standard deviations [SD]) for normally distributed continuous variables, and medians (ranges) for other continuous variables. Data obtained from duplicate eye examinations were only used for the calculation of intra-observer variation and were excluded from all other analyses. Inter-observer variation (Fleiss kappa) and intra-observer (Cohen kappa) values were calculated using the online tool www.statstodo.com. Conventionally, a kappa of 0.2 or less is considered poor agreement, 0.21-0.4 fair, 0.41-0.6 moderate, 0.61-0.8 strong and more than 0.8 near complete agreement. 20 These terms were used when reporting our results.

Treatment decisions
Of the 42 eye examinations reviewed the mean (SD) number of examinations per reader judged to require treatment was 13.9 (3.49) for UK readers, 9.4 (4.46) for ANZ readers, and 12.8 (5.49) for the international readers. The difference between UK and ANZ readers was significant (t-test P = 0.038, mean difference = 4.49, 95% CI = 0.27-8.72).

Plus disease
Of the 42 eye examinations reviewed the mean (SD) number of examinations per reader judged as 'plus' disease was 14.1 (6.23) for UK readers, 8.5 (3.24) for ANZ readers, and 13.2 (6.31) for the international readers ( Table 1). The difference between UK and ANZ readers was significant (t-test P = 0.021, mean difference = 5.69, 95% CI = 0.98-10.40).

Stage of retinopathy of prematurity
For each reader, the number of examinations read for each ROP stage was calculated. The mean (SD) for each reader group is given in Table 2. The mean number of eye examinations per reader classified as stage 2 was higher in the ANZ group than in the UK group (t-test, P = 0.026, mean difference = 7.47, 95% CI = 1.00-13.94). For stage 3 there were no significant differences between the groups.

Zone
For each reader, the number of examinations read for each ROP zone was calculated. The mean (SD) for each reader group is given in Table 3. The proportion of eye examinations read as each zone was not significantly different between any pair of groups.

Inter-observer variation
Inter-observer variation Fleiss kappa measures for each classification variable are given in Table 4. Inter-observer agreement for the whole group of readers was 'fair' or 'moderate' for all measures. Agreement was highest within the ANZ group for all measures, with 'moderate' agreement for treatment decisions and for plus disease categories. Agreement was 'fair' for treatment decisions within the UK group. Agreement was poor for most measures within the INT group.

Intra-observer variation
We measured intra-observer variation by including six duplicate examinations within the 48 eye examinations shown to each reader. The results are shown in Table 5.
All kappa values were within the 'strong' or 'near perfect' agreement categories.

Discussion
We have compared the ROP grading decisions of BOOST II trial ophthalmologists in UK with those in Australia and New Zealand. UK ophthalmologists demonstrated a lower threshold to treat than Australian and New Zealand ophthalmologists. UK ophthalmologists graded more images as plus disease, and more images as treatment-requiring. There were no significant differences in grading stage 3 disease or ROP zone. The UK ophthalmologists had more inter-observer variation than the Australian and New Zealand ophthalmologists. Intraobserver consistency appeared to be good among all ophthalmologists. The international reference ophthalmologists graded in a similar way to the UK ophthalmologists.
There were a number of limitations in our study. While the data were obtained within the context of a clinical trial, RetCam images and ROP clinical data were obtained from routine clinical screening examinations. RetCam imaging was used in a limited number of centres during the BOOST II UK trial, and in some centres was only used immediately prior to treatment. The quality of images obtained was variable. The completeness of accompanying clinical data from the treating ophthalmologists was variable. The set of RetCam images used for the study was selected, not random. The groups of readers from each country were biased towards experienced, research-active ophthalmologists. The international reference group was limited in number, and may not have been truly representative of broad-based international expertise. The sample size of both RetCam images and of readers was small and therefore of insufficient power to detect all but the largest differences. The context of this study was a group of five oxygen trials in premature infants-the NeOProM collaboration. 11 Significant differences in ROP treatment rates between countries were evident. Within the BOOST II trials performed in UK, Australia, and New Zealand, 153/798 (19.2%) of enroled infants were treated in the UK, compared to 75/975 (7.7%) in Australia and 23/306 (7.5%) in New Zealand. 12 Thus, in the UK 153 infants were treated, and 645 were not treated. In Australia and New Zealand combined (ANZ), 98 were treated and 1183 were not treated. The difference in treatment rates was significant (Chi squared test Po0.0001, odds ratio = 2.51, 95% CI = 1.98-3.18). In the Canadian COT trial, 130/1003 (13%) of trial survivors at 36 weeks postmenstrual age had undergone ROP treatment or had Stage 4 or 5 ROP. 15 In the American SUPPORT trial, 120/913 (13.1%) of trial survivors at 36 weeks postmenstrual age had undergone ROP treatment or had been diagnosed as having Type 1 ETROP. 18 If the Canadian and USA trials are combined, 15,16 250 of 1916 (13.0%) were treated. The difference in treatment rates between the North American trials and the ANZ trial was significant (Chi squared test Po0.0001, odds ratio = 1.71, 95% CI = 1.37-2.13), and the difference in treatment rates between the UK trial and the North American trials was also significant (Chi squared test Po0.0001, odds ratio = 1.47, 95% CI = 1.22-1.77). These differences are unlikely to be due to chance.
The baseline clinical characteristics of infants enroled in the BOOST II UK, BOOST II Australia and the BOOST New Zealand trials were very similar. 12 In addition, the measured oxygen treatments given to the infants in the trials were very similar, as were morbidity measures (other than treatment for ROP), and mortality. 12 The cohorts enroled in the Canadian and USA trials 15,16 were also similar to those in the BOOST II trials. It is therefore unlikely that the difference in treatment rates between the individual studies was due to differences in the patient populations.
Different treatment rates could potentially result in different visual outcomes. The 2 year outcome data from the UK and Australian trials 13 and from the New Zealand trial 14 gave visual outcome data. In the UK 23 of 718 infants (3.2%), in Australia 5 of 911 infants (0.55%), and in New Zealand 1 of 340 infants (0.29%) had severe visual impairment. Additional information was available for the subgroup of UK infants treated with the revised oxygen algorithm. 13 Eighteen of 551 (3.3%) had severe visual impairment. Four of these had retinal detachment, 12 had cerebral visual impairment and in two the cause was not recorded. Thus 4 of 551 (0.73%) had severe visual    22,23 and over time. 9,[24][25][26] Some variation may be due to differences in the clinical characteristics of the populations under study, and to neonatal care practices. This is likely to be the case when comparing countries with differing health service characteristics and over periods of time. 9,22,24,25 The clinical characteristics of the infants in the BOOST II trials were very similar. 12,13 In this study, we have explored the possible contribution to the observed different rates of ROP treatment of international variation in disease grading. 2,9 Our results suggest such variation was present.
While inter-observer agreement for plus disease grading was 'moderate' within the ANZ group, it was 'poor' for the UK group. Previous studies have also found limited agreement between experts in the diagnosis of treatment-requiring ROP, 10,27 and of plus disease. 10 Gschliesser found moderate inter-observer agreement (kappa 0.41) for the necessity for treatment, and 'fair' agreement (kappa 0.32) for plus disease. 10 Chiang found 'fair' and 'moderate' weighted kappa agreement for the diagnosis of plus disease when each of a group of experts was compared to all the other experts in the group. 5 While standardisation of ROP diagnostic grading may be approached by improved training of screening ophthalmologists, 28,29 an international approach is needed. Tools such as online training and assessment websites may be used. In Australia and New Zealand all BOOST II study ophthalmologists were asked to selfcertify prior to the trials, using http://www.boostnz. info/ROP/.
The key component in ROP treatment decisions is the detection of plus disease, as defined by ICROP. 1 Our study, and a number of other studies, 5 show the limitations of clinical judgement based on reference photographs. As has occurred in diabetic retinopathy screening, a move towards the use of retinal images rather than clinical examinations is a prerequisite for the standardisation of diagnostic decisions. 30,31 Computerised image analysis techniques, trained by clinical experts, are needed to improve the objectivity of treatment decisions. [30][31][32][33][34][35][36][37][38][39] The planning of international ROP treatment trials requires improved training and standardisation of observers. In the Cryotherapy for ROP study, a second examiner was required to examine each infant within 3 days of the primary examiner, to confirm the presence of treatment-requiring 'threshold' disease. 40 In 12% of cases, the two examiners disagreed on the presence of plus disease. 27,40 Ideally, retinal images should be used in trials, with central reading centres. 41 Both clinical trials and clinical practice will benefit from the use of image analysis software that quantifies plus disease.
We found international variation in the diagnosis of treatment-requiring ROP. While excessively low rates of ROP treatments risk blindness, excessively high rates of ROP treatments should also be avoided. Treatment is invasive, and carries risks of ocular and systemic morbidity. Improved standardisation of treatment decisions is an important goal. Approaches might include the use of internationally standardised online training tools, and the development of image analysis software to quantify ROP plus disease.

Summary
What was known before K Differences in ROP treatment rates have been documented between centres, between countries, and over time. These may have occurred for a number of reasons. ROP treatment rates varied between countries in the BOOST II trials. A number of studies have found variability between experts in the diagnosis of treatment-requiring ROP, and of plus disease.
What this study adds K In retinal image ROP grading experiments, BOOST II UK trial ophthalmologists were more likely to diagnose plus disease, and treatment-requiring ROP than BOOST II trial ophthalmologists in Australia and New Zealand. As the patient populations were very similar in the BOOST II trials, it is likely that variation in treatment rates between countries was due to international variation in ROP grading and treatment decisions.

Conflict of interest
Dr Anna Ells was a member of the Scientific Advisory Board of Clarity Medical Systems, the manufacturer of RetCam technology, until 2011 and Professor Alistair R Fielder has received remuneration for presentations on behalf of Clarity Medical Systems.