Evidence-based surgery relies on the results of randomised controlled trials (RCTs). The judicious design, analysis, and reporting of RCTs allow surgeons to use the results effectively in routine practice [1,2,3]. Since the population included in an RCT is not homogeneous a priori, treatment effects might vary across different subgroups. Thus, assessing treatment outcome heterogeneity across subgroups and identifying patient characteristics that may modify the effect of the intervention under investigation has become common practice [4]. Subgroup findings, if true, might have important implications for surgical practice, but they often prove unreliable and are criticised for their ability to turn negative results into positive ones [5, 6]. Here, we revisit the 11 criteria (Table 1) introduced by Sun et al. [7] and provide case examples from the ophthalmological literature to illustrate important principles and concepts in the interpretation of a subgroup analysis and to guide researchers in judging the credibility of such analyses.

Table 1 Criteria for assessing the credibility of subgroup analysis.

Subgroup analyses are either planned a priori before randomisation or they emerge after randomisation (post-hoc) [4]. The former is credible if planned based on a prespecified hypothesis, if there is a justified direction of the overall and subgroup effect, and if there is appropriate statistical testing for the underlying hypothesis. For instance, an RCT of 702 patients (Protocol V of the DRCR net) with diabetic macular oedema and good visual acuity (VA) was designed to assess the effect of initial management with aflibercept or laser photocoagulation on vision loss versus observation [8] but failed to find significant changes in VA from either treatment versus observation in the overall population or across the predefined subpopulations. The magnitude and direction of the overall and subgroup effects are likely to be predictable if they are hypothesised based on sound biological and clinical plausibility [7, 9]. Post-hoc subgroup analyses, in contrast, are data driven and are considered exploratory or hypothesis generating. Their credibility is compromised by the effect of intervention and lack of statistical power [7, 9].

Simultaneous subgroup analyses create multiplicity, inflating the type I error rate above the defined nominal significance level (alpha) [10], which increases the likelihood of spurious but seemingly compelling results arising by chance alone [1]. To combat this, it is recommended to prespecify a few highly relevant subgroups, use appropriate statistical tests to examine interactions between treatment effect and subgroup variables, and ensure p-values are adjusted for multiple testing [1, 7, 11]. The interaction test determines whether treatment effects differ between subgroups, under the null hypothesis that the true effect is the same across each subgroup category [1, 7, 12, 13]. The smaller the p-value of the interaction test, the less likely it is that chance alone explains the apparent subgroup effect. For instance, the CATT [14] conducted a non-inferiority trial to compare the efficacy of ranibizumab versus bevacizumab on either a monthly scheduled or an as needed regimen in patients with neovascular age-related macular degeneration and found equivalent gain in VA by treatment and dosing regimen at 1 year. Each of the monthly scheduled treatment groups was then rerandomised to a monthly scheduled or an as needed regimen. The Year-2 CATT [15], assessing the 2-year effects of the four original groups and the impact of switching from a monthly scheduled to an as needed regimen, found a similar gain in VA between treatment groups [1.4 letters difference; 95% CI −0.8, 3.7] but a greater gain in the monthly scheduled regimen [2.4 letters difference; 95% CI 0.1, 4.8]. The difference did not exceed the non-inferiority margin of 5 letters. To increase power and precision, the treatment and scheduling effects were analysed between treatment and scheduling regimens (interaction p-value of ≥0.10 for the non-inferiority hypothesis) rather than as the effects of each drug by scheduling regimen type. The p-value for interaction is rarely reported in the ophthalmology literature, making the independence of subgroup effects uncertain [8, 16].
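The arithmetic behind multiplicity and the interaction test can be illustrated with a minimal Python sketch. It rests on simplifying assumptions that are ours and not drawn from any of the cited trials: the subgroup tests are treated as independent, the adjustment shown is a simple Bonferroni correction, and the interaction test is a two-sided z-test on the difference between two subgroup estimates with known standard errors. The function names and the numbers in the example are hypothetical.

```python
import math


def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false-positive result among n_tests
    independent tests, each run at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test significance level under a Bonferroni correction."""
    return alpha / n_tests


def interaction_test(effect_a, se_a, effect_b, se_b):
    """Two-sided z-test of the null hypothesis that the treatment
    effect is the same in subgroup A and subgroup B.

    Assumes the two estimates are independent and approximately
    normally distributed. Returns (difference, z, p_value).
    """
    diff = effect_a - effect_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    z = diff / se_diff
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return diff, z, p_value


if __name__ == "__main__":
    # Inflation of the false-positive risk as more subgroups are tested.
    for k in (1, 5, 10, 20):
        print(k, round(familywise_error_rate(k), 3), bonferroni_alpha(k))

    # Hypothetical subgroup estimates (letters gained) with standard errors.
    diff, z, p = interaction_test(3.0, 1.0, 1.0, 1.0)
    print(f"difference={diff:.1f}, z={z:.2f}, interaction p={p:.3f}")
```

Under these assumptions, ten unadjusted subgroup tests at alpha = 0.05 carry roughly a 40% chance of at least one spurious significant result. The interaction test makes the complementary point: it compares the subgroup estimates with each other directly, rather than asking whether each subgroup reaches significance on its own.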

Given differences in the administration of surgical treatments and the extent of biological variability, the interaction between treatment effect and various patient variables should be interpreted with caution [1]. The strength with which an inference is made on subgroup effects largely relies on the magnitude of the difference [9, 17]. That is, as the magnitude of the difference in treatment effect between subgroups increases, the likelihood of a real subgroup difference rises. The validity of a subgroup analysis largely depends on reporting all of the conducted subgroup analyses regardless of their statistical significance [1] as well as on the consistency of the treatment effect across closely related outcomes [9]. A pooled analysis of two RCTs of 107 patients with thyroid eye disease [18] illustrates effective adherence to these principles in its design. The study found that the improvements in aggregated proptosis and diplopia responses with teprotumumab intravenous infusions compared with placebo were large and consistent, both in the overall population and across several predefined subgroups.

Arguably, the consistency of the subgroup effects in subsequent well-designed trials provides stronger credibility. Subgroup effects are also more credible if the comparison was made within a study rather than across multiple studies with different methodological qualities [9]. Planning subgroups based on the current understanding of biological mechanisms by anticipating pathophysiological, genetic, or biological heterogeneity [3] is equally important. Accounting for these criteria may be infeasible considering the heterogeneity of interventions, the rarity of patient populations, and the poor reporting quality of RCTs in the literature [2, 19]. For instance, a meta-analysis of 17 RCTs examining the effect of omega-3 fatty acid supplementation for the treatment of dry eye disease [20] reported a significant decrease in dry eye symptoms from daily omega-3 fatty acid supplementation with 96% heterogeneity. Post-hoc subgroup analyses by country showed significantly larger treatment effects in trials from India compared to elsewhere. One possible explanation was the predominantly vegetarian diet and low intake of omega-3 fatty acids in India. Another explanation might be that five of six trials from India were conducted by the same group of authors in a similar setting and population.

Well-designed surgical RCTs adequately assess the effectiveness and safety of new surgical treatments in the overall population, but reliable analysis of treatment effects across subpopulations has been slow to be adopted [1]. Surgical RCTs should provide a thorough investigation of the benefits and harms of a new treatment in the overall population and key subpopulations. This editorial highlights the 11 criteria as a general guide for clinician readers of evidence regarding their use in clinical settings, but researchers interested in systematic reviews and individual research planning could consider following ICEMAN [21] as a more comprehensive instrument.