Chronic graft-versus-host disease (cGVHD) is a major complication of allogeneic hematopoietic stem cell transplantation (allo-HCT) and is associated with higher morbidity and mortality [1, 2]. Glucocorticoids have been the mainstay of treatment for cGVHD. They are also widely used to treat a variety of autoimmune diseases, often in combination with other immunosuppressive agents such as azathioprine (AZP), in order to reduce long-term complications of glucocorticoids such as diabetes mellitus, iatrogenic Cushing’s syndrome, avascular necrosis of joints, and osteoporosis [3–6].

Although a previous clinical trial suggested that a prednisone (PRD)-based regimen plus AZP (PRD + AZP) resulted in worse survival than a PRD-based regimen in a standard-risk group of cGVHD patients, owing to higher non-relapse (infection-related) mortality (NRM) [7], the therapeutic efficacy of AZP deserves re-examination because the allo-HCT field has advanced over the last decades. These advances include significant improvements in supportive care, such as infection prophylaxis and treatment, and in the systematic evaluation of cGVHD. In 2005, the National Institutes of Health (NIH) first proposed consensus criteria for the diagnosis of cGVHD, together with tools for scoring cGVHD organ involvement and assessing overall severity, which are now widely used in clinical practice [8, 9]. In addition, a new statistical endpoint for evaluating the efficacy of cGVHD treatment, failure-free survival (FFS), has been introduced and suggested as a potential surrogate for overall survival (OS) in cGVHD treatment [10, 11]. Therefore, we retrospectively reviewed 668 consecutive patients who underwent allo-HCT between 2004 and 2012 at Princess Margaret Cancer Centre, Toronto, Canada, in order to compare the efficacy of PRD + AZP and PRD-based regimens with respect to FFS as well as OS, NRM, and the incidence of relapse.

Chronic GVHD was defined, reclassified and graded by the NIH consensus criteria [8]. Among 313 patients with redefined cGVHD, we then identified 240 patients who received PRD or PRD + AZP as first line treatment for cGVHD. Late onset acute GVHD was excluded from the analysis.

FFS was defined as the time from initiation of frontline treatment for cGVHD to treatment failure (TF), NRM, or relapse of disease. TF was defined as initiation of the next line of immunosuppressive therapy (IST) for cGVHD [11] or an escalation of the PRD dose to ≥1 mg/kg/day, regardless of the target organ. OS and FFS were calculated by the Kaplan–Meier method and compared using the log-rank test. The cumulative incidences of NRM, disease relapse, and TF (the TF rate, TFR) for frontline cGVHD treatment were estimated considering competing risks, with disease relapse, NRM, and TF treated as mutually competing risks.
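To make the relationship between FFS and the competing-risk incidences concrete, the following is a minimal sketch in Python, not the software used in this study. It assumes hypothetical event codes (0 = censored, 1 = TF, 2 = NRM, 3 = relapse) and invented toy follow-up times. It computes the Kaplan–Meier estimate of FFS, counting any of the three events as a failure, and the cumulative incidence of each cause with the standard nonparametric estimator, in which each cause-specific increment is weighted by the failure-free survival just before the event time.

```python
from collections import Counter

# Hypothetical event codes (an assumption for illustration):
# 0 = censored, 1 = treatment failure (TF), 2 = NRM, 3 = relapse.
def ffs_and_cumulative_incidence(times, causes):
    """Return (event_times, ffs, cif): ffs is the Kaplan-Meier failure-free
    survival at each event time; cif[c] is the cumulative incidence of cause c."""
    event_causes = sorted(set(causes) - {0})
    at_risk = len(times)
    surv = 1.0
    cif_now = {c: 0.0 for c in event_causes}
    out_t, out_ffs = [], []
    out_cif = {c: [] for c in event_causes}
    for t in sorted(set(times)):
        counts = Counter(c for ti, c in zip(times, causes) if ti == t)
        n_events = sum(counts[c] for c in event_causes)
        if n_events > 0:
            # Cause-specific increment uses the survival just before time t.
            for c in event_causes:
                cif_now[c] += surv * counts[c] / at_risk
            # Product-limit update: any event type counts as a failure.
            surv *= 1.0 - n_events / at_risk
            out_t.append(t)
            out_ffs.append(surv)
            for c in event_causes:
                out_cif[c].append(cif_now[c])
        # Patients with an event or censored at t leave the risk set.
        at_risk -= sum(counts.values())
    return out_t, out_ffs, out_cif

# Toy data (months from start of frontline treatment); values are invented.
times = [2, 3, 3, 5, 8, 12, 12, 20]
causes = [1, 1, 0, 2, 3, 1, 0, 0]
event_times, ffs, cif = ffs_and_cumulative_incidence(times, causes)
for i, t in enumerate(event_times):
    print(t, round(ffs[i], 3), {c: round(v[i], 3) for c, v in cif.items()})
```

With this estimator, one minus FFS at any time equals the sum of the cumulative incidences of TF, NRM, and relapse, which is why the three events are described as mutually competing.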

The transplant-related characteristics of the PRD and PRD + AZP groups were compared using Pearson’s χ² or Fisher’s exact test. Univariate and multivariate analyses were performed to compare OS, NRM, relapse incidence, and FFS between the two treatment groups. OS and FFS were compared using the log-rank test. Univariate analyses of incidences with competing risks were performed by Gray’s method. A Cox proportional hazards regression model was used for multivariate analysis of survival.
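For readers unfamiliar with the log-rank test, a minimal two-group sketch in Python follows. It is illustrative only and is not the software used in this study; the function name and toy data are hypothetical. At each distinct event time it compares the observed number of events in one group with the number expected under the null hypothesis of equal hazards, and it refers the resulting statistic to a chi-square distribution with one degree of freedom.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Events are 1 (event) or 0 (censored).
    Returns (chi_square, p_value) with 1 degree of freedom."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    observed_minus_expected = 0.0
    variance = 0.0
    for t in sorted({ti for ti, e, _ in data if e == 1}):
        n = sum(1 for ti, _, _ in data if ti >= t)               # at risk, both groups
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)  # at risk, group A
        d = sum(1 for ti, e, _ in data if ti == t and e == 1)    # events, both groups
        d_a = sum(1 for ti, e, g in data if ti == t and e == 1 and g == 0)
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the number of events in group A.
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0
    chi_square = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value

# Toy example with invented follow-up times (months); all patients have events.
chi2, p = logrank_test([1, 2, 3, 4], [1, 1, 1, 1], [5, 6, 7, 8], [1, 1, 1, 1])
print(round(chi2, 2), round(p, 4))
```

Gray’s test for cumulative incidences and the Cox model are not reproduced here; in practice they would come from an established statistical package.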

Since the cGVHD characteristics of the two treatment groups were imbalanced (Table 1), we performed a propensity score matching (PSM) analysis as a case-control study in order to adjust for the potential confounding effects of the clinical features of cGVHD on treatment outcome. The clinical variables included in the propensity score calculation were global score (GS) by the NIH consensus criteria, classification of cGVHD (classical or overlap syndrome), age, gender, duration from allo-HCT to initiation of cGVHD treatment, performance status (PS), progressive type onset (PTO) of cGVHD, thrombocytopenia, and organ involvement by cGVHD (skin, gastrointestinal tract, liver, lung, and musculoskeletal system). A total of 74 case-control pairs were identified with a difference in propensity score of <0.1.
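The matching step can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in this study: the matching algorithm is not specified above, so the sketch assumes greedy 1:1 nearest-neighbour matching without replacement, takes the <0.1 score difference as the caliper, and uses hypothetical patient identifiers and propensity scores. In practice the scores would come from a model, such as logistic regression, fitted on the variables listed above.

```python
def match_pairs(treated, controls, caliper=0.1):
    """Greedy 1:1 matching on propensity score (assumed algorithm).
    treated / controls: dict of patient id -> propensity score.
    Returns (treated_id, control_id) pairs whose scores differ by less
    than `caliper`; each control is used at most once."""
    available = dict(controls)
    pairs = []
    # Process treated patients in ascending score order for determinism.
    for tid, score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Nearest remaining control by absolute score difference.
        cid = min(available, key=lambda c: abs(available[c] - score))
        if abs(available[cid] - score) < caliper:
            pairs.append((tid, cid))
            del available[cid]
    return pairs

# Hypothetical scores: "A*" received PRD + AZP, "P*" received PRD alone.
treated = {"A1": 0.30, "A2": 0.55, "A3": 0.90}
controls = {"P1": 0.28, "P2": 0.50, "P3": 0.62, "P4": 0.10}
print(match_pairs(treated, controls))
```

In this toy example the third treated patient has no control within the caliper and is left unmatched, which is how a matched cohort ends up smaller than the original one.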

Table 1 Characteristics of patients and chronic GVHD

Of the 240 patients included in the analysis, 154 (64.2%) received myeloablative conditioning (MAC) and 86 (35.8%) received reduced-intensity conditioning (RIC) (Table 1). There were no significant differences in pretransplant characteristics between the PRD and PRD + AZP groups, although T-cell depletion (TCD) tended to be more frequent in the PRD group: 33 patients (23.2%) in the PRD group and 13 (13.3%) in the PRD + AZP group underwent TCD (p = 0.054). The characteristics of cGVHD were imbalanced between the two groups: the PRD + AZP group had a longer duration from HCT to diagnosis of cGVHD (p < 0.001), fewer patients with severe cGVHD (p < 0.001), fewer with PTO (p = 0.002), fewer with thrombocytopenia (p = 0.008), and better PS (p = 0.008).

With a follow-up duration of 43.6 months among survivors, 2-year FFS, TFR, NRM, and relapse incidence were 24.7% (95% confidence interval (CI), 19.1–30.8%), 57.5% (50.8–64.0%), 7.5% (4.5–11.5%), and 10.1% (6.5–14.5%), respectively. The PRD + AZP group had a higher FFS rate at 2 years (36.4% [26.2–46.6%]) than the PRD group (16.8% [10.8–23.9%], p < 0.001) (Fig. 1a) and a lower incidence of TFR at 2 years (52% [40.8–62.0%] versus 61.5% [52.5–69.3%], p = 0.050). In addition, it had a lower NRM rate at 2 years (3.4% [0.9–8.8%] versus 10.5% [6–16.5%], p = 0.050). There was no difference between the groups in the cumulative incidence of relapse at 2 years: 8.3% (3.6–15.5%) in the PRD + AZP group and 11.3% (6.5–17.4%) in the PRD group (p = 0.507).

Fig. 1 Survival and treatment failure (n = 240). a Failure-free survival comparing the prednisone and prednisone/azathioprine groups. b Adjusted overall survival comparing the prednisone and prednisone/azathioprine groups, considering the severity of chronic graft-versus-host disease and performance status

Severity by the NIH consensus criteria was well correlated with FFS. The FFS rate at 2 years was 62.2% (39.9–78.3%) in mild, 20.5% (14.2–27.7%) in moderate, and 16.9% (7.5–29.6%) in severe cGVHD (p < 0.001). Patients with mild cGVHD had a lower TFR at 2 years (29.2% [12.6–48.1%]) than those with moderate/severe cGVHD (61.4% [54–68%], p = 0.008). Severity by the NIH consensus criteria did not correlate with the cumulative incidence of NRM (p = 0.538) or relapse (p = 0.826). None of the factors associated with FFS or with the cumulative incidences of TFR and NRM was correlated with relapse rate.

OS at 2 years was 71.6% (64.6–77.4%). The PRD + AZP group showed better survival than the PRD group (OS at 2 years: 82.1% [71–89.2%] versus 64.8% [55.4–72.8%], p < 0.001). The severity of cGVHD and PS also correlated well with 2-year OS (p = 0.004 and p < 0.001, respectively). After adjustment for cGVHD severity and PS, the difference in OS between the PRD and PRD + AZP groups remained statistically significant (HR for PRD group 2.09 [1.22–3.58], p = 0.007) (Fig. 1b).

Univariate analysis for FFS identified several risk factors associated with worse FFS including moderate/severe cGVHD (median FFS (months); 55.9 versus 7.6, p = 0.001), ECOG PS ≥ 2 (9 versus 5.4, p = 0.003), thrombocytopenia (6.5 versus 6.2, p = 0.05), PTO (8.6 versus 2.7, p = 0.001), and PRD group (13.2 versus 5.6, p < 0.001). Multivariate analysis confirmed that moderate/severe cGVHD (hazard ratio [HR] 3.10, p < 0.001), PTO (HR 2.21, p = 0.001) and PRD (versus PRD + AZP) as the first-line treatment regimen (HR 2.12, p < 0.001) were risk factors for worse FFS.

After PSM, the characteristics of cGVHD were well balanced between the two groups (Table 1). The PSM analysis confirmed the findings of superior outcomes in the PRD + AZP group. Two-year FFS was significantly better in the PRD + AZP group (36.4%) than in the PRD group (16.8%, p < 0.001). The cumulative incidence of TFR for frontline treatment at 2 years was also lower in the PRD + AZP group (52.4% versus 70.1%, p = 0.013). There were no significant differences in NRM or relapse rate at 2 years, but a trend towards longer OS was again observed in the PRD + AZP group of the PSM cohort (85.3% [72.6–92.4%] at 2 years in the PRD + AZP group versus 75.9% [63.1–84.8%] in the PRD group, p = 0.066).

When confined to the same severity level according to the NIH consensus criteria, there was also a trend towards longer FFS in the PRD + AZP group: the favorable effect of PRD + AZP was statistically significant in the subgroup with moderate grade of cGVHD [FFS at 2 years (%); 30.5 versus 9.1, p = 0.001], but not in the mild and severe grades. Similar results were obtained for the cumulative incidence of TFR of frontline treatment at 2 years among the patients with moderate cGVHD; 56.2% (41.6–68.6%) in the PRD + AZP group and 71.4% (46.8–81.7%) in the PRD group (p = 0.035).

In addition, tapering of the PRD dose to < 0.5 mg/kg/day was more successful in the PRD + AZP group than in the PRD group: the cumulative incidence of reaching PRD < 0.5 mg/kg/day at 6 months was 90.5% in the PRD + AZP group and 75.8% in the PRD group (p = 0.018).

Although PSM analysis was performed to control for the imbalance in patient characteristics between the PRD and PRD + AZP groups, the results of this study should be interpreted with caution given its retrospective nature, which provides weaker evidence for the role of AZP in cGVHD treatment than the previous trial [7]. However, AZP added to a PRD-based regimen as first-line treatment for cGVHD seems to improve FFS and may have a role as a steroid-sparing agent in the modern allo-HCT era. Since two thirds of the patients who required PRD-based treatment for cGVHD experienced TF by 2 years, a better treatment strategy is needed. AZP could be worth reconsidering as a steroid-sparing option in cGVHD treatment.