Introduction

Biopsy is an important tool for monitoring kidney allografts, and the findings of kidney allograft biopsy can guide clinical management and provide prognostic information. The Banff Classification of Renal Allograft Pathology, first published in 1993, is widely applied to the diagnosis of kidney allograft rejection1, and the importance of interstitial inflammation has been emphasized since the establishment of the classification. According to the most recent Banff classification update2, the diagnosis of acute T-cell mediated rejection requires a minimum of moderate tubulitis and moderate interstitial inflammation in the non-scarred cortex. Similarly, the diagnosis of chronic active T-cell mediated rejection requires a minimum of moderate tubulitis, moderate total cortical inflammation, and moderate inflammation in the scarred cortex.

Reproducibility, quantified by inter-rater and intra-rater reliability, is one of the most important attributes of any classification or scoring/grading scheme. Given the importance of interstitial inflammation in the diagnosis of kidney allograft rejection, good reproducibility is essential. Many studies have confirmed the therapeutic and prognostic relevance of the Banff classification3,4,5,6. However, publications on reproducibility, especially the inter- and intra-rater reliabilities of interstitial inflammation scoring, are scarce7,8,9,10,11,12, and the reliability coefficients they report vary widely, from 0.33 to 0.65. A general understanding of how well or poorly renal pathologists perform on interstitial inflammation scoring has not been established, the reported reliability coefficients remain opaque to transplant practitioners, and the reasons for good or poor reliability have not been investigated; furthermore, the extent to which variation in interstitial inflammation scoring may affect clinical practice is unknown.

To address these issues, the current study investigated the inter-rater and intra-rater reliabilities of interstitial inflammation scoring according to the Banff classification. In addition to the traditional kappa statistic approach, conditional agreements were calculated to make inter- and intra-rater reliabilities more transparent and interpretable for nephrologists, transplant surgeons, and renal pathologists. The findings provide a comprehensive picture of the reproducibility of interstitial inflammation scoring and should call attention to its possible clinical implications.

Results

Inter-rater reliabilities of inflammation scorings were suboptimal and had wide ranges

Table 1 summarizes the pairwise inter-rater reliabilities of interstitial inflammation scorings of the 8 raters (28 pairs in total). For inflammation in the non-scarred cortex (i score), pairwise weighted kappa values ranged from 0.14 (slight agreement) to 0.44 (moderate agreement) and averaged 0.27 (fair agreement), with a 95% confidence interval (CI) of 0.24 to 0.29. For total cortical inflammation (ti score), pairwise weighted kappa values ranged from 0.08 to 0.52. For inflammation in the scarred cortex (i-IFTA score), pairwise weighted kappa values ranged from 0.05 to 0.47. The inter-rater reliabilities of the three scorings did not differ significantly from each other (0.27 vs. 0.30 vs. 0.26; P = 0.147 by linear mixed model, LMM). Figure 1 shows the pairwise inter-rater reliabilities of interstitial inflammation scorings grouped in rater pair order. Most kappa values fell below the threshold for moderate agreement. No rater pair showed consistently better or worse inter-rater reliabilities across all three scorings, consistent with a non-significant one-way ANOVA (P = 0.063).

Table 1 Pairwise inter-rater reliabilities of interstitial inflammation scorings.
Figure 1
figure 1

Pairwise inter-rater reliabilities of interstitial inflammation scorings grouped in rater pair order. The upper dashed line indicates a kappa value of 0.4, regarded as the lower limit of moderate agreement. The lower dashed line indicates the overall average of all kappa values. Red, blue, and purple boxes and whiskers indicate the i score, ti score, and i-IFTA score, respectively. Error bars: 95% confidence interval.

Conditional agreement probabilities on scorings

Table 2 shows the conditional agreement probabilities for each rater on each inflammation score. For example, if rater 1 assigned i0 to a specific case, the probability that a random rater would also assign i0 to that case was 49.1%. Similarly, if rater 8 assigned ti0 to a specific case, the probability that a random rater would also assign ti0 to the case was 53.9%. Average conditional agreements ranged from 38.6% to 61.3% for the i score, 36.7% to 58.6% for the ti score, and 37.8% to 61.8% for the i-IFTA score. For many individual scores, chi-squared tests confirmed that raters performed differently (data not shown). However, no single rater had significantly better or worse agreement with other raters (P = 0.712; data not shown).

Table 2 Conditional agreement probabilities for each rater on each inflammation score.

Given the critical interstitial inflammation thresholds for a definite diagnosis of acute T-cell mediated rejection (≥ i2) and chronic active T-cell mediated rejection (≥ ti2 and ≥ i-IFTA2), we dichotomized the scores into two groups: none/mild (score 0/1) and moderate/severe (score 2/3) inflammation (Table 3). Taking the top left cell (Rater 1; i score 0/1; 52.3%) as an example: when Rater 1 assigned an i score of 0 or 1 to a case, a random rater would also assign an i score of 0 or 1 to that case 52.3% of the time. In other words, 47.7% (1 − 52.3%) of the time a random rater would disagree with Rater 1 and assign that case an i score of 2 or 3. The average conditional agreement for i score 0/1 was 57.2%, indicating that 42.8% (1 − 57.2%) of the time, a second random rater would assign an i score of 2/3 to a case that the first rater considered i score 0/1. The average conditional agreements overall were 47.2% for i scores, 50.0% for ti scores, and 47.1% for i-IFTA scores. Therefore, the overall disagreement rate across all scorings was over 50%. As with the original 4-tier scorings, raters performed differently in some individual groups, but no single rater had significantly better or worse agreement with other raters (P = 0.983; data not shown).

Table 3 Conditional agreement probabilities for each rater on dichotomized scores.
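As a minimal illustration of how the dichotomized agreement figures above relate to the raw 0–3 scores, the following Python sketch (our own; the scores array is synthetic and the variable names are assumptions, not part of the study's analysis) dichotomizes the scores and reads off the conditional agreement and its complementary disagreement rate for one rater:

```python
import numpy as np

# Synthetic (cases x raters) array of Banff 0-3 scores, for illustration only.
scores = np.random.default_rng(0).integers(0, 4, size=(90, 8))

# Dichotomize: 0 = none/mild (score 0/1), 1 = moderate/severe (score 2/3).
dichotomized = (scores >= 2).astype(int)

# Conditional agreement for rater 1 (column 0) on the none/mild group:
# among all scores given to the cases this rater called none/mild,
# the fraction that are also none/mild.
mask = dichotomized[:, 0] == 0
agreement = (dichotomized[mask] == 0).mean()
disagreement = 1 - agreement  # e.g. 1 - 0.523 = 0.477 in the Table 3 example
```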

Intra-rater reliabilities were better than inter-rater reliabilities

Table 4 shows the descriptive statistics of intra-rater reliabilities and inter-rater reliabilities on three interstitial inflammation scorings. The intra-rater reliabilities were generally better than the inter-rater reliabilities (0.37 vs. 0.27, 0.49 vs. 0.30, and 0.44 vs. 0.26 for the i score, ti score, and i-IFTA score, respectively).

Table 4 Kappa statistics of intra-rater reliabilities and inter-rater reliabilities.

Pathologists’ practicing patterns of scoring varied

Figure 2 shows the distribution of the scores assigned to the cases by each of the 8 raters. Raters displayed different tendencies in the extent of inflammation they assigned to cases, which became more noticeable when the scores were dichotomized into the clinically relevant groups of none/mild (score 0/1) and moderate/severe (score 2/3) inflammation. For example, rater 5 tended to give lower scores, and rater 2 preferred higher scores. Interestingly, although the exact proportions varied, individual pathologists' tendencies in scoring shared a common pattern across the i score, ti score, and i-IFTA score. Chi-squared tests confirmed the differences in score distribution for both the 4-tier and 2-tier categorizations (P < 0.001; Supplementary Tables 1 and 2). For each scoring, post hoc tests with Bonferroni correction confirmed that rater 5 consistently assigned lower scores than raters 2 and 7, and that scores assigned by rater 6 were also significantly lower than those of rater 2.

Figure 2
figure 2

Distribution of scores assigned to the cases by each of the 8 raters. (a) i score. (b) ti score. (c) i-IFTA score. Note the similar pattern across i, ti, and i-IFTA scores on distributions of none/mild (0/1) vs. moderate/severe (2/3) groups, highlighted by broken lines.

Discussion

Interstitial inflammation scoring (i, ti, and i-IFTA) is routinely carried out and is an essential part of the diagnostic criteria for T-cell mediated rejection in the Banff Classification of Renal Allograft Pathology. Given the important roles of these scorings, reproducibility is essential. However, the reproducibility of the scoring has largely been overlooked, and good reproducibility is usually taken for granted. In contrast, in this study we showed that the inter-rater reliability of interstitial inflammation scoring falls well short of acceptable. Pairwise inter-rater reliabilities of interstitial inflammation scoring were only fair in general and varied widely, with kappa values ranging from unacceptable (0.05) to, at best, moderate (0.52). Pathologists performed no better or worse on any particular scoring, and individual pathologists did not perform significantly differently from each other. Overall, inter-rater reliabilities of interstitial inflammation scoring were unsatisfactory and showed no identifiable pattern that could be used to predict the performance of any pair of raters.

Inflammation in the non-scarred cortex (i score) has been included in the Banff classification since its establishment1, and its inter-rater reliability in the few available studies7,8,9,10,11 was suboptimal (kappa or intraclass correlation coefficient from 0.26 to 0.42). Although not directly comparable, our findings on the Banff i score (mean weighted kappa 0.27) concur with these studies. Interstitial inflammation is routinely assessed in allograft kidney biopsies as well as in native kidney biopsies; the finding of poor inter-rater reliability is therefore not only surprising but also alarming. The Banff classification continuously evolves in response to new evidence, and the concepts of total cortical inflammation (ti score) and inflammation in the scarred cortex (i-IFTA score) were introduced at the 2007 and 2015 meetings, respectively13,14. These scores play important roles in the diagnosis of chronic active T-cell mediated rejection. The inter-rater reliability of the ti score was investigated in a study on preimplantation biopsies, which reported an intraclass correlation coefficient of 0.44 (fair)11. A surprisingly high reliability of i-IFTA scores (pairwise kappa values 0.60 to 0.65) was reported by the Paris group12. That study included only three pathologists from the same study group, which might explain the unusually high agreement. We found that the inter-rater reliabilities of the Banff ti score and i-IFTA score were as unsatisfactory as those of the Banff i score. These findings should be validated in future studies.

The above studies used different reliability coefficients (Cohen's kappa, Fleiss' kappa, and the intraclass correlation coefficient) to investigate inter-rater reliability. Although all the results except those from the Paris group are considered suboptimal by accepted standards15,16, direct comparison is difficult because of the different coefficient constructs and different sample populations. It is also important to bear in mind that intraclass correlation coefficients and kappa statistics depend to some extent on the score distribution whenever agreement differs across the range of a continuous scale or across categories. Using kappa statistics to infer reliability in another population therefore requires that the score distribution of the sample cases be similar to that of the population. Conditional agreement probabilities, by contrast, can be compared and generalized to the population provided the sample is representative.

More importantly, unlike kappa statistics or intraclass correlation coefficients, which are opaque to most nephrologists, transplant surgeons, and renal pathologists, conditional agreement probabilities give a clear view of how well or poorly raters agree with each other and, expressed as percentages, reflect what we encounter in daily practice. For example, a sentence like "There is a 57.2% chance that another pathologist won't agree with you on the extent of interstitial inflammation in this case." is more understandable than "The kappa statistic for the inter-rater reliability of interstitial inflammation is 0.27." A similar approach has been proposed in the field of pathology17. As expected, and in concordance with the kappa statistics, we found that pathologists frequently disagreed with each other on interstitial inflammation scoring. Pathologists performed equally poorly on all three scorings, and no single pathologist performed better than the others. Even after dichotomizing scores into the more clinically relevant groups of none/mild (score 0/1) and moderate/severe (score 2/3) inflammation, disagreement rates were still over 50%. Given that interstitial inflammation is a prerequisite for the diagnosis of T-cell mediated rejection, such poor agreement on scoring might result in differences in pathological diagnosis and potentially impact clinical practice. Accordingly, future studies are warranted to investigate the effects of irreproducible pathological scorings on clinical decisions.

By comparing intra-rater reliabilities with inter-rater reliabilities, we found that pathologists were far more consistent with themselves than with each other. An intriguing and illuminating finding, evident in the charts and supported statistically, is that in contrast to the generally poor inter-rater reliabilities, raters do have consistent intrinsic practice patterns of their own. The distributions of scores assigned by individual raters differed, yet a general intra-rater pattern was visually evident across the three inflammation scorings. Chi-squared tests with post hoc two-proportion tests and Bonferroni correction confirmed these observations, indicating that pathologists have their own tendencies in scoring the extent of interstitial inflammation and that these tendencies were the same for i, ti, and i-IFTA scoring. For example, rater 2 consistently assigned higher scores than other raters on all three scorings. This finding partially explains the poor inter-rater reliabilities across raters and, at the same time, the better intra-rater reliabilities within individuals. The fixed practice pattern could be a double-edged sword. On the one hand, good internal consistency is reassuring at a single institution where one pathologist reads all the biopsies. On the other hand, scores across different centers and studies are unlikely to be comparable, and real-world practice might be hampered by the poor inter-rater reliabilities.

We concur with Marcussen et al., who in the very first study on the inter-rater reliability of the Banff classification noted that more precise criteria for the semiquantitative scores are needed7. The visual analog scales for the Banff ci and ct scores provided in a recent reference guide might serve as an example18. However, an international study has also shown that the inter-rater reliability of the Banff i score (along with many other histological features in kidney biopsy) does not improve with persistent feedback10. Therefore, the potential benefits of educational interventions also need to be evaluated in future work.

In conclusion, in concordance with previous studies, we confirmed that the inter-rater reliability of Banff i scoring is moderate at best. In addition, we found that pathologists' inter-rater reliabilities for the Banff ti score and i-IFTA score were equally poor. Using conditional agreement probabilities, we showed that, on average, agreement on interstitial inflammation scores between any two pathologists was no better than chance, and that the poor inter-rater reliability was at least partially rooted in pathologists' individual preferences when scoring the extent of interstitial inflammation. These findings should alert nephrologists, transplant surgeons, and pathologists to the uncertainty of inflammation scores, and of the Banff classification more broadly, and encourage further investigation into the reasons for and possible remedies of the poor inter-rater reliability.

Methods

Datasets

Kidney allograft biopsies performed at Linkou Chang Gung Memorial Hospital between 2018 and 2020 were used. Fifty cases representing the full spectrum of interstitial inflammation were selected from the archive by one senior renal pathologist (T.C.), who did not participate in the subsequent scoring process for the assessment of inter-rater and intra-rater reliabilities. The slides were scanned and converted to whole-slide images with a NanoZoomer S360 digital slide scanner (C13220-01) at 400X magnification. The study was approved by the Chang Gung Medical Foundation Institutional Review Board (IRB No.: 202200101B0). All experiments were performed in accordance with relevant guidelines and regulations. Written informed consent was waived by the Chang Gung Medical Foundation Institutional Review Board. The study adhered to the Declaration of Helsinki.

Interstitial inflammation scoring by pathologists

Eight renal pathologists from 8 different hospitals in Taiwan carried out the scoring independently. All participating pathologists had received one or two years of dedicated training in renal pathology. Their ages ranged from 32 to 64 years, their independent sign-out experience in kidney allograft pathology ranged from 2 to 24 years (mean = 7.6), and they worked in institutions ranging from primary district hospitals to tertiary medical centers. All pathologists were provided with whole-slide images of hematoxylin and eosin-, periodic acid-Schiff-, periodic acid-methenamine silver-, and trichrome-stained sections of each case and performed interstitial inflammation scoring according to their usual practice following the Banff classification18. The scorings were reported using the Banff classification for inflammation in the non-scarred cortex (i0: absent/minimal, < 10% of non-scarred cortex inflamed; i1: mild, 10–25%; i2: moderate, 26–50%; i3: severe, > 50%), total cortical inflammation (ti0: absent/minimal, < 10%; ti1: mild, 10–25%; ti2: moderate, 26–50%; ti3: severe, > 50%), and inflammation in the scarred cortex (i-IFTA0: absent/minimal, < 10% of scarred cortex inflamed, or cortical IFTA involving < 10% of the cortex; i-IFTA1: mild, 10–25% of scarred cortex inflamed; i-IFTA2: moderate, 26–50% of scarred cortex inflamed; i-IFTA3: severe, > 50% of scarred cortex inflamed). Five cases were excluded from the subsequent analysis because scores were missing from at least one rater. The remaining 45 cases were scored twice, at an interval of at least three months, for the evaluation of intra-rater reliability. Because a rater might assign different scores in the first and second rounds, scores from both rounds were pooled for the evaluation of inter-rater reliability, yielding 90 scores per rater and a balanced interpretation. For simplicity, we refer to these 90 scored instances as cases in the manuscript.
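As a plain illustration of the percentage thresholds quoted above (not part of the study's workflow), the mapping from extent of inflammation to a 0–3 score can be sketched as follows; the function name and the simplified handling of the additional i-IFTA rule are our own assumptions:

```python
def banff_inflammation_score(percent_inflamed: float) -> int:
    """Map the percentage of cortex inflamed (non-scarred cortex for i,
    total cortex for ti, scarred cortex for i-IFTA) to a 0-3 score using
    the cut-offs quoted above. Illustrative sketch only; note that i-IFTA
    additionally requires at least 10% cortical IFTA before grading.
    """
    if percent_inflamed < 10:
        return 0  # absent/minimal
    if percent_inflamed <= 25:
        return 1  # mild
    if percent_inflamed <= 50:
        return 2  # moderate
    return 3      # severe
```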

Conditional agreement probabilities

Conditional agreement probabilities were calculated from the observed concordant scores. For each case, there were 8 scores assigned by the 8 raters. The conditional agreement of a particular rater on a specific score was calculated as the count of that score divided by the count of all scores assigned by all raters to the cases that the particular rater had given that score. For example, regarding inflammation of the non-scarred cortex (i score), 54 of the 90 cases were scored i0 by rater 1. These 54 cases carried 432 (8 raters × 54 cases) scores in total, of which 212 were i0 (and 220 were i1, i2, or i3), so the conditional agreement for rater 1 on score i0 is 212 ÷ 432 = 49.1%, meaning that 49.1% of the time a random rater would also assign i0 to a case scored i0 by rater 1. We regard our raters as a representative sample of the entire renal pathologist population, and singling out one rater as the first rater for a case has a negligible influence on that population; therefore, the first rater's scores were not removed from the numerator and denominator.
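For concreteness, the calculation described above can be expressed as a short function. The following Python sketch is our own (the published analysis was carried out in SPSS); the array layout and names are assumptions, and the reference rater's own scores are deliberately kept in both numerator and denominator, as stated above.

```python
import numpy as np

def conditional_agreement(scores: np.ndarray, rater: int, target: int) -> float:
    """Conditional agreement probability for `rater` on score `target`.

    `scores` is a (cases x raters) array of Banff 0-3 scores. Returns the
    fraction of all scores -- assigned by every rater to the cases that
    `rater` scored `target` -- that also equal `target`. The reference
    rater's own scores are kept in the numerator and denominator.
    """
    selected = scores[scores[:, rater] == target]  # cases the rater scored `target`
    if selected.size == 0:
        return float("nan")
    return float((selected == target).mean())

# Worked example from the text: if rater 1 scored 54 of 90 cases as i0 and
# 212 of the resulting 8 x 54 = 432 scores are i0, the function returns
# 212 / 432 ~= 0.491.
```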

Statistics

Inter-rater and intra-rater reliabilities were evaluated by pairwise linearly weighted Cohen's kappa. The weighted kappa coefficients of the different scorings (i, ti, and i-IFTA) were compared using a linear mixed model (LMM) that included two random effects: a random intercept for the subject (the case) and a random slope for scoring. The LMM accounts for the dependency among the 28 pairwise kappa coefficients and the 3 scorings. The distributions of scores among the eight raters were compared using the chi-squared test. Comparisons between proportions were carried out with two-proportion tests. Data analyses were conducted using SPSS 26 (IBM SPSS Inc., Chicago, Illinois). A P value less than 0.05 was considered statistically significant.
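For readers who wish to reproduce the kappa step outside SPSS, a minimal Python sketch of the pairwise linearly weighted Cohen's kappa is shown below; it is our own illustration (scikit-learn is an assumption, not the software used in the study), and the LMM comparison is not reproduced here.

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score

def pairwise_weighted_kappa(scores: np.ndarray) -> dict:
    """Linearly weighted Cohen's kappa for every rater pair.

    `scores` is a (cases x raters) array of ordinal Banff 0-3 scores;
    returns {(rater_a, rater_b): kappa} for all C(raters, 2) pairs
    (28 pairs for 8 raters).
    """
    n_raters = scores.shape[1]
    return {
        (a, b): cohen_kappa_score(scores[:, a], scores[:, b], weights="linear")
        for a, b in combinations(range(n_raters), 2)
    }
```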

Ethical approval

The studies involving human participants were reviewed and approved by Chang Gung Medical Foundation Institutional Review Board. Written informed consent was waived by the Chang Gung Medical Foundation Institutional Review Board.