Reproducibility of the NEPTUNE descriptor-based scoring system on whole-slide images and histologic and ultrastructural digital images

Barisoni, Laura; Troost, Jonathan P; Nast, Cynthia; Bagnasco, Serena; Avila-Casado, Carmen; Hodgin, Jeffrey; Palmer, Matthew; Rosenberg, Avi; Gasim, Adil; Liensziewski, Chrysta; Merlino, Lino; Chien, Hui-Ping; Chang, Anthony; Meehan, Shane M; Gaut, Joseph; Song, Peter; Holzman, Lawrence; Gibson, Debbie; Kretzler, Matthias; Gillespie, Brenda W; Hewitt, Stephen M

doi:10.1038/modpathol.2016.58

Original Article
Published: 22 April 2016

Laura Barisoni¹^na1,
Jonathan P Troost²,
Cynthia Nast³^na1,
Serena Bagnasco⁴,
Carmen Avila-Casado⁵,
Jeffrey Hodgin⁶,
Matthew Palmer⁷,
Avi Rosenberg⁸,
Adil Gasim⁹,
Chrysta Liensziewski¹⁰,
Lino Merlino¹,
Hui-Ping Chien¹¹,
Anthony Chang ORCID: orcid.org/0000-0002-6877-5510¹²,
Shane M Meehan¹³,
Joseph Gaut¹⁴,
Peter Song¹⁵,
Lawrence Holzman¹⁶,
Debbie Gibson²,
Matthias Kretzler¹⁰,
Brenda W Gillespie¹⁵ &
…
Stephen M Hewitt ORCID: orcid.org/0000-0001-8283-1788⁸^na1

Modern Pathology volume 29, pages 671–684 (2016)

Subjects

Translational research

Abstract

The multicenter Nephrotic Syndrome Study Network (NEPTUNE) digital pathology scoring system employs a novel and comprehensive methodology to document pathologic features from whole-slide images, immunofluorescence and ultrastructural digital images. To estimate inter- and intra-reader concordance of this descriptor-based approach, data from 12 pathologists (eight NEPTUNE and four non-NEPTUNE) with experience from training to 30 years were collected. A descriptor reference manual was generated and a webinar-based protocol for consensus/cross-training implemented. Intra-reader concordance for 51 glomerular descriptors was evaluated on jpeg images by seven NEPTUNE pathologists scoring 131 glomeruli three times (Tests I, II, and III), each test following a consensus webinar review. Inter-reader concordance of glomerular descriptors was evaluated in 315 glomeruli by all pathologists; interstitial fibrosis and tubular atrophy (244 cases, whole-slide images) and four ultrastructural podocyte descriptors (178 cases, jpeg images) were evaluated once by six and five pathologists, respectively. Cohen's kappa for inter-reader concordance for 48/51 glomerular descriptors with sufficient observations was moderate (0.40<kappa≤0.60) for 17 and good (0.60<kappa≤0.80) for 8, for 52% with moderate or better kappas. Clustering of glomerular descriptors based on similar pathologic features improved concordance. Concordance was independent of years of experience, and increased with webinar cross-training. Excellent concordance was achieved for interstitial fibrosis and tubular atrophy. Moderate-to-excellent concordance was achieved for all ultrastructural podocyte descriptors, with good-to-excellent concordance for descriptors commonly used in clinical practice, foot process effacement, and microvillous transformation. The NEPTUNE digital pathology scoring system enables novel morphologic profiling of renal structures.
For all histologic and ultrastructural descriptors tested with sufficient observations, moderate-to-excellent concordance was seen for 31/54 (57%). Descriptors not sufficiently represented will require further testing. This study proffers the NEPTUNE digital pathology scoring system as a model for standardization of renal biopsy interpretation extendable outside the NEPTUNE consortium, enabling international collaborations.

Main

The challenge of inter-reader concordance on individual morphologic features of diagnostic renal biopsies is well documented and is highlighted in large collaborative studies.^{1, 2, 3, 4, 5} As the complexity of morphologic characterization and the number of features increase, it becomes more difficult to ensure intra- and inter-reader concordance. Where one feature performs poorly, a related and potential surrogate feature may perform excellently and thus be preferable for routine diagnostic use. In the future, conventional interpretative diagnoses may be revised to include combined morphologic and molecular signatures.^{1, 6, 7} With these changes in pathology practice, it is important to assess the performance of individual metrics. The past approach has been to develop metrics that demonstrate high intra-pathologist concordance and high to good inter-pathologist concordance.

The availability of digital whole-slide images allows nephropathologists to overcome limitations of conventional light microscopy analysis, and to address concordance.^{8, 9} Recent studies have demonstrated the high concordance and reliability of whole-slide images compared with conventional light microscopy evaluation for diagnoses of renal allograft rejection, as well as for individual Banff morphologic criteria.^{10, 11} Morphologic analysis of annotated peritubular capillaries on whole-slide images in Fabry’s disease suggests that by pre-selecting specific structures to be scored, achievable only by digital imaging, concordance is increased.¹²

The multicenter Nephrotic Syndrome Study Network (NEPTUNE) exemplifies a new model of systematic digital pathology review. The NEPTUNE Digital Pathology Protocol documents the whole-slide images-based scoring protocol, including selection of specific structures (eg, glomeruli) and the application of the NEPTUNE Digital Pathology Scoring System for comprehensive scoring of glomerular, vascular and tubulointerstitial morphologic features (descriptors).^{13, 14}

This study aimed to assess inter- and intra-reader concordance, and the effect of consensus review and training sessions on the NEPTUNE Digital Pathology Scoring System. The ultimate goal is to establish new models for standardization of renal biopsy morphologic profiling, and to test validated descriptors as potential predictors of diagnosis, prognosis, and response to treatment.

Materials and methods

Digital Infrastructure

Pathology material was obtained from the NEPTUNE Digital Pathology Repository, which stores whole-slide images (from glass slides scanned at 40× on Hamamatsu and Aperio scanners), immunofluorescence and electron microscopy (EM) images, and electronic copies of de-identified original pathology reports from cases of focal segmental glomerulosclerosis, minimal change disease, and membranous nephropathy.^{15, 13}

Preparation and Training for Scoring

Descriptor reference manual and image library

A reference manual was generated and refined by webinar consensus meetings by NEPTUNE pathologists (Supplementary Figure 1). Descriptors were evaluated for clarity prior to initiation of the concordance tests. The manual was posted in the NEPTUNE digital pathology repository (see Table 1 for descriptors used in this study and Supplementary Table 6 for the comprehensive descriptor reference manual). A library of representative images was created and posted in the NEPTUNE digital pathology repository for independent review prior to initiation of the study, and then removed during the trial.

Table 1 Post study revised definitions of descriptors included in the current study

Electronic scoring documents, material, and test instructions

Separate electronic scoring templates were generated for tubulointerstitial, ultrastructural, and glomerular scoring. The electronic matrix templates were pre-populated with '0' (absent) scoring, so reviewers needed only to select descriptors applicable to a given case/image; semiquantitative or quantitative scores required clicking on a dropdown list. For better visualization, the color of the selected cell automatically changed when a value other than 0 was selected ('0'=blue to '1'= red) (Supplementary Figure 2).

For glomerular scoring, electronic scoring templates included the list of glomeruli, and jpeg images of glomeruli were provided to all pathologists. Separate electronic scoring sheets with the lists of cases to access in the NEPTUNE digital pathology repository were provided to test tubulointerstitial descriptors such as interstitial fibrosis/tubular atrophy and ultrastructural podocyte features. Specific instructions for each of the metrics were made available. Training for data entry on the electronic scoring sheets was done during webinar meetings prior to the concordance tests.

Concordance Study Protocol

Image selection

For glomerular histologic descriptors, jpeg (joint photographic experts group) images (stained with hematoxylin and eosin, periodic acid-Schiff, trichrome, and silver) were obtained from both annotated whole-slide images from the NEPTUNE digital pathology repository and images previously used in a concordance study for the Columbia classification.¹⁶ For tubulointerstitial and ultrastructural podocyte descriptors, whole-slide images and EM jpeg images stored in the NEPTUNE digital pathology repository were used. All images were from previously anonymized whole-slide images or EM digital images collected in the NEPTUNE digital pathology repository following Institutional Review Board guidelines and upon approval in each participating center.

A total of 315 images of glomeruli were hand selected based on quality of the image and representation of descriptors; these images included classic examples as well as more controversial lesions. Interstitial fibrosis and tubular atrophy scoring was tested on whole-slide images from 244 cases including minimal change disease, focal segmental glomerulosclerosis, and membranous nephropathy; podocyte descriptors were tested on 178 ultrastructural images (minimum of five EM images/case) from the minimal change disease/focal segmental glomerulosclerosis cohort.

Participating pathologists

Twelve pathologists participated in the scoring, including eight NEPTUNE pathologists (P1–8, seven of whom participated in glomerular scoring, five in interstitial fibrosis and tubular atrophy scoring, and five in podocyte scoring) and four pathologists recruited outside the NEPTUNE consortium (non-NEPTUNE pathologists) (P9–12, of whom three participated in glomerular scoring and one in interstitial fibrosis and tubular atrophy scoring). The level of experience ranged from fellowship level (P7 and P9) to >10 years of experience in renal pathology (Supplementary Table 1).

Glomerular descriptor concordance tests

To assess intra- and inter-reader concordance and the effect of cross-training/consensus review on inter-reader concordance, 131 images of glomeruli were scored three times (Test I, II, and III) by seven NEPTUNE pathologists (Supplementary Figure 1). Webinar reviews occurred 2–4 weeks after each test. Washout intervals between tests varied from 2.5 to 4 months. To increase the number of inter-reader observations, 184 additional images were added to Test II for NEPTUNE pathologists. The 315 images were also scored once by three non-NEPTUNE pathologists who had one webinar training session. Images of the 131 glomeruli for Tests I and III and 315 glomeruli for Test II were reviewed during consensus webinar meetings to increase concordance in descriptor recognition.

Intra-reader concordance of glomerular descriptors was estimated by comparing each pathologist's scores from Test I vs Test II and Test II vs Test III. These estimates of concordance may be reduced as a result of webinar training; ie, gained knowledge about scoring may reduce consistency with previous scoring.

Inter-reader concordance of descriptors was estimated separately for each Test (I, II, and III), and involved computing concordance for each pair of pathologists, and pooling these estimates over all possible pairs. In addition to the overall estimate of inter-reader concordance, we were interested in four research questions: (a) whether continuous cross-training improved concordance, (b) whether concordance differed by the pathologist's experience, (c) whether concordance was higher using clusters of descriptors sharing similar features than for individual descriptors, and (d) whether concordance was maintained outside the NEPTUNE investigators.

Tubulointerstitial descriptor concordance tests

To test concordance of non-glomerular parameters, we considered the most clinically relevant tubulointerstitial parameters,^{17, 18, 19, 20, 21, 22} the percentage (0–100%) of cortex involved by interstitial fibrosis and tubular atrophy, for 244 cases. Conventional pathology practice includes semiquantitative assessment of interstitial fibrosis and tubular atrophy. Therefore, interstitial fibrosis and tubular atrophy scoring was not preceded by webinar training and was performed only once by six pathologists.

Podocyte descriptor concordance tests

Although ultrastructural evaluation of podocyte morphology is common in pathology practice, estimates of some ultrastructural parameters are often not reported. Thus, the podocyte descriptor test was preceded by a webinar session to review definitions reflecting effacement, condensation of actin-based cytoskeleton, microvillous transformation, and loss of primary processes. Ultrastructural podocyte descriptors were scored by five pathologists with 1 to >10 years of experience on 178 cases (minimal change disease/focal segmental glomerulosclerosis) as follows: foot process effacement: 0=1–10%, 1+=11–25%, 2+=26–50%, 3+=51–75%, and 4+=>75%; condensation of actin-based cytoskeleton and microvillous transformation: 0=not observed, 1+=segmental (≤50%), 2+=global (>50%); loss of primary processes was scored as absent (0) or present (1+).
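The ordinal scales described above can be written down as a small lookup. The following is only an illustrative sketch: the function names and the `percent` inputs are hypothetical, the upper bound of each bin is assumed to be inclusive, and values below 1% (which the stated bins do not cover) are assumed to fall into score 0. In the study the pathologists entered the ordinal score directly.

```python
def effacement_score(percent):
    """Map percent foot process effacement to the 0 to 4+ scale.

    Assumptions (not stated in the text): each bin's upper bound is
    inclusive, and values below 1% are scored 0.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Upper bounds of the bins for scores 0, 1+, 2+ and 3+.
    for score, upper in enumerate((10, 25, 50, 75)):
        if percent <= upper:
            return score
    return 4  # >75%


def extent_score(percent):
    """Score actin condensation or microvillous transformation.

    0 = not observed, 1 = segmental (<=50%), 2 = global (>50%).
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if percent == 0:
        return 0
    return 1 if percent <= 50 else 2


if __name__ == "__main__":
    for p in (5, 11, 50, 51, 76):
        print(p, effacement_score(p), extent_score(p))
```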

Statistical Methods

For the (dichotomous) glomerular descriptors, intra-reader agreement was assessed by both Cohen's kappa and pathologist-specific counts of the number of descriptors that the pathologist rated the same way in two consecutive readings. Inter-reader agreement between pairs of pathologists was also estimated using Cohen's kappa, but Fleiss' kappa²³ was used to estimate inter-reader agreement pooled across all pathologists. The variability in kappa values across pairs of pathologists for each descriptor is shown using boxplots.
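As an illustration of the pairwise calculation, Cohen's kappa for one dichotomous descriptor can be computed from two readers' 0/1 scores on the same glomeruli and then repeated over every pair of readers. This is a minimal sketch using the standard definition of kappa, not the study's analysis code; the function names and the reader labels (P1, P2, P3) with their scores are made up for the example.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two readers' scores on the same set of glomeruli."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(a)
    # Observed proportion of glomeruli on which the two readers agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from each reader's marginal frequencies.
    labels = set(a) | set(b)
    expected = sum((a.count(k) / n) * (b.count(k) / n) for k in labels)
    if expected == 1.0:
        return float("nan")  # undefined when both readers use a single label
    return (observed - expected) / (1.0 - expected)


def pairwise_kappas(scores_by_reader):
    """Kappa for every pair of readers; keys are (reader, reader) tuples."""
    readers = sorted(scores_by_reader)
    return {
        (r1, r2): cohen_kappa(scores_by_reader[r1], scores_by_reader[r2])
        for r1, r2 in combinations(readers, 2)
    }


if __name__ == "__main__":
    # Hypothetical scores for one descriptor on ten glomeruli.
    scores = {
        "P1": [1, 1, 0, 0, 1, 0, 0, 0, 1, 0],
        "P2": [1, 0, 0, 0, 1, 0, 0, 1, 1, 0],
        "P3": [1, 1, 0, 0, 1, 0, 0, 0, 1, 0],
    }
    for pair, kappa in pairwise_kappas(scores).items():
        print(pair, round(kappa, 3))
```

The spread of the values returned by `pairwise_kappas` is the kind of per-descriptor variability that a boxplot would summarize.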

Scoring was also performed for clusters of glomerular descriptors sharing morphologic similarities. A cluster was judged to be present if at least one descriptor of the cluster was present, and Fleiss' kappa was used to assess pooled inter-reader agreement among pathologists. The kappa statistic ranges from −1 (perfect disagreement) to 1 (perfect agreement), with a value of 0 indicating agreement expected by chance alone. Kappa statistics were categorized and interpreted as: >0.80 (excellent); 0.61–0.80 (good); 0.41–0.60 (moderate); 0.21–0.40 (fair); 0–0.20 (poor); and <0 (no agreement) (http://healthcare-economist.com/2011/11/02/kappa-statistic). Because kappa is smaller with lower prevalence of the finding under observation, we report the range over pathologists of the number of glomeruli in which each descriptor was observed. Although we calculated kappa statistics for all descriptors with at least one pathologist rating, some results exclude descriptors with insufficient observations, defined as the maximum over all pathologists of the number of glomeruli in which the descriptor was observed being less than five.
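The pooled statistic, the cluster rule, the interpretation bands and the sufficiency rule can be sketched together as follows. This is an illustrative implementation of the textbook Fleiss' kappa, not the study's code; all function names are hypothetical, and the band boundaries are assumed to be exclusive at the lower end (for example, moderate is 0.40<kappa≤0.60), which matches the ranges given in the abstract.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings has one row per glomerulus, one score per reader."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("each glomerulus needs the same number (>=2) of readers")
    categories = sorted({s for row in ratings for s in row})
    # counts[i][j]: number of readers assigning category j to glomerulus i.
    counts = [[row.count(c) for c in categories] for row in ratings]
    p_j = [sum(c[j] for c in counts) / (n_items * n_raters)
           for j in range(len(categories))]
    p_i = [(sum(x * x for x in c) - n_raters) / (n_raters * (n_raters - 1))
           for c in counts]
    p_bar = sum(p_i) / n_items
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (p_bar - p_e) / (1.0 - p_e)


def cluster_present(descriptor_flags):
    """A cluster is present if at least one of its descriptors is present."""
    return int(any(descriptor_flags))


def interpret_kappa(kappa):
    """Label a kappa value (lower bounds assumed exclusive)."""
    if kappa < 0:
        return "no agreement"
    for lower, label in ((0.80, "excellent"), (0.60, "good"),
                         (0.40, "moderate"), (0.20, "fair")):
        if kappa > lower:
            return label
    return "poor"


def has_sufficient_observations(positive_counts_by_reader, minimum=5):
    """Sufficient if at least one reader observed the descriptor >= minimum times."""
    return max(positive_counts_by_reader) >= minimum


if __name__ == "__main__":
    # Hypothetical 0/1 scores: four glomeruli, three readers.
    ratings = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
    kappa = fleiss_kappa(ratings)
    print(round(kappa, 3), interpret_kappa(kappa))
```

To score a cluster, each reader's descriptor flags for a glomerulus would first be collapsed with `cluster_present`, and the resulting 0/1 values passed to `fleiss_kappa` in the same way as an individual descriptor.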

We investigated the four research questions listed above as follows: (a) to assess whether inter-reader concordance could improve with cross-training we evaluated the number of descriptors that increased in concordance between Tests I and II, and between Tests I and III; (b) to assess whether inter-reader concordance depended on the pathologist's years of experience, we compared the kappas from all pathologists with the kappas excluding the trainees; (c) to assess the effect of scoring descriptor clusters, we visually compared cluster concordance with individual concordance estimates for each descriptor in the cluster; and (d) to assess whether concordance was maintained outside NEPTUNE investigators we compared concordance among the three non-NEPTUNE pathologists and among the seven NEPTUNE pathologists using the 315 glomerular images from Test II. The NEPTUNE and non-NEPTUNE summary kappas were compared by paired t-test.
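Two of these comparisons reduce to simple arithmetic on per-descriptor kappas: the share of descriptors whose kappa rose between two tests, and a paired t statistic on two sets of kappas for the same descriptors. The sketch below is an assumption about how such a comparison could be set up, with hypothetical function names and made-up kappa values; it returns the t statistic and degrees of freedom only, leaving the P value to a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def fraction_improved(kappa_before, kappa_after):
    """Share of descriptors (present in both dicts) whose kappa increased."""
    shared = [d for d in kappa_before if d in kappa_after]
    if not shared:
        raise ValueError("no descriptors in common")
    improved = sum(kappa_after[d] > kappa_before[d] for d in shared)
    return improved / len(shared)


def paired_t_statistic(x, y):
    """Paired t statistic and degrees of freedom for matched kappa values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [a - b for a, b in zip(x, y)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("t is undefined when all differences are identical")
    return mean(diffs) / (sd / sqrt(len(diffs))), len(diffs) - 1


if __name__ == "__main__":
    # Hypothetical kappas for three descriptors in two tests.
    test_1 = {"descriptor_a": 0.30, "descriptor_b": 0.50, "descriptor_c": 0.60}
    test_3 = {"descriptor_a": 0.40, "descriptor_b": 0.50, "descriptor_c": 0.70}
    print(fraction_improved(test_1, test_3))
    print(paired_t_statistic([0.5, 0.6, 0.7], [0.4, 0.4, 0.4]))
```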

For the continuous interstitial fibrosis and tubular atrophy scores, inter-reader agreement was estimated using Pearson's correlation coefficient on all pairs of pathologists (the pathologist with more vs fewer years of experience). For the ordinal podocyte descriptors, Kendall's coefficient of concordance was used to assess inter-reader agreement for pairs of pathologists.
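Both statistics can be sketched in a few lines. The code below uses the standard Pearson formula and Kendall's coefficient of concordance (W) with the usual correction for tied ranks, which matters for ordinal scores with few levels; whether the original analysis applied a tie correction is not stated, so that is an assumption, as are the function names and the example scores.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two readers' continuous scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two cases")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when a reader gives a constant score
    return sxy / sqrt(sxx * syy)


def average_ranks(scores):
    """1-based ranks, with tied scores sharing the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def kendalls_w(scores_by_reader):
    """Kendall's W with tie correction; one score list per reader."""
    m, n = len(scores_by_reader), len(scores_by_reader[0])
    if m < 2 or any(len(s) != n for s in scores_by_reader):
        raise ValueError("need >=2 readers scoring the same cases")
    ranked = [average_ranks(s) for s in scores_by_reader]
    totals = [sum(r[i] for r in ranked) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    # Tie term: sum of (t^3 - t) over each group of tied scores per reader.
    ties = sum(scores.count(v) ** 3 - scores.count(v)
               for scores in scores_by_reader for v in set(scores))
    denominator = m ** 2 * (n ** 3 - n) - m * ties
    if denominator == 0:
        return float("nan")  # every reader gave a single tied score
    return 12 * s / denominator


if __name__ == "__main__":
    # Hypothetical percent fibrosis from two readers, then ordinal scores.
    print(pearson_r([0, 10, 20, 30], [5, 15, 25, 35]))
    print(kendalls_w([[0, 1, 2, 4, 4], [0, 2, 2, 3, 4]]))
```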

Results

Intra-reader Concordance for Glomerular Descriptors

When comparing glomerular intra-reader concordance Test I vs Test II and Test II vs Test III, the average intra-reader concordance for glomerular descriptors increased with cross-training/consensus webinars (Supplementary Table 2 and Supplementary Table 3). When comparing glomerular concordance Test II vs Test III, there were four descriptors for which all pairs of readers had good concordance, and 11 descriptors where all pairs had at least moderate concordance (Supplementary Table 3). Interestingly, inconsistent intra-reader concordance was noted for lesions of segmental sclerosis corresponding to 'perihilar' and 'not otherwise specified' variants of the Columbia classification.²⁴ At least moderate intra-reader agreement was found for most of the descriptors commonly associated with segmental sclerosis or collapse, such as various forms of hyalinosis, podocyte hypertrophy, foam cells or periglomerular fibrosis. Unexpected inconsistency in intra-reader agreement was noted for basic lesions such as global sclerosis, although other forms of global damage (obsolescence, global collapse, deflation and spikes) were more consistently recognized.

Inter-reader Concordance for Glomerular Descriptors

For the 315 glomeruli (Test II), 48/51 glomerular descriptors had sufficient data for evaluation. The kappa statistics from the combined NEPTUNE and non-NEPTUNE pathologists represent our current best summaries of this investigation. Based on these results, 8/48 descriptors had good inter-reader concordance; these included descriptors indicating global lesions (global spikes, deflation, collapse, and obsolescence) and segmental lesions (foam cells, cellular tip lesion, segmental deflation, necrosis). An additional 17/48 descriptors had moderate concordance for a total of 52% of descriptors tested having an inter-reader Cohen’s kappa >0.40 (Table 3). Concordance between pairs of pathologists varied widely by pair and by descriptor, but most had moderate or better concordance (Figure 1a and b).

The overall inter-reader concordance increased with cross-training from Test I through Test III among NEPTUNE pathologists in the set of 131 glomerular images. Of the 51 glomerular descriptors tested, 19 were not sufficiently represented to evaluate inter-reader concordance. For 32 descriptors with sufficient data for comparison, 56% had improved kappas between Tests I and II, and 63% between Tests I and III. Five descriptors improved from initial kappas of moderate to good or excellent, including global lesions (such as global deflation), segmental lesions (mid-glomerular segmental sclerosis and hyalinosis at the vascular pole), and the descriptor indicating no abnormalities. An additional three descriptors (cellular non-tip lesions, periglomerular fibrosis, and global podocyte hyperplasia) increased performance from fair/poor to moderate or good (Table 1, Table 2, Figure 1a and b).

Table 2 Inter-reader concordance of all 51 glomerular descriptors by Test (I, II, and III) and NEPTUNE/non-NEPTUNE affiliation

As expected, better concordance was achieved in most cases by clustering descriptors together. Compared with the cluster kappas, most component kappas are substantially smaller. However, for five of the clusters, a single component kappa was larger than the cluster kappa, showing that clustering often, but not always, leads to optimum concordance. Concordance improved when selected descriptors for sclerosing/obliterating lesions or for epithelial cell (podocytes) damage were combined (Table 3).

Table 3 Inter-reader agreement (Cohen's kappa) of NEPTUNE pathologists for clusters of glomerular descriptors assessed in Test I, II, and III (131 glomeruli)

Concordance was independent of years of experience; analysis excluding the data generated by the trainees did not significantly change the overall concordance (data not shown). NEPTUNE and non-NEPTUNE pathologists had comparable overall inter-reader kappas (mean difference between kappas=0.015, paired t-test P=0.502).

Inter-reader Concordance for Tubulointerstitial Parameters

Excellent concordance was seen for both interstitial fibrosis and tubular atrophy, independent of years of experience (Figure 2; Figure 3d and e; Supplementary Table 4). In addition, overall concordance for interstitial fibrosis and tubular atrophy scoring remained consistently excellent when analyzed separately for each disease (minimal change disease, focal segmental glomerulosclerosis, and membranous nephropathy; data not shown).

Inter-reader Concordance for Podocyte Descriptors

Concordance was excellent for foot process effacement, good for microvillous transformation and condensation of the actin cytoskeleton, and moderate for loss of primary processes (Figures 3f–i and 4; Supplementary Table 5).

Descriptor Reference Manual Revision

At the end of the study the descriptor reference manual was revised during several consensus webinar sessions that included NEPTUNE pathologists as well as pathologists outside the consortium, and language was added to improve clarity of definitions (Supplementary Table 6).

Discussion

To take advantage of and coordinate with new findings being discovered in molecular nephrology, renal pathologists must identify methodologies and approaches that allow for better integration of morphologic evaluation, creating more compelling diagnostic paradigms.¹⁴ Furthermore, it is critical to design and implement classification systems for clinical research that are more meaningful with regard to novel renal biomarkers, prognosis, and treatment approaches.¹ The use of such morphologic observations requires concordance of pathologic analysis across diseases, level of training and experience. One goal of the NEPTUNE consortium is to identify reproducible morphologic variables that can be implemented in clinical practice by creating a new taxonomy of renal diseases. Toward that goal, we carried out a study testing intra- and inter-pathologist concordance using a set of 51 glomerular, two tubulointerstitial and four ultrastructural features.

The first critical step toward a robust morphologic evaluation was the establishment of well-defined morphologic criteria documented in a reference manual. The NEPTUNE digital pathology scoring system reference manual is comprehensive of features included by other classification systems and we referred to previously published criteria for some of the descriptors;^{5, 25} however, many of the descriptors listed in the NEPTUNE digital pathology scoring system, although used in clinical practice to some degree, were not thoroughly defined by consensus and organized in a comprehensive reference manual prior to this study.

An innovative contribution of this study is the development of a protocol exploiting digital pathology technology. The introduction of digital pathology into large-scale glomerular disease research has enabled simultaneous remote access of multiple users.^{1, 11, 13, 17, 26, 27} The application of digital technology, and of software for annotation of glomeruli, offers the opportunity to systematically eliminate glomerular selection bias, providing the basis for potentially higher inter-observer concordance.¹² Although it is intuitive that there are minimal differences in concordance when scoring interstitial fibrosis and tubular atrophy by conventional light microscopy or whole-slide images, recognizing the value of specifically selecting structures to be evaluated, a recent concordance study was conducted using single digital images of glomeruli to identify the five patterns of focal segmental glomerulosclerosis (Columbia classification). This strategy, eliminating the glomerular selection bias, resulted in an overall good agreement among the six pathologists.¹⁶ In our study, we partially mimic the strategy utilized by Meehan et al¹⁶ by capturing digital images of individual annotated glomeruli from the whole-slide images of the 400 cases stored in the NEPTUNE digital pathology repository. By controlling the modality of the image review, the observations made, while under the control of the pathologist, were consistent with regard to image quality and to some extent magnification between reviewers. Using this approach, we were able to apply an 'object oriented' evaluation of performance, rather than a specimen-based approach.

Concordance of individual descriptors and factors contributing to concordance: most concordance studies are based on a one-time assessment. In our study, we demonstrated that concordance is modifiable by cross-training over time. This approach was tested in a study on thymic epithelial neoplasms, in which concordance improved after webinar training, confirming the value of digital pathology as an educational tool.²⁷ Although the inter-reader discrepancies in our study may appear significant, the total number of parameters involved for which pathologists needed cross-training, compared with a single diagnosis of epithelial neoplasia in Wang’s study, was much greater. Intra-reader concordance also improved with cross-training and webinar-based consensus as more detailed and objective criteria were provided to the participants, lessening individual reluctance in changing internal/subjective criteria. Thus, we still consider our observations encouraging for the systematic application of webinar cross-training to increase intra- and inter-reader concordance.

The best performance was obtained by the interstitial fibrosis and tubular atrophy score, with overall excellent inter-reader concordance despite the lack of previous webinar training. Similar high concordance was obtained in the Oxford classification study.⁵ We hypothesize that this excellent performance is a consequence of the routine scoring of interstitial fibrosis and tubular atrophy in renal biopsy practice. Similarly, concordance was proportional to the frequency with which the ultrastructural podocyte descriptors are used in routine renal pathology assessment of biopsies; the highest concordance was recorded for the most commonly used parameter (foot process effacement) and the lowest for the descriptor used only experimentally (loss of primary processes).²⁸ These data raise the question of whether descriptors for which familiarity and training are inadequate should be used and included in future studies. Developing robust training tools and metrics of performance is critical, as these infrequently assessed lesions may demonstrate correlation with clinical or molecular parameters and may add value to morphologic analysis or classifications. The continuous cross-training approach may ultimately prevent future classification systems from excluding morphologic criteria initially not performing well, but that may still have great potential as predictors of outcome. This concept may alter the current approach to generating classifications, which currently select for morphologic features based on concordance, to including initially less reproducible but valuable observational data by introducing post-training amendment and adjustment options. Should this occur, greater use of such features in routine clinical practice would then increase familiarity and automatically improve concordance.

The uneven level of concordance of some glomerular histologic descriptors is not easily explained. Although we eliminated the glomerular selection bias and provided a prefilled electronic scoring sheet listing all possible descriptors, lack of reproducibility for some descriptors may derive from failure to see or forgetting to mark a specific lesion among others affecting the same glomerulus, whereas for descriptors that are present in isolation, such as global spikes, it may have been easier to maintain the focus. Whereas global collapse or capillary wall spikes had expected high concordance, variable concordance was observed for subtypes of global or segmental sclerosis, although when consolidated under global or segmental obliteration, overall performance increased. The high concordance of segmental obliteration as an overall category confirms the data obtained by the Oxford classification study, where segmental sclerosis was defined as solidification/obliteration involving any part of the tuft and not broken down in subtypes based on location or cellularity.⁵ The lack of consistency in recognizing the type of segmental sclerosis may appear to challenge the value of the conventional classification system of focal segmental glomerulosclerosis.²⁵ While low concordance is experienced when using individual descriptors defining the subtypes of segmental sclerosis, the application of the Columbia classification system at the glomerular level may have better concordance.¹⁶ The paradox that summary diagnostic approaches, rather than lesion-driven diagnostic paradigms, have better performance in concordance studies suggests that pathologists are using the totality of the histopathology to arrive at a diagnosis. This 'holistic' approach may be diagnostically powerful, but may limit prognostic utility, which is better elucidated by feature-based criteria.
In addition, although all participants recognized epithelial cell (podocyte) injury, there were features that were inconsistently identified across reviewers, with the greatest difficulty in differentiating segmental vs global lesions and hyperplasia vs hypertrophy. When segmental and global or hypertrophy and hyperplasia were combined, concordance increased. Good concordance was obtained by combining all podocyte abnormalities. It also appeared that the challenge in identifying segmental vs global lesions is not limited to podocytes but also applicable to mesangial cell proliferation. Again, by combining segmental and global mesangial cell proliferation, the kappa coefficient increased in the 315 glomeruli study to 0.64, confirming that the overall mesangial cell proliferation has adequate concordance to be included in classification systems.⁵ The poor concordance of these features suggests that they require additional refinement and evaluation before inclusion in classification systems where, for example, the recognition of segmental vs global damage/proliferation may drive therapeutic choices.²⁹ Additional studies, currently in process, have been developed with the goals of (a) re-testing this approach provided more training, (b) testing reproducibility in the context of a European-based (EURenOmics) and Chinese-based (NEPTUNE-China) study by a different set of reviewing pathologists applying the NEPTUNE digital pathology scoring system, (c) testing all NEPTUNE descriptors using different metrics (for example continuous vs dichotomous), and (d) applying other statistical methods.

When comparing data from NEPTUNE pathologists after several training sessions to non-NEPTUNE pathologists, the overall concordance was in favor of the NEPTUNE pathologists, although on 315 glomeruli the number of descriptors with a good or excellent concordance was greater for non-NEPTUNE pathologists. Several factors may have contributed to this result, including variability among pathologists.

This study also addressed whether concordance depended on years of experience in clinical practice. The overall coefficient of concordance did not change with the exclusion of pathologists in training. Trainees are accustomed to recognizing individual features as part of the learning process, whereas experienced pathologists are used to pattern recognition that summarizes individual features into a diagnostic line.

After post-study revision of the reference manual to add clarity to the descriptor definitions (Supplementary Table 6), the NEPTUNE digital pathology scoring system and protocol were shared with and implemented by other multicenter consortia, leading to the creation of INTEGRATE (INTErnational diGital nephRopAThology nEtwork), a network of pathologists from North America (NEPTUNE), Europe (EURenOmics) and Asia (China-DiKip).

In conclusion, the NEPTUNE digital pathology scoring system provides comprehensive analysis of renal structures with good-to-excellent concordance for many parameters. Although previous classification systems have eliminated poorly performing descriptors,⁵ here we provide an alternative model that maintains the original scoring metrics, but applies summary measures of clustered features and recommends continuing cross-training and consensus meetings. Because metrics should ultimately be measured against their contribution to outcome and to guiding therapy, the rationale for improving performance rather than dropping descriptors is that these descriptors have potential for important clinical value. Thus, this novel protocol for continuous improvement may serve as a model with the potential to modify current classification systems. It is applicable across multiple international consortia, enabling worldwide collaboration and the compilation of permanently recordable, granular observational data suitable for correlation with clinical and molecular profiling of glomerular diseases.

References

1. Adam B, Randhawa P, Chan S et al. Banff initiative for quality assurance in transplantation (BIFQUIT): reproducibility of polyomavirus immunohistochemistry in kidney allografts. Am J Transplant 2014;14:2137–2147.
2. Haas M, Sis B, Racusen LC et al. Banff 2013 meeting report: inclusion of C4d-negative antibody-mediated rejection and antibody-associated arterial lesions. Am J Transplant 2014;14:272–283.
3. Roberts CA, Beitsch PD, Litz CE et al. Interpretive disparity among pathologists in breast sentinel lymph node evaluation. Am J Surg 2003;186:324–329.
4. Roberts JM, Jin F, Thurloe JK et al. High reproducibility of histological diagnosis of human papillomavirus-related intraepithelial lesions of the anal canal. Pathology 2015;47:308–313.
5. Working Group of the International IgA Nephropathy Network and the Renal Pathology Society, Roberts IS, Cook HT et al. The Oxford classification of IgA nephropathy: pathology definitions, correlations, and reproducibility. Kidney Int 2009;76:546–556.
6. Barisoni L, Schnaper HW, Kopp JB. A proposed taxonomy for the podocytopathies: a reassessment of the primary nephrotic diseases. Clin J Am Soc Nephrol 2007;2:529–542.
7. Barisoni L, Schnaper HW, Kopp JB. Advances in the biology and genetics of the podocytopathies: implications for diagnosis and therapy. Arch Pathol Lab Med 2009;133:201–216.
8. Polley MY, Leung SC, Gao D et al. An international study to increase concordance in Ki67 scoring. Mod Pathol 2015;28:778–786.
9. Polley MY, Leung SC, McShane LM et al. An international Ki67 reproducibility study. J Natl Cancer Inst 2013;105:1897–1906.
10. Jen KY, Olson JL, Brodsky S et al. Reliability of whole slide images as a diagnostic modality for renal allograft biopsies. Hum Pathol 2013;44:888–894.
11. Ozluk Y, Blanco PL, Mengel M et al. Superiority of virtual microscopy versus light microscopy in transplantation pathology. Clin Transplant 2012;26:336–344.
12. Barisoni L, Jennette JC, Colvin R et al. Novel quantitative method to evaluate globotriaosylceramide inclusions in renal peritubular capillaries by virtual microscopy in patients with Fabry disease. Arch Pathol Lab Med 2012;136:816–824.
13. Barisoni L, Nast CC, Jennette JC et al. Digital pathology evaluation in the multicenter Nephrotic Syndrome Study Network (NEPTUNE). Clin J Am Soc Nephrol 2013;8:1449–1459.
14. Nast CC, Lemley KV, Hodgin JB et al. Morphology in the digital age: integrating high-resolution description of structural alterations with phenotypes and genotypes. Semin Nephrol 2015;35:266–278.
15. Gadegbeku CA, Gipson DS, Holzman LB et al. Design of the Nephrotic Syndrome Study Network (NEPTUNE) to evaluate primary glomerular nephropathy by a multidisciplinary approach. Kidney Int 2013;83:749–756.
16. Meehan SM, Chang A, Gibson IW et al. A study of interobserver reproducibility of morphologic lesions of focal segmental glomerulosclerosis. Virchows Arch 2013;462:229–237.
17. Ford SL, Polkinghorne KR, Longano A et al. Histopathologic and clinical predictors of kidney outcomes in ANCA-associated vasculitis. Am J Kidney Dis 2014;63:227–235.
18. Lemley KV. Diabetes and chronic kidney disease: lessons from the Pima Indians. Pediatr Nephrol 2008;23:1933–1940.
19. Miettinen J, Helin H, Pakarinen M et al. Histopathology and biomarkers in prediction of renal function in children after kidney transplantation. Transpl Immunol 2014;31:105–1.
20. Mise K, Hoshino J, Ubara Y et al. Renal prognosis a long time after renal biopsy on patients with diabetic nephropathy. Nephrol Dial Transplant 2014;29:109–118.
21. Mise K, Hoshino J, Ueno T et al. Clinical and pathological predictors of estimated GFR decline in patients with type 2 diabetes and overt proteinuric diabetic nephropathy. Diabetes Metab Res Rev 2015;31:572–581.
22. Mise K, Hoshino J, Ueno T et al. Impact of tubulointerstitial lesions on anaemia in patients with biopsy-proven diabetic nephropathy. Diabet Med 2015;32:546–5.
23. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–174.
24. Thomas DB, Franceschini N, Hogan SL et al. Clinical and pathologic characteristics of focal segmental glomerulosclerosis pathologic variants. Kidney Int 2006;69:920–926.
25. D'Agati VD, Fogo AB, Bruijn JA et al. Pathologic classification of focal segmental glomerulosclerosis: a working proposal. Am J Kidney Dis 2004;43:368–382.
26. Gavrielides MA, Conway C, O'Flaherty N et al. Observer performance in the use of digital and optical microscopy for the interpretation of tissue-based biomarkers. Anal Cell Pathol (Amst) 2014.
27. Wang H, Sima CS, Beasley MB et al. Classification of thymic epithelial neoplasms is still a challenge to thoracic pathologists: a reproducibility study using digital microscopy. Arch Pathol Lab Med 2014;138:658–663.
28. Barisoni L, Kriz W, Mundel P et al. The dysregulated podocyte phenotype: a novel concept in the pathogenesis of collapsing idiopathic focal segmental glomerulosclerosis and HIV-associated nephropathy. J Am Soc Nephrol 1999;10:51–61.
29. Weening JJ, D'Agati VD, Schwartz MM et al. The classification of glomerulonephritis in systemic lupus erythematosus revisited. J Am Soc Nephrol 2004;15:241–250.

Acknowledgements

The Nephrotic Syndrome Study Network Consortium (NEPTUNE) is a part of the National Center for Advancing Translational Sciences (NCATS), the Rare Disease Clinical Research Network (RDCRN), and is supported through a collaboration between the Office of Rare Diseases Research (ORDR), NCATS, and the National Institute of Diabetes and Digestive and Kidney Diseases. RDCRN is an initiative of ORDR and NCATS. Additional funding and/or programmatic support for this project has also been provided by the University of Michigan, NephCure Kidney International, and the Halperin Foundation. This study was supported by a NEPTUNE pilot award. We thank Dr Charles Jennette for his contribution to creating the descriptor manual and participation in the training webinar sessions, and Dr William Smoyer for critical review of the manuscript.

Author information

Laura Barisoni, Cynthia Nast and Stephen M Hewitt are the lead NEPTUNE pathologists for digital imaging and case collection in the NEPTUNE Digital Pathology Repository.

Authors and Affiliations

Department of Pathology, University of Miami, Miller School of Medicine, Miami, FL, USA
Laura Barisoni & Lino Merlino
Department of Pediatrics, Division of Pediatric Nephrology, University of Michigan, Ann Arbor, MI, USA
Jonathan P Troost & Debbie Gibson
Department of Pathology, Cedars-Sinai Medical Center, Los Angeles, CA, USA
Cynthia Nast
Department of Pathology, Johns Hopkins University, Baltimore, MD, USA
Serena Bagnasco
Department of Pathology, University of Toronto, Toronto, Ontario, Canada
Carmen Avila-Casado
Department of Pathology, University of Michigan, Ann Arbor, MI, USA
Jeffrey Hodgin
Department of Pathology, University of Pennsylvania, Philadelphia, PA, USA
Matthew Palmer
Laboratory of Pathology, National Cancer Institute, Bethesda, MD, USA
Avi Rosenberg & Stephen M Hewitt
Department of Pathology, University of North Carolina, Chapel Hill, NC, USA
Adil Gasim
Department of Medicine, Division of Nephrology, University of Michigan, Ann Arbor, MI, USA
Chrysta Liensziewski & Matthias Kretzler
Department of Pathology, Keelung Chang Gung Memorial Hospital, Keelung, Taiwan
Hui-Ping Chien
Department of Pathology, University of Chicago, Chicago, IL, USA
Anthony Chang
Department of Pathology, Sharp Memorial Hospital, San Diego, CA, USA
Shane M Meehan
Department of Pathology, Washington University, St. Louis, MO, USA
Joseph Gaut
Biostatistics Department, School of Public Health, University of Michigan, Ann Arbor, MI, USA
Peter Song & Brenda W Gillespie
Department of Medicine, Division of Nephrology, University of Pennsylvania, Philadelphia, PA, USA
Lawrence Holzman

Corresponding authors

Correspondence to Laura Barisoni or Brenda W Gillespie.

Ethics declarations

Competing interests

The authors declare no conflict of interest.

Additional information

Supplementary Information accompanies the paper on the Modern Pathology website.

Supplementary information

Supplementary Table 1 (DOC 266 kb)

Supplementary Table 2 (DOC 203 kb)

Supplementary Table 3 (DOC 307 kb)

Supplementary Table 4 (DOC 207 kb)

Supplementary Table 5 (DOC 239 kb)

Supplementary Table 6 (DOC 94 kb)

Supplementary Information (DOCX 10 kb)

Supplementary Figure 1 (JPG 212 kb)

Supplementary Figure 2 (JPG 712 kb)

About this article

Cite this article

Barisoni, L., Troost, J., Nast, C. et al. Reproducibility of the NEPTUNE descriptor-based scoring system on whole-slide images and histologic and ultrastructural digital images. Mod Pathol 29, 671–684 (2016). https://doi.org/10.1038/modpathol.2016.58

Received: 03 November 2015
Revised: 12 February 2016
Accepted: 13 February 2016
Published: 22 April 2016
Issue Date: July 2016
DOI: https://doi.org/10.1038/modpathol.2016.58
