Main

The challenge of inter-reader concordance on individual morphologic features of diagnostic renal biopsies is well documented and is highlighted in large collaborative studies.1, 2, 3, 4, 5 As the complexity of morphologic characterization and the number of features increase, it becomes more difficult to ensure intra- and inter-reader concordance. Whereas one feature may show poor performance, a related, potentially surrogate feature may show excellent performance and thus be preferable for routine diagnostic use. In the future, conventional interpretative diagnoses may be revised to include combined morphologic and molecular signatures.1, 6, 7 With these changes in pathology practice, it is important to assess the performance of individual metrics. The past approach has been to develop metrics that demonstrate high intra-pathologist concordance and good to high inter-pathologist concordance.

The availability of digital whole-slide images allows nephropathologists to overcome limitations of conventional light microscopy analysis and to address concordance.8, 9 Recent studies have demonstrated the high concordance and reliability of whole-slide images compared with conventional light microscopy evaluation for diagnoses of renal allograft rejection, as well as for individual Banff morphologic criteria.10, 11 Morphologic analysis of annotated peritubular capillaries on whole-slide images in Fabry disease suggests that pre-selecting the specific structures to be scored, achievable only with digital imaging, increases concordance.12

The multicenter Nephrotic Syndrome Study Network (NEPTUNE) exemplifies a new model of systematic digital pathology review. The NEPTUNE Digital Pathology Protocol documents the whole-slide image-based scoring protocol, including the selection of specific structures (eg, glomeruli) and the application of the NEPTUNE Digital Pathology Scoring System for comprehensive scoring of glomerular, vascular, and tubulointerstitial morphologic features (descriptors).13, 14

This study aimed to assess inter- and intra-reader concordance, and the effect of consensus review and training sessions on the NEPTUNE Digital Pathology Scoring System. The ultimate goal is to establish new models for standardization of renal biopsy morphologic profiling, and to test validated descriptors as potential predictors of diagnosis, prognosis, and response to treatment.

Materials and methods

Digital Infrastructure

Pathology material was obtained from the NEPTUNE Digital Pathology Repository, which stores whole-slide images (from glass slides scanned at 40× on Hamamatsu and Aperio scanners), immunofluorescence and electron microscopy (EM) images, and electronic copies of de-identified original pathology reports from cases of focal segmental glomerulosclerosis, minimal change disease, and membranous nephropathy.13, 15

Preparation and Training for Scoring

Descriptor reference manual and image library

A reference manual was generated and refined by NEPTUNE pathologists during webinar consensus meetings (Supplementary Figure 1). Descriptors were evaluated for clarity prior to initiation of the concordance tests. The manual was posted in the NEPTUNE digital pathology repository (see Table 1 for the descriptors used in this study and Supplementary Table 6 for the comprehensive descriptor reference manual). A library of representative images was created and posted in the NEPTUNE digital pathology repository for independent review prior to initiation of the study, and then removed during the trial.

Table 1 Post study revised definitions of descriptors included in the current study

Electronic scoring documents, material, and test instructions

Separate electronic scoring templates were generated for tubulointerstitial, ultrastructural, and glomerular scoring. The electronic matrix templates were pre-populated with '0' (absent) scores, so reviewers needed only to select the descriptors applicable to a given case/image; semiquantitative or quantitative scores were entered from a dropdown list. For better visualization, the color of a cell automatically changed when a value other than 0 was selected ('0'=blue to '1'=red) (Supplementary Figure 2).
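For illustration only, the sketch below shows how a scoring matrix with pre-populated '0' values, dropdown entry, and automatic highlighting of non-zero scores could be assembled in Python with openpyxl; the descriptor names, the 0/1-only dropdown, the sheet layout, and the file name are hypothetical assumptions and are not taken from the actual NEPTUNE templates.

```python
# Minimal sketch of a pre-populated scoring matrix with dropdown entry and
# conditional coloring. Descriptor and sheet names are illustrative only.
from openpyxl import Workbook
from openpyxl.styles import PatternFill
from openpyxl.formatting.rule import CellIsRule
from openpyxl.worksheet.datavalidation import DataValidation

descriptors = ["segmental_sclerosis", "global_sclerosis", "foam_cells"]  # hypothetical subset
glomeruli = [f"glom_{i}" for i in range(1, 11)]

wb = Workbook()
ws = wb.active
ws.title = "glomerular_scoring"

# Header row: one column per descriptor.
ws.append(["glomerulus"] + descriptors)

# Pre-populate every score cell with 0 (absent).
for g in glomeruli:
    ws.append([g] + [0] * len(descriptors))

# Restrict entries to 0/1 via a dropdown list (score cells start in column B).
dv = DataValidation(type="list", formula1='"0,1"', allow_blank=False)
ws.add_data_validation(dv)
score_range = f"B2:{chr(ord('A') + len(descriptors))}{len(glomeruli) + 1}"
dv.add(score_range)

# Highlight any non-zero score so positive findings stand out, mirroring the
# color change described in the protocol.
red = PatternFill(start_color="FFC7CE", end_color="FFC7CE", fill_type="solid")
ws.conditional_formatting.add(
    score_range,
    CellIsRule(operator="greaterThan", formula=["0"], fill=red),
)

wb.save("scoring_template_sketch.xlsx")
```

The actual study distributed shared electronic scoring sheets; the sketch simply mirrors the pre-population and color-change behavior described above.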

For glomerular scoring, electronic scoring templates included the list of glomeruli, and JPEG images of glomeruli were provided to all pathologists. Separate electronic scoring sheets with the lists of cases to access in the NEPTUNE digital pathology repository were provided to test tubulointerstitial descriptors, such as interstitial fibrosis/tubular atrophy, and ultrastructural podocyte features. Specific instructions for each of the metrics were made available. Training for data entry on the electronic scoring sheets was done during webinar meetings prior to the concordance tests.

Concordance Study Protocol

Image selection

For glomerular histologic descriptors, JPEG (Joint Photographic Experts Group) images (stained with hematoxylin and eosin, periodic acid–Schiff, trichrome, and silver) were obtained both from annotated whole-slide images in the NEPTUNE digital pathology repository and from images previously used in a concordance study of the Columbia classification.16 For tubulointerstitial and ultrastructural podocyte descriptors, whole-slide images and EM JPEG images stored in the NEPTUNE digital pathology repository were used. All images were from previously anonymized whole-slide images or EM digital images collected in the NEPTUNE digital pathology repository following Institutional Review Board guidelines and upon approval at each participating center.

A total of 315 images of glomeruli were hand selected based on image quality and representation of descriptors; these included classic examples as well as more controversial lesions. Interstitial fibrosis and tubular atrophy scoring was tested on whole-slide images from 244 cases of minimal change disease, focal segmental glomerulosclerosis, and membranous nephropathy, and podocyte descriptors were tested on 178 ultrastructural images (minimum of five EM images/case) from the minimal change disease/focal segmental glomerulosclerosis cohort.

Participating pathologists

Twelve pathologists participated in the scoring: eight NEPTUNE pathologists (P1–8; seven participated in glomerular scoring, five in interstitial fibrosis and tubular atrophy scoring, and five in podocyte scoring) and four pathologists recruited outside the NEPTUNE consortium (non-NEPTUNE pathologists) (P9–12; three participated in glomerular scoring and one in interstitial fibrosis and tubular atrophy scoring). The level of experience ranged from fellowship level (P7 and P9) to >10 years of experience in renal pathology (Supplementary Table 1).

Glomerular descriptor concordance tests

To assess intra- and inter-reader concordance and the effect of cross-training/consensus review on inter-reader concordance, 131 images of glomeruli were scored three times (Tests I, II, and III) by seven NEPTUNE pathologists (Supplementary Figure 1). Webinar reviews occurred 2–4 weeks after each test. Washout intervals between tests varied from 2.5 to 4 months. To increase the number of inter-reader observations, 184 additional images were added to Test II for NEPTUNE pathologists. The 315 images were also scored once by three non-NEPTUNE pathologists, who had one webinar training session. The 131 glomeruli from Tests I and III and the 315 glomeruli from Test II were reviewed during consensus webinar meetings to increase concordance in descriptor recognition.

Intra-reader concordance of glomerular descriptors was estimated by comparing each pathologist's scores from Test I vs Test II and Test II vs Test III. These estimates of concordance may be reduced as a result of webinar training; ie, gained knowledge about scoring may reduce consistency with previous scoring.

Inter-reader concordance of descriptors was estimated separately for each Test (I, II, and III), and involved computing concordance for each pair of pathologists, and pooling these estimates over all possible pairs. In addition to the overall estimate of inter-reader concordance, we were interested in four research questions: (a) whether continuous cross-training improved concordance, (b) whether concordance differed by the pathologist's experience, (c) whether concordance was higher using clusters of descriptors sharing similar features than for individual descriptors, and (d) whether concordance was maintained outside the NEPTUNE investigators.

Tubulointerstitial descriptor concordance tests

To test concordance of non-glomerular parameters, we considered the most clinically relevant tubulointerstitial parameters,17, 18, 19, 20, 21, 22 the percentage (0–100%) of cortex involved by interstitial fibrosis and tubular atrophy, for 244 cases. Conventional pathology practice includes semiquantitative assessment of interstitial fibrosis and tubular atrophy. Therefore, interstitial fibrosis and tubular atrophy scoring was not preceded by webinar training and was performed only once by six pathologists.

Podocyte descriptor concordance tests

Although ultrastructural evaluation of podocyte morphology is common in pathology practice, estimates of some ultrastructural parameters are often not reported. Thus, the podocyte descriptor test was preceded by a webinar session to review definitions reflecting effacement, condensation of actin-based cytoskeleton, microvillous transformation, and loss of primary processes. Ultrastructural podocyte descriptors were scored by five pathologists with 1 to >10 years of experience on 178 cases (minimal change disease/focal segmental glomerulosclerosis) as follows: foot process effacement: 0=1–10%, 1+=11–25%, 2+=26–50%, 3+=51–75%, and 4+=>75%; condensation of actin-based cytoskeleton and microvillous transformation: 0=not observed, 1+=segmental (≤50%), 2+=global (>50%); loss of primary processes was scored as absent (0) or present (1+).
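As a minimal sketch, assuming raw percentage estimates of each ultrastructural finding are available, the ordinal coding described above could be expressed as follows; the function names are illustrative and are not part of the NEPTUNE protocol.

```python
# Illustrative encoding of the ordinal podocyte scores described in the text.
def score_foot_process_effacement(percent: float) -> int:
    """Map % foot process effacement to the 0 to 4+ ordinal scale."""
    if percent <= 10:
        return 0   # 1-10%
    if percent <= 25:
        return 1   # 11-25%
    if percent <= 50:
        return 2   # 26-50%
    if percent <= 75:
        return 3   # 51-75%
    return 4       # >75%

def score_segmental_global(percent: float) -> int:
    """Actin condensation / microvillous transformation: 0 absent, 1+ segmental (<=50%), 2+ global (>50%)."""
    if percent == 0:
        return 0
    return 1 if percent <= 50 else 2

assert score_foot_process_effacement(30) == 2
assert score_segmental_global(60) == 2
```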

Statistical Methods

For the (dichotomous) glomerular descriptors, intra-reader agreement was assessed by both Cohen's kappa and pathologist-specific counts of the number of descriptors that the pathologist rated the same way in two consecutive readings. Inter-reader agreement between pairs of pathologists was also estimated using Cohen's kappa, but Fleiss' kappa23 was used to estimate inter-reader agreement pooled across all pathologists. The variability in kappa values across pairs of pathologists for each descriptor is shown using boxplots.
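A minimal sketch of these computations, using illustrative 0/1 scores rather than study data, might look like the following; it assumes the scikit-learn and statsmodels implementations of Cohen's and Fleiss' kappa.

```python
# Pairwise Cohen's kappa and pooled Fleiss' kappa for one dichotomous descriptor.
from itertools import combinations
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = glomeruli, columns = pathologists; 0/1 presence of one descriptor (made-up values)
scores = np.array([
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
])

# Inter-reader agreement for each pair of pathologists (Cohen's kappa).
pairwise = {
    (i, j): cohen_kappa_score(scores[:, i], scores[:, j])
    for i, j in combinations(range(scores.shape[1]), 2)
}

# Pooled inter-reader agreement across all pathologists (Fleiss' kappa).
counts, _ = aggregate_raters(scores)   # per-glomerulus category counts
pooled = fleiss_kappa(counts)

print(pairwise)
print(pooled)
```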

Scoring was also performed for clusters of glomerular descriptors sharing morphologic similarities. A cluster was judged to be present if at least one descriptor of the cluster was present, and Fleiss' kappa was used to assess pooled inter-reader agreement among pathologists. The kappa statistic ranges from −1 (perfect disagreement) to 1 (perfect agreement), with a value of 0 indicating agreement expected by chance alone. Kappa statistics were categorized and interpreted as: >0.80 (excellent); 0.61–0.80 (good); 0.41–0.60 (moderate); 0.21–0.40 (fair); 0–0.20 (poor); and <0 (no agreement) (http://healthcare-economist.com/2011/11/02/kappa-statistic). Because kappa is smaller when the finding under observation has lower prevalence, we report the range over pathologists of the number of glomeruli in which each descriptor was observed. Although we calculated kappa statistics for all descriptors rated as present by at least one pathologist, some results exclude descriptors with insufficient observations, defined as those for which no pathologist observed the finding in five or more glomeruli.
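A minimal sketch of the cluster rule ('present if at least one component descriptor is present') and of the kappa interpretation bands listed above is shown below; the descriptor and cluster names are hypothetical.

```python
# Cluster presence and kappa interpretation bands, as described in the text.
import pandas as pd

def cluster_presence(scores: pd.DataFrame, members: list[str]) -> pd.Series:
    """1 if any component descriptor of the cluster is scored present, else 0."""
    return scores[members].max(axis=1)

def interpret_kappa(k: float) -> str:
    """Categorize a kappa value using the bands given in the Statistical Methods."""
    if k < 0:
        return "no agreement"
    if k <= 0.20:
        return "poor"
    if k <= 0.40:
        return "fair"
    if k <= 0.60:
        return "moderate"
    if k <= 0.80:
        return "good"
    return "excellent"

# Example: a hypothetical cluster built from subtype descriptors.
glom_scores = pd.DataFrame(
    {"tip_lesion": [1, 0, 0], "perihilar": [0, 0, 1], "nos": [0, 0, 0]}
)
glom_scores["segmental_obliteration"] = cluster_presence(
    glom_scores, ["tip_lesion", "perihilar", "nos"]
)
print(interpret_kappa(0.64))  # "good"
```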

We investigated the four research questions listed above as follows: (a) to assess whether inter-reader concordance could improve with cross-training, we evaluated the number of descriptors that increased in concordance between Tests I and II and between Tests I and III; (b) to assess whether inter-reader concordance depended on the pathologist's years of experience, we compared the kappas from all pathologists with the kappas excluding the trainees; (c) to assess the effect of scoring descriptor clusters, we visually compared cluster concordance with individual concordance estimates for each descriptor in the cluster; and (d) to assess whether concordance was maintained outside the NEPTUNE investigators, we compared concordance among the three non-NEPTUNE pathologists with concordance among the seven NEPTUNE pathologists using the 315 glomerular images from Test II. The NEPTUNE and non-NEPTUNE summary kappas were compared by paired t-test.
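For research question (d), a minimal sketch of the paired t-test on per-descriptor summary kappas might look like the following; the kappa values are made-up placeholders, not study results.

```python
# Paired t-test comparing per-descriptor summary kappas between two groups of readers.
import numpy as np
from scipy import stats

neptune_kappas = np.array([0.55, 0.62, 0.41, 0.70, 0.48])      # one value per descriptor (illustrative)
non_neptune_kappas = np.array([0.53, 0.65, 0.44, 0.66, 0.50])  # same descriptors, other group (illustrative)

t_stat, p_value = stats.ttest_rel(neptune_kappas, non_neptune_kappas)
print(f"mean difference = {np.mean(neptune_kappas - non_neptune_kappas):.3f}, P = {p_value:.3f}")
```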

For the continuous interstitial fibrosis and tubular atrophy scores, inter-reader agreement was estimated using Pearson's correlation coefficient for all pairs of pathologists (in each pair, the pathologist with more vs fewer years of experience). For the ordinal podocyte descriptors, Kendall's coefficient of concordance was used to assess inter-reader agreement for pairs of pathologists.
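A minimal sketch of these pairwise statistics is shown below, using illustrative data; because a tie-corrected Kendall's coefficient of concordance is not provided directly by SciPy, the sketch implements the basic (uncorrected) formula, which is a simplifying assumption.

```python
# Pairwise Pearson correlation for continuous scores and a basic Kendall's W
# (no tie correction) for ordinal scores; rows = cases, columns = raters.
from itertools import combinations
import numpy as np
from scipy import stats

def pairwise_pearson(scores: np.ndarray) -> dict:
    """Pearson correlation for every pair of raters (columns)."""
    return {
        (i, j): stats.pearsonr(scores[:, i], scores[:, j])[0]
        for i, j in combinations(range(scores.shape[1]), 2)
    }

def kendalls_w(scores: np.ndarray) -> float:
    """Kendall's coefficient of concordance, W = 12S / (m^2 (n^3 - n)), ignoring ties."""
    n, m = scores.shape
    ranks = np.apply_along_axis(stats.rankdata, 0, scores)  # rank cases within each rater
    rank_sums = ranks.sum(axis=1)
    s = np.sum((rank_sums - rank_sums.mean()) ** 2)
    return 12 * s / (m ** 2 * (n ** 3 - n))

ifta = np.array([[10, 15], [40, 35], [70, 80], [5, 5]])          # % cortex, two raters (illustrative)
podocyte = np.array([[4, 4], [2, 3], [0, 0], [1, 2], [3, 3]])    # ordinal 0-4, two raters (illustrative)

print(pairwise_pearson(ifta))
print(kendalls_w(podocyte))
```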

Results

Intra-reader Concordance for Glomerular Descriptors

When comparing Test I vs Test II and Test II vs Test III, the average intra-reader concordance for glomerular descriptors increased with cross-training/consensus webinars (Supplementary Tables 2 and 3). When comparing Test II vs Test III, there were four descriptors for which all readers had good concordance and 11 descriptors for which all readers had at least moderate concordance (Supplementary Table 3). Interestingly, inconsistent intra-reader concordance was noted for lesions of segmental sclerosis corresponding to the 'perihilar' and 'not otherwise specified' variants of the Columbia classification.24 At least moderate intra-reader agreement was found for most of the descriptors commonly associated with segmental sclerosis or collapse, such as various forms of hyalinosis, podocyte hypertrophy, foam cells, or periglomerular fibrosis. Unexpected inconsistency in intra-reader agreement was noted for basic lesions such as global sclerosis, although other forms of global damage (obsolescence, global collapse, deflation, and spikes) were more consistently recognized.

Inter-reader Concordance for Glomerular Descriptors

For the 315 glomeruli (Test II), 48/51 glomerular descriptors had sufficient data for evaluation. The kappa statistics from the combined NEPTUNE and non-NEPTUNE pathologists represent our current best summaries of this investigation. Based on these results, 8/48 descriptors had good inter-reader concordance; these included descriptors indicating global lesions (global spikes, deflation, collapse, and obsolescence) and segmental lesions (foam cells, cellular tip lesion, segmental deflation necrosis). An additional 17/48 descriptors had moderate concordance, so that 52% of the descriptors tested had an inter-reader Cohen's kappa ≥0.40 (Table 3). Concordance between pairs of pathologists varied widely by pair and by descriptor, but most had moderate or better concordance (Figure 1a and b).

Figure 1

For each of 51 descriptors, an inter-reader kappa statistic was calculated for each of the 45 pairs of 10 pathologists. The boxplots show the distributions of these sets of 45 kappa statistics. (a) Glomerular descriptors indicating sclerosis or associated with sclerosis (29 descriptors). (b) All other glomerular descriptors (22 descriptors).

The overall inter-reader concordance increased with cross-training from Test I through Test III among NEPTUNE pathologists on the set of 131 glomerular images. Of the 51 glomerular descriptors tested, 19 were not sufficiently represented to evaluate inter-reader concordance. Of the 32 descriptors with sufficient data for comparison, 56% had improved kappas between Tests I and II, and 63% between Tests I and III. Five descriptors improved from initial kappas of moderate to good or excellent, including global lesions (such as global deflation), segmental lesions (mid-glomerular segmental sclerosis and hyalinosis at the vascular pole), and the descriptor indicating no abnormalities. An additional three descriptors (cellular non-tip lesions, periglomerular fibrosis, and global podocyte hyperplasia) increased in performance from fair/poor to moderate or good (Table 1, Table 2, Figure 1a and b).

Table 2 Inter-reader concordance of all 51 glomerular descriptors by Test (I, II, and III) and NEPTUNE/non-NEPTUNE affiliation

As expected, better concordance was achieved in most cases by clustering descriptors together. Compared with the cluster kappas, most component kappas were substantially smaller. However, for five of the clusters, a single component kappa was larger than the cluster kappa, showing that clustering often, but not always, leads to optimal concordance. Concordance improved when selected descriptors for sclerosing/obliterating lesions or for epithelial cell (podocyte) damage were combined (Table 3).

Table 3 Inter-reader agreement (Cohen's kappa) of NEPTUNE pathologists for clusters of glomerular descriptors assessed in Test I, II, and III (131 glomeruli)

Concordance was independent of years of experience; analysis excluding the data generated by the trainees did not significantly change the overall concordance (data not shown). NEPTUNE and non-NEPTUNE pathologists had comparable overall inter-reader kappas (mean difference between kappas=0.015, paired t-test P=0.502).

Inter-reader Concordance for Tubulointerstitial Parameters

Excellent concordance was seen for both interstitial fibrosis and tubular atrophy, independent of years of experience (Figure 2; Figure 3d and e; Supplementary Table 4). In addition, overall concordance for interstitial fibrosis and tubular atrophy scoring remained consistently excellent when analyzed separately for each disease (minimal change disease, focal segmental glomerulosclerosis, and membranous nephropathy; data not shown).

Figure 2

For each of the interstitial fibrosis and tubular atrophy descriptors, Pearson correlation coefficients were calculated for all 10 pairs of 5 pathologists on 244 cases. The boxplots show the distributions of these sets of correlation coefficients.

Figure 3

Glomerular and tubulointerstitial histologic descriptors: (a) Segmental obliteration of the glomerular tuft at the tip of the glomerulus. Additional descriptors applicable to this glomerular image are segmental epithelial cell hypertrophy, halo, adhesion, and global mesangial cell hypercellularity (Hematoxylin & Eosin). (b) The glomerulus shown here is morphologically profiled by segmental epithelial cell (podocyte) hypertrophy and hyperplasia, hyaline droplets, and segmental collapse (Silver). (c) Foam cells, adhesion, and hyalinosis are noted in the absence of increased matrix (sclerosis) (Hematoxylin & Eosin). (d) Trichrome stain showing increased deposition of collagen between the tubules (interstitial fibrosis) and thickening of the tubular basement membranes in atrophic tubules (tubular atrophy). (e) Trichrome stain showing small tubules with thickened tubular basement membranes lined by small cuboidal epithelial cells. Ultrastructural podocyte descriptors: (f) In this electron micrograph there is extensive foot process effacement with loss or widening of individual foot processes. (g) Aberrant formation of numerous slender cellular projections resembling microvilli or vesicle-like structures is seen along the apical surface of podocytes. (h) Condensation of electron-dense filaments against the sole of the podocytes is present. (i) Loss of primary processes was recorded when epithelial (podocyte) cell bodies were in direct contact with glomerular basement membranes without interposition of primary processes.

Inter-reader Concordance for Podocyte Descriptors

Concordance was excellent for foot process effacement, good for microvillous transformation and condensation of the actin cytoskeleton, and moderate for loss of primary processes (Figures 3f–i and 4; Supplementary Table 5).

Figure 4

For each of the podocyte descriptors, pairwise Kendall’s coefficient of concordance was calculated for all 10 pairs of five pathologists on 178 cases. The boxplots show the distributions of these sets of concordance coefficients.

Descriptor Reference Manual Revision

At the end of the study the descriptor reference manual was revised during several consensus webinar sessions that included NEPTUNE pathologists as well as pathologists outside the consortium, and language was added to improve clarity of definitions (Supplementary Table 6).

Discussion

To take advantage of and coordinate with new findings in molecular nephrology, renal pathologists must identify methodologies and approaches that allow better integration of morphologic evaluation, creating more compelling diagnostic paradigms.14 Furthermore, it is critical to design and implement classification systems for clinical research that are more meaningful with regard to novel renal biomarkers, prognosis, and treatment approaches.1 The use of such morphologic observations requires concordance of pathologic analysis across diseases and across levels of training and experience. One goal of the NEPTUNE consortium is to identify reproducible morphologic variables that can be implemented in clinical practice by creating a new taxonomy of renal diseases. Toward that goal, we carried out a study testing intra- and inter-pathologist concordance using a set of 51 glomerular, two tubulointerstitial, and four ultrastructural features.

The first critical step toward a robust morphologic evaluation was the establishment of well-defined morphologic criteria documented in a reference manual. The NEPTUNE digital pathology scoring system reference manual encompasses features included in other classification systems, and we referred to previously published criteria for some of the descriptors;5, 25 however, many of the descriptors listed in the NEPTUNE digital pathology scoring system, although used in clinical practice to some degree, had not been thoroughly defined by consensus and organized in a comprehensive reference manual prior to this study.

An innovative contribution of this study is the development of a protocol exploiting digital pathology technology. The introduction of digital pathology into large-scale glomerular disease research has enabled simultaneous remote access by multiple users.1, 11, 13, 17, 26, 27 The application of digital technology, and of software for annotation of glomeruli, offers the opportunity to systematically eliminate glomerular selection bias, providing the basis for potentially higher inter-observer concordance.12 Although it is intuitive that there are minimal differences in concordance when interstitial fibrosis and tubular atrophy are scored by conventional light microscopy or on whole-slide images, the value of specifically selecting the structures to be evaluated was recognized in a recent concordance study that used single digital images of glomeruli to identify the five patterns of focal segmental glomerulosclerosis (Columbia classification). This strategy, by eliminating glomerular selection bias, resulted in overall good agreement among the six pathologists.16 In our study, we partially mimicked the strategy utilized by Meehan et al16 by capturing digital images of individual annotated glomeruli from the whole-slide images of the 400 cases stored in the NEPTUNE digital pathology repository. By controlling the modality of image review, the observations, while under the control of the pathologist, were consistent between reviewers with regard to image quality and, to some extent, magnification. Using this approach, we were able to apply an 'object-oriented' evaluation of performance, rather than a specimen-based approach.

Concordance of individual descriptors and factors contributing to concordance: most concordance studies are based on a one-time assessment. In our study, we demonstrated that concordance is modifiable by cross-training over time. A similar approach was tested in a study of thymic epithelial neoplasms, where post-webinar training improved concordance, confirming the value of digital pathology as an educational tool.27 Although the inter-reader discrepancies in our study may appear significant, the number of parameters for which pathologists needed cross-training was much greater than the single diagnosis of epithelial neoplasia in the study by Wang et al.27 Intra-reader concordance also improved with cross-training and webinar-based consensus, as more detailed and objective criteria were provided to the participants, lessening individual reluctance to change internal/subjective criteria. Thus, we still consider our observations encouraging for the systematic application of webinar cross-training to increase intra- and inter-reader concordance.

The best performance was obtained for the interstitial fibrosis and tubular atrophy score, with overall excellent inter-reader concordance despite the lack of previous webinar training. Similarly high concordance was obtained in the Oxford classification study.5 We hypothesize that this excellent performance is a consequence of the routine scoring of interstitial fibrosis and tubular atrophy in renal biopsy practice. Similarly, concordance was proportional to the frequency with which the ultrastructural podocyte descriptors are used in routine renal pathology assessment of biopsies; the highest concordance was recorded for the most commonly used parameter (foot process effacement) and the lowest for the descriptor used only experimentally (loss of primary processes).28 These data raise the question of whether descriptors for which familiarity and training are inadequate should be used and included in future studies. Developing robust training tools and metrics of performance is critical, as these infrequently assessed lesions may correlate with clinical or molecular parameters and may add value to morphologic analysis or classifications. The continuous cross-training approach may ultimately prevent future classification systems from excluding morphologic criteria that initially perform poorly but that may still have great potential as predictors of outcome. This concept may shift the current approach to generating classifications, which selects morphologic features based on concordance, toward including initially less reproducible but valuable observational data through post-training amendment and adjustment options. Should this occur, greater use of such features in routine clinical practice would then increase familiarity and, in turn, improve concordance.

The uneven level of concordance of some glomerular histologic descriptors is not easily explained. Although we eliminated glomerular selection bias and provided a prefilled electronic scoring sheet listing all possible descriptors, the lack of reproducibility for some descriptors may derive from failing to see, or forgetting to mark, a specific lesion among others affecting the same glomerulus, whereas for descriptors that are present in isolation, such as global spikes, it may have been easier to maintain focus. Whereas global collapse or capillary wall spikes had expectedly high concordance, variable concordance was observed for subtypes of global or segmental sclerosis, although when these were consolidated under global or segmental obliteration, overall performance increased. The high concordance of segmental obliteration as an overall category confirms the data obtained in the Oxford classification study, where segmental sclerosis was defined as solidification/obliteration involving any part of the tuft and was not broken down into subtypes based on location or cellularity.5 The lack of consistency in recognizing the type of segmental sclerosis may appear to challenge the value of the conventional classification system of focal segmental glomerulosclerosis.25 Although low concordance is seen when individual descriptors defining the subtypes of segmental sclerosis are used, the application of the Columbia classification system at the glomerular level may have better concordance.16 The paradox that summary diagnostic approaches, rather than lesion-driven diagnostic paradigms, perform better in concordance studies suggests that pathologists use the totality of the histopathology to arrive at a diagnosis. This 'holistic' approach may be diagnostically powerful, but may limit prognostic utility, which is better elucidated by feature-based criteria.

In addition, although all participants recognized epithelial cell (podocyte) injury, some features were inconsistently identified across reviewers, with the greatest difficulty in differentiating segmental vs global lesions and hyperplasia vs hypertrophy. When segmental and global, or hypertrophy and hyperplasia, were combined, concordance increased. Good concordance was obtained by combining all podocyte abnormalities. The challenge in identifying segmental vs global lesions was not limited to podocytes but also applied to mesangial cell proliferation. Again, by combining segmental and global mesangial cell proliferation, the kappa coefficient increased to 0.64 in the 315-glomerulus study, confirming that overall mesangial cell proliferation has adequate concordance to be included in classification systems.5 The poor concordance of the individual features suggests that they require additional refinement and evaluation before inclusion in classification systems where, for example, the recognition of segmental vs global damage/proliferation may drive therapeutic choices.29

Additional studies, currently in progress, have been developed with the goals of (a) re-testing this approach after additional training, (b) testing reproducibility in the context of European-based (EURenOmics) and Chinese-based (NEPTUNE-China) studies by a different set of reviewing pathologists applying the NEPTUNE digital pathology scoring system, (c) testing all NEPTUNE descriptors using different metrics (for example, continuous vs dichotomous), and (d) applying other statistical methods.

When comparing data from NEPTUNE pathologists after several training sessions with data from non-NEPTUNE pathologists, the overall concordance favored the NEPTUNE pathologists, although for the 315 glomeruli the number of descriptors with good or excellent concordance was greater for the non-NEPTUNE pathologists. Several factors may have contributed to this result, including variability among pathologists.

This study also addressed whether concordance depended on years of experience in clinical practice. The overall coefficient of concordance did not change with the exclusion of pathologists in training. Trainees are accustomed to recognizing individual features as part of the learning process, whereas experienced pathologists tend to rely on pattern recognition, summarizing individual features into a diagnostic line.

After post-study revision of the reference manual to add clarity to the descriptor definitions (Supplementary Table 6), the NEPTUNE digital pathology scoring system and protocol were shared with and implemented by other multicenter consortia, generating INTEGRATE (INTErnational diGital nephRopAThology nEtwork) among pathologists from North America (NEPTUNE), Europe (EURenOmics), and Asia (China-DiKip).

In conclusion, the NEPTUNE digital pathology scoring system provides comprehensive analysis of renal structures with good-to-excellent concordance for many parameters. Although previous classification systems have eliminated poorly performing descriptors,5 here we provide an alternative model that maintains the original scoring metrics but applies summary measures of clustered features and recommends continued cross-training and consensus meetings. Because metrics should ultimately be measured against their contribution to outcome and to guiding therapy, the rationale for improving performance rather than dropping descriptors is that these descriptors may have important clinical value. Thus, this novel protocol for continuous improvement may serve as a model with the potential to modify current classification systems; it is applicable across multiple international consortia, enabling worldwide collaboration and the compilation of permanently recordable granular observational data suitable for correlation with clinical and molecular profiling of glomerular diseases.