Stability of diagnostic rate in a cohort of 38,813 colorectal polyp specimens and implications for histomorphology and statistical process control

This work sought to quantify pathologists' diagnostic bias over time in their evaluation of colorectal polyps, to assess how this may impact the utility of statistical process control (SPC). All colorectal polyp specimens (CRPS) for 2011–2017 in a region were categorized using a validated free text string matching algorithm. Pathologist diagnostic rates (PDRs) for high grade dysplasia (HGD), tubular adenoma (TA_ad), villous morphology (TVA + VA), sessile serrated adenoma (SSA) and hyperplastic polyp (HP) were assessed (1) for each pathologist in yearly intervals with control charts (CCs), and (2) with a generalized linear model (GLM). The study included 64,115 CRPS. Fifteen pathologists each interpreted > 150 CRPS/year in all years and together diagnosed 38,813. The number of pathologists (of 15) with zero or one (p < 0.05) outlier in seven years, compared to their overall PDR, was 13, 9, 9, 5 and 9 for HGD, TVA + VA, TA_ad, HP and SSA respectively. The GLM confirmed, for the subset where pathologists/endoscopists each saw > 600 CRPS (total 52,760 CRPS), that pathologist, endoscopist, anatomical location and year were all strongly correlated (all p < 0.0001) with the diagnosis. The moderate PDR stability over time supports the hypothesis that diagnostic rates are amenable to calibration via SPC and outcome data.

1. one of the following words: "colon", "rectum", "rectal", "cecum", "cecal", "rectosigmoid" in the "source of specimen" section of the report;
2. "polyp" within the "source of specimen" section of the report.
The "source of specimen" section is what the endoscopist labels the specimens as. It was chosen as it was deemed the most uniform of the report sections.
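The two inclusion criteria above amount to a keyword filter on the "source of specimen" text. A minimal sketch (not the authors' code; the function name and the exact case-insensitive substring matching are illustrative assumptions) could look like:

```python
# Illustrative sketch of the specimen-retrieval filter described above:
# a report is retrieved if its "source of specimen" section contains
# both a large-bowel site word and the word "polyp".
SITE_WORDS = {"colon", "rectum", "rectal", "cecum", "cecal", "rectosigmoid"}

def is_colorectal_polyp_specimen(source_of_specimen):
    """Apply both inclusion criteria to the 'source of specimen' text."""
    text = source_of_specimen.lower()
    has_site = any(word in text for word in SITE_WORDS)
    has_polyp = "polyp" in text
    return has_site and has_polyp
```

Both criteria must be met, so a gastric polyp or a large-bowel biopsy without "polyp" in the label would be excluded.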
It should be noted that the "specimens" correspond to bottles/containers submitted. One specimen may in fact contain zero, one or several polyps that are from one or more anatomical sites. Several specimens may be derived from one surgical case, and a number of surgical cases may originate from one individual; quantification of neither relationship is part of this study.
Retrieved specimens were written to a tab-separated file, which was then further processed to replace the surgical case number, submitting physicians and pathologists with unique anonymous identifiers.

Figure 1. Control charts for individual pathologists. Each panel (control chart) in this figure is one diagnosis (e.g. SSA) for a pathologist that read > 150 CRPS/year in the seven years of the study. The black circles represent the normalized PDR in different years; the denominator within the normalized PDR is the number of colorectal polyp specimens read by the individual pathologist in the year. The thick solid black line is the pathologist's average (or mean) call rate (PACR). The dashed blue control lines above and below the PACR are 2 standard deviations (SDs) from the PACR; points outside the range are p < 0.05. The solid blue control lines represent 3 SDs from the PACR; points outside the range are p < 0.003. The outer finely dashed blue control lines above/below the PACR are 5 SDs and 7 SDs from the PACR respectively; points above/below the lines are p < 6e−7 and p < 3e−12 respectively. Control lines may be absent if the PACR is close to zero (as negative normalized PDRs do not have a physical interpretation). The individual panels (control charts) are: (A) high-grade dysplasia (HGD) showing the "in control" condition; (B) tubulovillous adenoma + villous adenoma (TVA + VA) showing the "in control" condition; (C) sessile serrated adenoma (SSA) showing the "in control" condition; (D) hyperplastic polyp (HP) with a "blip" (> 2 SDs/p < 0.05 outlier) in year two of the study; (E) HGD with a "blip" (p < 0.003 outlier) in year one of the study; (F) HGD with a "trend" (non-significant, not crossing control lines); (G) SSA with a "trend" (significant, crossing control lines); (H) TVA + VA with a "trend" (significant, crossing control lines); (I) HGD with a "step" (significant, crossing control lines); (J) TVA + VA with a "step" (significant, crossing control lines).
www.nature.com/scientificreports/

Specimens were then tabulated. Cases were classified by fuzzy string matching, using an open source library called google-diff-match-patch and several dictionaries of terms, into:
1. one or more of 40 diagnostic categories (based on 194 phrases), see "Appendix A";
2. one or more of 12 anatomical locations (based on 18 phrases or 9 measurement cut points), see "Appendix B".
Audits were done with randomly selected CRPS to assess the accuracy of the computer's classification. This involved pathologists comparing the (pathology) report free text with the diagnostic categories (listed in "Appendix A" and "Appendix B") assigned by the computer.
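As an illustration of dictionary-driven fuzzy classification, the sketch below uses Python's stdlib difflib in place of the google-diff-match-patch library actually used in the study; the mini-dictionary, the 0.8 threshold and the function name are hypothetical (the real dictionaries contain 194 diagnostic phrases):

```python
import difflib

# Hypothetical mini-dictionary; the study used 194 diagnostic phrases
# ("Appendix A") with google-diff-match-patch, not difflib.
DIAGNOSIS_PHRASES = {
    "TA": ["tubular adenoma"],
    "HP": ["hyperplastic polyp"],
    "SSA": ["sessile serrated adenoma", "sessile serrated polyp"],
}

def classify(diagnosis_text, threshold=0.8):
    """Return diagnostic codes whose phrases fuzzily match the report text."""
    text = diagnosis_text.lower()
    codes = set()
    for code, phrases in DIAGNOSIS_PHRASES.items():
        for phrase in phrases:
            # Slide the phrase over the text and keep the best match ratio,
            # so minor transcription variants still match.
            n = len(phrase)
            best = max(
                (difflib.SequenceMatcher(None, phrase, text[i:i + n]).ratio()
                 for i in range(max(1, len(text) - n + 1))),
                default=0.0,
            )
            if best >= threshold:
                codes.add(code)
    return codes
```

A report may match more than one code (e.g. a combined TA + HP diagnosis), mirroring the "one or more of 40 diagnostic categories" behaviour described above.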
After categorization and tabulation, the anonymized data set was further processed by a custom GNU/Octave 8 program to create funnel plots and control charts. Funnel plots that included data from all pathologists were centred on the group median diagnostic rate (GMDR). The GMDR was chosen as the reference, as it is (1) not influenced by significant outliers, and (2) not biased by case volume. The funnel edges were defined by two and three standard deviations from the GMDR and calculated via the normal approximation of the binomial distribution, as previously described 9 . Control charts (equivalent to the funnel plots) were created by normalizing to the number of cases read by the highest volume pathologist in the group; details of the normalization are within "Appendix C" 15 . Normalization was done to obscure case volume and facilitate interpretation.
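The 2 and 3 SD funnel edges described above follow directly from the normal approximation of the binomial; a minimal sketch (function name and example values are illustrative, not from the study):

```python
import math

def funnel_limits(p_ref, n, z):
    """Control limits around a reference rate p_ref (e.g. the GMDR) for a
    pathologist reading n specimens, via the normal approximation of the
    binomial: p_ref +/- z * sqrt(p_ref * (1 - p_ref) / n)."""
    se = math.sqrt(p_ref * (1.0 - p_ref) / n)
    lower = max(0.0, p_ref - z * se)  # negative rates have no physical meaning
    upper = p_ref + z * se
    return lower, upper

# e.g. a reference rate of 5% with 600 specimens, 2 SD (p < 0.05) limits
lo, hi = funnel_limits(0.05, 600, 2.0)
```

The `max(0.0, ...)` clamp mirrors the captions' note that control lines may be absent when the centre rate is close to zero.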
Pathologist-specific control charts (showing the year-to-year variation) were created with the individual pathologist's mean diagnostic rate, if the pathologist interpreted at least 600 specimens. The mean was chosen as the number of cases per year was not equal; using the mean ensured that the cases had equal weight in determining the control chart "centre". Data points for a given year were plotted only if the pathologist interpreted at least 150 specimens in that year. The thresholds (600 specimens, 150 specimens/year) were chosen to ensure that the PDR estimates have relatively narrow confidence intervals.

Generalized linear models, with a random intercept for each hospital, were utilized to estimate the association between independent variables (pathologist, submitting MD, anatomical location, and year) and high-grade dysplasia (HGD), villous component (TVA + VA), hyperplastic polyp (HP), tubular adenoma (TA), and sessile serrated adenoma (SSA). These models were implemented using SAS version 9.4 (SAS Institute, Cary, NC).
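The pathologist-specific chart logic (pooled mean as the centre, per-year points only when at least 150 specimens were read) can be sketched as follows; the function name, data layout and use of the binomial standard error are assumptions for illustration, not the authors' Octave code:

```python
import math

def outlier_years(yearly_counts, yearly_diagnoses, z=2.0):
    """Flag years whose PDR falls outside z SDs of the pathologist's
    overall mean rate (pooled over all years, so each case has equal
    weight in determining the chart centre)."""
    total_n = sum(yearly_counts)
    total_d = sum(yearly_diagnoses)
    mean_rate = total_d / total_n
    flagged = []
    for year, (n, d) in enumerate(zip(yearly_counts, yearly_diagnoses), start=1):
        if n < 150:  # per-year plotting threshold from the study
            continue
        se = math.sqrt(mean_rate * (1 - mean_rate) / n)
        if abs(d / n - mean_rate) > z * se:
            flagged.append(year)
    return flagged
```

For example, a pathologist reading 200 CRPS/year with a diagnosis count that jumps from ~10 to 30 in one year would have that year flagged as a "blip" in the sense of Fig. 1D,E.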
Prior to this calculation, all pathologists and all submitting physicians interpreting or submitting fewer than 600 specimens were excluded from the dataset. Uncommon nonspecific/vague anatomical sites (e.g. "anastomosis [not otherwise specified]" or "left colon [not otherwise specified]") were also excluded from the dataset to avoid the possibility of over-fitting.
Ethics approval and consent to participate. Ethics approval was obtained. Consent for publication is not applicable.

Results
In the study period, the program extracted 64,115 colorectal polyp specimens. A small number of polyp specimens (< 1%) may not have been captured; we previously analyzed this (published in abstract form 9 ): in a cohort of 11,457 large bowel polyp specimens, 68 surgical cases could not be parsed (separated into parts/specimens). The 68 cases (not parsed by the computer) were examined in detail, and it was determined that 37 had unusual report formatting (e.g. parts were out of order), 24 had a mislabelled part (e.g. "Part D" transcribed as "Part P"), and 7 had missing specimen parts (e.g. the requisition has Parts A–C, but the diagnosis section has only Parts A–B (Part C is absent)).
In the 64,115 colorectal polyp specimens that were retrieved, the hierarchical free text string matching algorithm (HFTSMA) could classify 63,050 of the specimens with regard to a diagnosis, and 63,508 with regard to the anatomical site. Several individuals independently assessed the accuracy of the computer's classification via random audits in at least 789 specimens. Prior audits suggested that the overall accuracy is ~ 97%.
The three percent that is not classified correctly is mostly not classifiable; we previously analyzed 55 of 92 unclassified colorectal polyp specimens in a cohort of 11,457 large bowel polyp specimens 9 . In most cases, the failure was nontechnical/unrelated to the HFTSMA; 19 cases were rare/descriptive diagnoses, 24 were vaguely worded diagnoses, 7 failed due to (unusual) report formatting/transcription, and 5 failed for an unknown reason.
Since the custom analysis programs have evolved in the past 2 years, we did a further random audit of the computer's classification. Four hundred polyp specimens were selected at random and the computer-generated diagnostic codes were compared to the text of the diagnosis. In this analysis, 394 cases were correctly classified and 6 not coded; this matched our prior experience. We also recently examined sessile serrated adenomas (SSA) over multiple years in a subset of ~ 7000 colorectal polyp specimens 10 . In that context, the accuracy of SSA classification was examined; in 400 randomly selected cases there were zero errors in the classification of SSA/ not SSA. Report auditing (based on the results) found systematic misclassifications in HGD and TA; these were corrected by adjusting the dictionary of diagnostic terms and re-running the analysis.
Outliers > 7 SDs from the GMDR (seen in SSA, HP and TVA + VA) prompted reviews of 100–200 randomly selected specimen reports for each of the anonymous outlier pathologists, and these confirmed that there is no significant categorization error (due to unusual reporting language) from the HFTSMA that could explain the observed diagnostic rates. An overview of the colorectal polyp cohort is found within Table S1 (see supplemental materials). 'Adenoma [not otherwise specified]' was combined with 'tubular adenoma', as these appeared to be used as synonyms by a subset of pathologists.
The control charts showed various patterns. The "in control" pattern was common, and is the expected result if (1) the individual pathologist has not changed their practice, and (2) the population disease rates are stable. Representative control charts of this type are seen in Fig. 1A–C.
Some control charts (e.g. Figure 1D,E) showed an outlier in the background of what would otherwise be "in control"; this was the most common pattern (See Table 2). A third type of chart shows a pattern (increasing or decreasing) with or without crossing control lines (e.g. Fig. 1F-H). A fourth type of chart shows a step (upward or downward) with relative stability before and afterward (e.g. Fig. 1I,J).
The control charts constructed around the pathologist's mean PDR are summarized in Table 1a.
The control charts, centred on the group median diagnostic rate (GMDR), showed many outliers (see Fig. 2A–E), and are summarized in Table 3a. Outliers (p < 0.05) were also calculated using the GMDR for each of the hospitals; the results are in Table 3c. Specimens from two hospitals are effectively shared by one group of pathologists; thus, these were considered one site for the purpose of the control chart analysis (Table 2).
The control charts based on the individual pathologist's mean PDR and those based on the GMDR are not directly comparable; however, the summary data (Tables 1a and 3a) do allow some comparison. The fractions of outliers (shown in Tables 1b, 3b,d) were calculated using the total number of elements (105 and 27, respectively). These tables show that there are proportionally fewer outliers when the data are plotted by pathologist, suggesting the individual pathologist is a very strong predictor, a result also demonstrated with logistic regression. For example, the fraction > 2 SD for HGD is 0.09, 0.41 and 0.59 for comparison to self (Table 1a), comparison to hospital site (Table 3b) and comparison to the group of 27 pathologists (Table 3d) respectively; this is also shown in Table S2 (see supplemental materials).
The outlier frequencies within Table 1a (with the exception of HGD) are highly improbable to be only a consequence of sampling. The cumulative probability of being outside two standard deviations for (1) the number of outliers and (2) all greater numbers of outliers for HGD is p ~ 0.08 (see "Appendix D" for details). The probability of being outside two standard deviations (for (1) the number of outliers and (2) all greater numbers of outliers) for all the other diagnoses is p < 0.0001. The outlier frequencies (for two standard deviations) in Table 3a are all significantly in excess of that expected due to sampling. Table 2 shows the number of pathologists by the number of outlier years for two standard deviations. Stated differently, Table 2 is a tabulation of the 105 control charts; the question answered is: how many > 2 SD outliers does each of the 15 pathologists have for a given diagnosis? Fig. 1A shows one of the 8 pathologists that had zero HGD outliers (all circles between the two dashed blue control lines). Figure 1C shows one of the two pathologists that had zero SSA outliers. The outlier-years found in Table 1a are related to the numbers in Table 2; in Table 2 for HGD: 5 pathologists with 1 outlier year each + 2 pathologists with 2 outlier years each = 9 pathologist-year outliers (> 2 SD) in Table 1a. Table 2 shows that there is good self-consistency for HGD; eight pathologists had zero outlier years. It also shows that SSA had marked changes; two pathologists had six outlier years (the normalized PDRs of one of these two pathologists are shown in Fig. 1G).
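The sampling argument can be reproduced with a binomial tail calculation. This is a sketch under the simplifying assumption that each of the 105 pathologist-year points is independently outside the 2 SD limits with probability ~ 0.05; the paper's exact method is in "Appendix D":

```python
from math import comb

def tail_prob(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing k or more
    outlier points among n chart points if each point is independently
    outside the control limits with probability p."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# 105 pathologist-year points, 9 observed > 2 SD outliers for HGD,
# per-point probability ~ 0.05; the result is close to the p ~ 0.08
# quoted above, while larger outlier counts drive p below 0.0001.
p_hgd = tail_prob(105, 9, 0.05)
```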
The random effects models (see Table 4) demonstrated that the pathologist, submitting MD, and anatomical location are all strong predictors (p < 0.0001) of histomorphologic diagnosis of TA, HGD, TVA + VA, HP, and SSA.

Discussion
The HFTSMA algorithm appears to deliver reliable categorizations that are sufficient to assess diagnostic variances on the order of 1%. Non-categorized polyps appear to represent a separate group/set of diagnoses that are predominantly descriptive diagnoses or ambiguously-worded reports that cannot be easily classified.

Figure 2. Control charts showing the normalized pathologist diagnostic rates (PDRs) for the 27 pathologists reading > 600 CRPS in the seven-year study period. Each panel (control chart) in this figure is one diagnosis, e.g. SSA. The different markers (red circles, blue Xs, black boxes) represent individual pathologists from different hospitals. The solid black line is the group median diagnostic rate (GMDR). The dashed blue control lines above and below the GMDR are 2 standard deviations (SDs) from the GMDR; pathologists outside the inner funnel are statistically different from the GMDR (p < 0.05). The solid blue control lines above and below the GMDR represent 3 SDs; pathologists outside the outer funnel are statistically different from the GMDR (p < 0.003). The outer finely dashed blue control lines above/below the GMDR are 5 SDs and 7 SDs from the GMDR respectively; points above/below the lines are p < 6e−7 and p < 3e−12 respectively. Control lines may be absent if the GMDR is close to zero (as negative normalized PDRs do not have a physical interpretation).

Table 2. Comparison of pathologists to self. Number of pathologists by the number of (> 2 standard deviation) outliers in relation to each pathologist's mean call rate over the seven-year period. This tabulation shows the number of 2 SD outliers they had for each diagnosis, e.g. 8 pathologists had zero outlier years for HGD, 1 pathologist had 5 outlier years for TVA + VA (this is shown in Fig. 1H), 2 pathologists had 4 outlier years for SSA, and 2 pathologists had 6 outlier years for SSA (one of the two pathologists is shown in Fig. 1G).
HGD high-grade dysplasia, HP hyperplastic polyp, SSA sessile serrated adenoma, TVA + VA tubulovillous adenoma + villous adenoma, TA_ad tubular adenoma + adenoma NOS, P number of pathologists.

Most of the pathologists in the cohort had relatively stable diagnostic rates over time, but there were apparent outliers. The relative uniformity in some diagnoses (e.g. high-grade dysplasia) provides good evidence against the presence of case assignment bias.
The hospital sites show some clustering of patterns in PDR. This may be mostly explained by the presence of group set-point bias rather than true differences between hospital sites.
The "clinicians" factor (submitting MD) appears to explain less variation in the data than the "pathologist" factor.
Traditional inter-rater studies look at a relatively small set of cases and rarely examine diagnostic bias over a longer period of time. This study examined the reports in an entire region over a period of seven years.
While high-grade dysplasia and villous component are predictive of neoplasia risk in large cohorts, the findings herein suggest that risk stratification using high-grade dysplasia and villous component suboptimally risk-stratifies individual patients, owing to the consistent (and presumably substantial inter-rater) variation in the pathologist diagnostic rate.
Generally, the findings demonstrate that the histomorphologic interpretation of colorectal polyps could be less varied than seen herein, and imply that (statistical) process control (or an automated analysis) that reproduces the categorization biases of one pathologist (or a panel of pathologists) would deliver more uniformity.
Based on our prior work 11 and PDR data (predominantly published as conference abstracts), we are not convinced that more sub-specialization is the only answer. We also note that disagreement among subspecialists may be very high 12 . Process changes (independent of training) may significantly improve quality 13,14 .

Limitations. A few pathologists moved between hospital sites in the 7-year period; however, none of the 15 pathologists interpreted less than 92% of their specimens from their primary site. This is a confounder that was not specifically controlled for in the construction of the control charts; however, the effect is suspected to be small.
It is not possible to determine the ideal rate(s) in this study. Whether large true differences exist between the hospital sites cannot be determined within the context of this study. It is possible that the differences between the hospital sites are totally or partially explained by group bias. A significant number of specimens (~ 300 from each hospital site) would need to be reviewed by an expert panel, as the differences are likely to be small. Normalized plots showing the polyps by lumped anatomical site (left colon, mid colon, right colon) and pathologist suggest there may be differences between the hospitals (see supplemental materials). Based on how the specimens are submitted and reported in routine practice, it is not possible to do the analysis at the level of the individual polyp.

Table 4. Generalized linear model results. "PATHOLOGIST" and "CLINICIAN" are variables that represent individual pathologists and individual submitting physicians/surgeons. "LOC_FULL_CR" is a variable that represents the anatomical location; it can be one of nine locations in the colon/rectum (rectum, rectosigmoid colon, sigmoid colon, descending colon, splenic flexure of colon, transverse colon, hepatic flexure of colon, ascending colon, cecum). "YEAR_VAR" is the year in which the specimen was accessioned. "DF" is the degrees of freedom. Tubular adenoma and adenoma NOS (TA_ad) were lumped in this analysis, as a subset of pathologists (early in the study period) signed cases out as "adenoma" without further specification; these are presumed to represent tubular adenomas.
We did not attempt to make control charts based on the yearly rates for each hospital, as the study set (15 pathologists with data over all years) was deemed to be too small to sub-stratify. This limitation was instead explored with the random effects model and logistic regression.
Significant changes over time were identified with the random effects model, thus calling into question the "disease stability" assumption that is part of the control chart analysis. We are not convinced these changes affect the overall conclusions, given the variation seen in the control charts. It is impossible to determine whether the change over time is (1) diagnostic re-calibration/diagnostic drift by selected pathologists, (2) a change in the population, or (3) some combination of drift and population change. The trend data strongly suggests that there is re-calibration. We suspect there was a shift between TA and HP in the population. Supplemental materials show how the diagnoses varied over the seven-year period. There are very clear trends in HP and SSA, which may be rationalized in the context of when SSA was first described, and by knowledge dissemination rates in medicine.
Specific healthcare provider characteristics (e.g. training, years in practice, type of practice) were not collected as part of this study. These may be significant predictors.
The study is observational, and the data collected are influenced to a certain extent by conscious changes to clinical practice. A subgroup of pathologists (due to a quality improvement project/pilot study 15 ) were aware of their diagnostic rates in the last two years of the study period, and a subset of those adjusted their practice. This likely decreased consistency with self and thus somewhat decreased the study's effect size. It is not possible to fully analyze the effect of the subgroup (due to the anonymity constraint in the study); however, the control charts show diagnostic rate changes in the early part of the study that are similar in magnitude to changes in the later part of the study period; thus, the overall conclusions are likely unaffected.
Diagnostic rate awareness and improvement. Colorectal polyps are specimens that may be infrequently reviewed at consensus rounds in relation to their volume; thus, call rate harmonization/calibration that takes place within pathology practices may not occur for these specimens. Also, random case reviews are not powered to detect modest call rate differences and would be prohibitively expensive if powered to do so.
SPC is a mechanism that may facilitate greater uniformity in reporting practices through greater dialogue about true population rates (with ideal pathological interpretation), and promote continuous review of outcome data. In the presence of significant differences in interpretation (that are unlikely to result from case assignment/ sampling), suboptimal interpretations may be suspected and a process of resolution implemented through consensus guided by outcome data.
Based on a pilot study of ~ 7054 colorectal polyp specimens (interpreted by 9 pathologists each year over two years, Sep 2015–Aug 2017), in conjunction with (1) informing each pathologist of their (TA, HP, SSA, TVA + VA) diagnostic rates, and (2) a group review of SSA cases with a gastrointestinal pathology expert, it is possible to increase uniformity in sessile serrated adenoma (SSA) diagnostic rates 15 . We suspect that this process (statistical process control) could be applied more broadly and would lead to further improvements.

Conclusions
Current diagnostic processes for colorectal polyp specimens leave significant room for further improvement. This work suggests that most pathologists have diagnostic rate stability, and that non-stable rates likely reflect (conscious or unconscious) practice changes.
Statistical process control (SPC) could result in significantly more uniformity, given that many pathologists have moderate diagnostic rate stability. Thus, the further implementation of SPC in pathology should be pursued, as it could substantially optimize and improve care.

Data availability
The data sets generated and/or analyzed during the current study are not publicly available due to confidentiality reasons, but aggregate data are available from the corresponding author on request.

n_j,normed = normalized number of specimens handled (interpreted) by healthcare provider "j".
n_j,measured = number of specimens handled (interpreted) by healthcare provider "j".
i = ideal (diagnostic) rate.

*** If the variation is less than the variation by chance, the process is in control (or one may need a larger sample size).

Appendix A (Diagnostic Codes and Search Strings)
The conditions for statistical process control and the objective of the manuscript. 'Condition 1' for SPC is met if there is diagnostic stability.
'Condition 2' for SPC is met if there is significant diagnostic variation that is stable (e.g. one pathologist is a consistent outlier in relation to the median diagnostic rate *****), and it is assumed that pathologists want to improve their practice/can be encouraged to make positive changes (see section "Diagnostic Calibration is Not New").
'Condition 1' and 'Condition 2' are sufficient to infer that SPC should be feasible and could be used to improve care. ***** It should be noted that: the 'median diagnostic rate' may not be the ideal diagnostic rate for a given population. It is possible that an 'outlier' pathologist represents the ideal diagnostic rate.
In SPC, one talks of variation due to an "assignable cause" [a modifiable factor] and "common cause" [unmodifiable factors]. In the language of SPC, the question succinctly is: Is the pathologist an assignable cause?
If diagnostic rates are stable [Condition 1], and the pathologist is an "assignable cause" [Condition 2], SPC should be feasible.