Comparative accuracy of cervical cancer screening strategies in healthy asymptomatic women: a systematic review and network meta-analysis

To compare all available accuracy data on screening strategies for identifying cervical intraepithelial neoplasia grade ≥ 2 in healthy asymptomatic women, we performed a systematic review and network meta-analysis. MEDLINE and EMBASE were searched up to October 2020 for paired-design studies of cytology and testing for high-risk genotypes of human papillomavirus (hrHPV). The methods used included a duplicate assessment of eligibility, double extraction of quantitative data, validity assessment, random-effects network meta-analysis of test accuracy, and GRADE rating. Twenty-seven prospective studies (185,269 subjects) were included. The combination of cytology (atypical squamous cells of undetermined significance or higher grades) and hrHPV testing (excepting genotyping for HPV 16 or 18 [HPV16/18]) with the either-positive criterion (OR rule) was the most sensitive/least specific, whereas the same combination with the both-positive criterion (AND rule) was the most specific/least sensitive. Compared with standalone cytology, non-HPV16/18 hrHPV assays were more sensitive/less specific. Two algorithms proposed for primary cytological testing or primary hrHPV testing were ranked in the middle as more sensitive/less specific than standalone cytology and the AND rule combinations but more specific/less sensitive than standalone hrHPV testing and the OR rule combination. Further research is needed to assess these results in population-relevant outcomes at the program level.

www.nature.com/scientificreports/
and for primary hrHPV testing 9 have been proposed, the evidence base to improve patient-important outcomes with these algorithms is immature. The comparative effectiveness of alternative screening strategies should be based on a comprehensive assessment of benefits and harms. Given the low incidence and mortality due to cervical cancer in high-income countries and the challenges associated with conducting de novo large and long-term RCTs, decision modeling is an alternative realistic option to better understand the theoretical utility of the screening options 12 . In this regard, comprehensive synthesis of the screening accuracy, a key model parameter of cytological and hrHPV testing and their available combination algorithms reported in rigorously conducted paired-design studies, is a valuable intermediate step. However, recent meta-analyses have focused on either standalone cytological and/or hrHPV testing 13–15 or a comparison of cytological testing with a specific combination algorithm not proposed in guidelines only 16 .
For those studies that assessed the diagnostic accuracy of selected and different pairs of tests of interest and their combination algorithms, network meta-analysis of diagnostic test accuracy studies is a useful approach that can compare all the assessed tests and combination algorithms in a single analysis 17 . The current study aimed to perform network meta-analysis to quantitatively compare and rank the cross-sectional accuracy of all reported screening algorithms based on cytological and hrHPV testing. We specifically focused on the comparative accuracy of guideline-proposed combination algorithms by examining data derived from primary studies of healthy asymptomatic women that addressed verification bias because such bias is commonly observed in cancer screening accuracy studies.

Methods
This extended systematic review is based on an updated evidence review conducted for revision of the Japanese Guidelines for Cervical Cancer Screening 18,19 . Although the complete evidence review was planned before analysis, no protocol was registered for this extended review. This report followed PRISMA guidelines for diagnostic test accuracy (PRISMA-DTA) 20 and did not require ethics review or patient consent.
Search strategy. We searched OVID MEDLINE and EMBASE for publications between January 1, 1992, and October 14, 2020, with no language restrictions. The search strategies are detailed in the Supplementary methods. Complementarily, the reference lists of eligible studies and relevant review articles were also screened for other appropriate studies.
Study eligibility. Three paired reviewers independently double screened the first 3000 abstracts in a calibration phase. The same reviewers single screened the remaining abstracts. Two reviewers independently determined the eligibility of potential full-text articles, with discrepancies adjudicated by a third reviewer. Only fully paired-design screening studies of cytology and hrHPV testing, either opportunistic or organized screening, aimed at detecting cervical intraepithelial neoplasia ≥ grade 2 (CIN2+) in healthy asymptomatic women were eligible for inclusion. We included all studies that performed either routine colposcopy-directed biopsy or colposcopy and selective biopsy in all screened women to verify target lesions along with studies that performed either of the colposcopy methods among women with protocol-specified screening results and statistical corrections for data from unverified samples. In studies that analyzed both eligible and ineligible populations, only those with relevant and extractable data were included. In case of multiple publications, we included the publication with the largest sample size (see Supplementary methods for more details).
Data extraction. One reviewer extracted descriptive data, which were independently confirmed by another reviewer. Next, two reviewers independently extracted numerical data, with discrepancies resolved by consensus. We preferred cross-tabulated count data over reported accuracy estimates when both data types were extractable (see Supplementary methods for more details).
Operationalization. Cytology results were standardized according to the Bethesda system 21,22 if other classification systems had been used. For studies that used both conventional and liquid-based cytology tests (CC and LBC, respectively), we favored LBC data over CC data; we jointly analyzed both smear preparation methods.
We operationally categorized combination tests as follows: (i) combination algorithms based on the OR rule (women with either test positive were categorized as screening positive and women with both tests negative as screening negative) or the AND rule (women with both tests positive were categorized as screening positive and women with at least one negative test as screening negative) 24 ; (ii) thresholds for cytological testing, e.g., atypical squamous cells of undetermined significance or worse grades (≥ ASCUS), or low- or high-grade squamous intraepithelial lesions or worse grades (≥ LSIL or ≥ HSIL, respectively); and (iii) hrHPV assays (Table 1) 6–9,25 . As cross-sectional representations of guideline-proposed algorithms, we assessed two specific strategies: "≥ LSIL OR [hrHPV AND ASCUS]", which classified as screening positive only women with cytologic testing ≥ LSIL or with both cytologic testing ASCUS and positive hrHPV testing; and "HPV16/18(/45) OR [hrHPV AND ≥ ASCUS]", which classified as screening positive only women with positive HPV16/18(/45) genotyping or with both positive hrHPV testing and cytologic testing ≥ ASCUS.
Quality assessment. Paired independent reviewers double rated the validity of a study using a risk of bias tool for comparative diagnostic accuracy studies (QUADAS-C) 26 , an extension to the existing Quality Assessment of Diagnostic Accuracy Studies 2 tool 27 . Discrepancies were resolved via consensus. Operationally, a study was defined to have low risk of verification bias only when all screened samples had been histologically verified.
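The OR and AND rules and the two guideline-proposed algorithms described under Operationalization can be read as simple Boolean classifiers. The following is a minimal sketch, not the authors' code: the `Cytology` enum, its simplified four-grade ordering (other Bethesda categories such as ASC-H are omitted), and all function names are assumptions introduced for illustration, and the two algorithm functions follow a literal reading of the algorithm labels.

```python
from enum import IntEnum


class Cytology(IntEnum):
    """Simplified ordinal cytology grades (assumption: other categories omitted)."""
    NILM = 0   # negative for intraepithelial lesion or malignancy
    ASCUS = 1
    LSIL = 2
    HSIL = 3


def or_rule(test_a_positive: bool, test_b_positive: bool) -> bool:
    """OR rule: screening positive if either test is positive."""
    return test_a_positive or test_b_positive


def and_rule(test_a_positive: bool, test_b_positive: bool) -> bool:
    """AND rule: screening positive only if both tests are positive."""
    return test_a_positive and test_b_positive


def lsil_or_hrhpv_and_ascus(cytology: Cytology, hrhpv_positive: bool) -> bool:
    """Literal reading of ">= LSIL OR [hrHPV AND ASCUS]"."""
    return cytology >= Cytology.LSIL or (
        hrhpv_positive and cytology == Cytology.ASCUS
    )


def hpv1618_or_hrhpv_and_ascus(
    cytology: Cytology, hrhpv_positive: bool, hpv1618_positive: bool
) -> bool:
    """Literal reading of "HPV16/18 OR [hrHPV AND >= ASCUS]"."""
    return hpv1618_positive or (
        hrhpv_positive and cytology >= Cytology.ASCUS
    )
```

Under this reading, a woman with ASCUS cytology is screening positive on the first algorithm only when she is also hrHPV positive, whereas a woman with normal cytology is screening positive on the second algorithm only when she is HPV16/18 positive.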
Data synthesis and statistical analysis. The primary outcome was sensitivity and specificity for detecting CIN2+. As measures of comparative accuracy, we used the relative values of, and absolute differences (Δ) in, sensitivity and specificity between any pair of alternative screening algorithms (e.g., a standalone test vs. a combination algorithm). Between-study heterogeneity was assessed visually by using crosshair plots of sensitivity and specificity estimates in the receiver operating characteristic (ROC) space 28 . We calculated the average sensitivity and specificity estimates and their derived relative and Δ sensitivity and specificity values with their corresponding 95% credible intervals (CrIs) by using an arm-based, two-stage hierarchical, Bayesian bivariate random-effects network meta-analysis model 29 . Credible regions for the average estimates were constructed by using the standard method 30 . For comparison, we also calculated average sensitivity and specificity estimates separately by using the standard bivariate meta-analysis model for diagnostic accuracy 31 . Hierarchical summary ROC (HSROC) curves were derived on the basis of the estimated parameters 32 .
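To make the comparative measures concrete, the sketch below computes relative and Δ sensitivity and specificity for one pair of tests from 2 × 2 counts. It is a crude single-study illustration, not the hierarchical Bayesian model used in the analysis; the function names and the counts in the usage note are hypothetical.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Proportion of CIN2+ women who screen positive."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Proportion of women without CIN2+ who screen negative."""
    return tn / (tn + fp)


def comparative_accuracy(index: dict, comparator: dict) -> dict:
    """Relative and absolute-difference measures for index vs. comparator.

    Each argument is a dict of hypothetical 2x2 counts with keys
    'tp', 'fn', 'tn', 'fp' measured in the same women (paired design).
    """
    se_i = sensitivity(index["tp"], index["fn"])
    se_c = sensitivity(comparator["tp"], comparator["fn"])
    sp_i = specificity(index["tn"], index["fp"])
    sp_c = specificity(comparator["tn"], comparator["fp"])
    return {
        "relative_sensitivity": se_i / se_c,
        "delta_sensitivity": se_i - se_c,
        "relative_specificity": sp_i / sp_c,
        "delta_specificity": sp_i - sp_c,
    }
```

For example, with hypothetical counts `{"tp": 90, "fn": 10, "tn": 800, "fp": 100}` for the index test and `{"tp": 60, "fn": 40, "tn": 870, "fp": 30}` for the comparator, the index test has a relative sensitivity above 1 and a relative specificity below 1, i.e., it is more sensitive and less specific.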
We performed study-level univariable meta-regression for the following prespecified binary predictors when ≥ 10 studies were available: study location (countries ranked as "very high human development" by the Human Development Index 2017 33 vs. those that were not), study design (histology-based vs. colposcopy-based verification), and type of sample collectors (physicians vs. nonphysicians). Scarce data on young individuals (< 30 years old) precluded meta-regression based on age. Complete details of the methodology, model fitting, choice of prior distributions for parameters assessed, and operational definitions used in sensitivity analyses are provided in the Supplementary methods.
We used the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) tool 34 to assess the certainty of evidence and focused on the comparisons among cytological testing (≥ ASCUS) alone, standalone hrHPV assays, and the guideline-proposed combination algorithms. For calculating false negatives (FNs) and false positives (FPs), we assumed a healthy screening population of 1,000 women in which 20 are CIN2+ (i.e., a prevalence of 2%) 13 . We did not evaluate funnel-plot asymmetry because the required tests did not permit valid assessment of the extent and impact of missing studies 20 . All analyses were performed by using WinBUGS 1.4.3 (MRC Biostatistics Unit, Cambridge, UK) and Stata/SE 16.1 (Stata Corp, College Station, TX) 35 . We estimated the probability that the true value (i.e., posterior distribution) of relative sensitivity or specificity was ≥ 1 (or ≤ 1) as a measure of superiority of a test over a comparator test. A conventional, frequentist, two-tailed P-value of 0.05 corresponds to a Bayesian posterior probability of 0.025, which we considered to be the threshold of statistical significance.

Results
Characteristics of included studies. All included studies had a prospective design, and 14 studies (52%) were from high-income countries (Table 2). The average age of study participants ranged from 25 to 47 years. Data on type of sample collectors were available for 20 studies (74%), with physician collectors in 14 studies and nonphysician providers, typically trained nurses or midwives, in 6 studies. Thirteen studies had used only CC, and 12 had adopted only LBC, whereas two other studies had used both CC and LBC (Table 2). Of the four available hrHPV testing subgroups, HC2 was the most commonly reported hrHPV assay (assessed in 20 studies), whereas six studies assessed PCR-based tests, four genotyped for HPV16/18, and three used mRNA-based tests, of which also genotyped for HPV16/18/45. Data on one or more combination algorithm(s) were available in 19 studies (reported in 20 publications; 70%). The most commonly assessed combinations were

Risk of bias.
Although the studies were predominantly well conducted, their designs varied substantially, and several sources of bias were observed (Supplementary Fig. S2), such as lack of blinding of the colposcopists or grading pathologists to the screening results. Additionally, verification bias could not be ruled out in studies that did not perform histological evaluation of all samples.
Topology of direct comparisons of alternative screening algorithms. Figure 1 shows the network of compared algorithms available from the 27 studies, and Supplementary
Sensitivity and specificity. The sensitivity estimates varied substantially across studies with broad confidence intervals (CIs); the specificity values also varied although their CIs were narrow (Supplementary Fig. S3). Large between-study heterogeneity was visually noted in studies of HC2, all thresholds of cytological testing, and their combinations. These results were also reflected in large credible and predictive regions of the average sensitivity and specificity in the separately performed standard bivariate meta-analyses (Supplementary Fig. S4).
Although data points were limited, heterogeneity was less prominent in PCR and PCR-based combinations. See Supplementary Fig. S5 for the average estimates of screening accuracy based on the standard meta-analysis.
Figure 2 provides the average accuracy estimates and ranking estimated through the network meta-analysis. Overall, the combinations with the OR rule of hrHPV and cytological testing were most sensitive and least specific, whereas combinations with the AND rule of hrHPV and cytological testing were most specific and least sensitive. The rankings estimated in the network meta-analysis reflected the trade-off between sensitivity and specificity by altering the thresholds; lowering the thresholds of cytological testing (e.g., from ≥ HSIL to ≥ ASCUS) led to higher sensitivity but at the cost of reduced specificity, and tightening the thresholds increased specificity at the cost of reduced sensitivity. This behavior resulted in average estimates and rankings for tests or combination algorithms relying on few studies (e.g., HPV16/18- and mRNA-based combinations assessed in only one study each), which were inconsistent with the standard meta-analysis.
Comparative accuracy among combination algorithms based on specific hrHPV assays. The ROC plots of the average accuracy estimates and their credible regions reflected the effect of altering the thresholds in combined cytological testing (i.e., lower thresholds with increased sensitivity and decreased specificity, and higher thresholds with increased specificity and decreased sensitivity) and the effect of combination methods (i.e., the OR rule with increased sensitivity and decreased specificity, and the AND rule with increased specificity and decreased sensitivity) across the subgroups based on alternative hrHPV assays (Fig. 3b-d). Among 45 pairwise comparisons based on cytology, HC2, and their combinations, most (40 [89%] for sensitivity and 42 [93%] for specificity) showed a significant difference, reflecting the effect of the thresholds and combination methods (Fig. 3b, Supplementary Table S8). Similarly, among 36 pairwise comparisons based on cytology, PCR-based tests, and their combinations, 28 (78%) for sensitivity and 27 (75%) for specificity showed a significant difference (Fig. 3c, Supplementary Table S9). In contrast, among 10 pairwise comparisons based on mRNA-based combinations (Fig. 3d, Supplementary Table S10), only five (50%) and four (40%) contrasts for sensitivity and specificity, respectively, were significantly different.
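The significance calls in these pairwise comparisons rest on posterior probabilities of the relative measures. A minimal sketch of that calculation from posterior draws follows; it assumes paired draws (one value per MCMC iteration for each test), applies the 0.025 threshold in both tails as an illustrative convention, and uses hypothetical function names and values rather than anything taken from the study.

```python
from typing import Sequence


def prob_ratio_at_least_one(
    index_draws: Sequence[float], comparator_draws: Sequence[float]
) -> float:
    """Share of paired posterior draws in which index / comparator >= 1."""
    if len(index_draws) != len(comparator_draws) or not index_draws:
        raise ValueError("draws must be non-empty and of equal length")
    ratios = [i / c for i, c in zip(index_draws, comparator_draws)]
    return sum(r >= 1.0 for r in ratios) / len(ratios)


def is_significant(p_at_least_one: float, threshold: float = 0.025) -> bool:
    """Flag a difference when the posterior mass lies almost entirely on one side of 1.

    Assumption: the 0.025 threshold is applied to either tail.
    """
    return p_at_least_one <= threshold or p_at_least_one >= 1.0 - threshold
```

With real output, the draws would come from the fitted model's sampled sensitivities or specificities; the same function applies to either measure.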
Comparative accuracy and GRADE assessment of guideline-proposed combination algorithms. Data on the guideline-proposed algorithms are available for HC2 and PCR-based tests on "≥ LSIL OR [hrHPV AND ASCUS]" and for mRNA-based tests and PCR-based tests on "HPV16/18(/45) OR [hrHPV AND ≥ ASCUS]". Table 3 summarizes the comparative accuracy, and Supplementary Table S11 and Table 4 show the GRADE summary of findings on specific tests or combination algorithms and their comparisons, respectively.
Table 3. Comparative accuracy of guideline-proposed combination algorithms. Above the diagonal line (formed by cells with an en dash) represents relative sensitivity (95% CrI) [probability that relative sensitivity is ≥ 1] and below the diagonal line represents relative specificity (95% CrI) [probability that relative specificity is ≤ 1]. For relative sensitivity, the rows and columns, respectively, represent the index (the test of interest) and comparator (the test in comparison) tests or combination algorithms. For relative specificity, the columns and rows, respectively, represent the index and comparator tests or combination algorithms. ASCUS atypical squamous cells of undetermined significance, CrI credible interval, HC2 Hybrid Capture 2, HPV16/18(/45) genotyping for HPV types 16 or 18 (or 45), HSIL high-grade squamous intraepithelial lesion, LBC liquid-based cytology, LSIL low-grade squamous intraepithelial lesion, mRNA messenger ribonucleic acid, PCR polymerase chain reaction.
In general, the proposed algorithms were less sensitive but more specific than the standalone component hrHPV assays. However, only HC2-based "≥ LSIL OR [hrHPV AND ASCUS]" and PCR-based "HPV16/18 OR [hrHPV AND ≥ ASCUS]" were significantly less sensitive (the average relative sensitivity ranged from 0.74 to 0.79; Bayesian P(≥ 1) ranged from < 0.001 to 0.003) and more specific (the average relative specificity ranged from 1.04 to 1.10; Bayesian P(≤ 1) ranged from < 0.001 to 0.004). These results suggested that the proposed algorithms, compared with their standalone component hrHPV tests, yielded on average 44 to 88 fewer FPs but 4 to 5 more FNs (very low to low certainty of evidence). In contrast, the proposed algorithms were in general equally specific but more sensitive than standalone ≥ ASCUS. However, only PCR-based "≥ LSIL OR [hrHPV AND ASCUS]" was significantly less sensitive than ≥ ASCUS alone (relative sensitivity = 0.73 [CrI: 0.59–0.92; Bayesian P(≥ 1) = 0.004]; four more FNs [CrI: 1–7]; very low certainty of evidence), but evidence as to whether this combination was more specific or less specific than ≥ ASCUS alone was insufficient (relative specificity = 0.98 [CrI: 0.96–1.00; Bayesian P(≥ 1) = 0.04]).
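The FN and FP differences reported here follow from applying relative sensitivity and specificity to the assumed screening population of 1,000 women with 20 CIN2+ cases. The sketch below shows that arithmetic; the comparator accuracy values in the usage note are hypothetical placeholders, not estimates from this study, and the function names are illustrative.

```python
def fn_fp_per_population(
    sens: float, spec: float, n: int = 1000, prevalence: float = 0.02
) -> tuple:
    """Expected false negatives and false positives in a screened population."""
    diseased = n * prevalence   # 20 women with CIN2+ under the stated assumption
    healthy = n - diseased      # 980 women without CIN2+
    return diseased * (1.0 - sens), healthy * (1.0 - spec)


def fn_fp_difference(
    comparator_sens: float,
    comparator_spec: float,
    relative_sens: float,
    relative_spec: float,
) -> tuple:
    """Change in FNs and FPs (index minus comparator) per 1,000 women screened."""
    index_sens = comparator_sens * relative_sens
    index_spec = comparator_spec * relative_spec
    fn_i, fp_i = fn_fp_per_population(index_sens, index_spec)
    fn_c, fp_c = fn_fp_per_population(comparator_sens, comparator_spec)
    return fn_i - fn_c, fp_i - fp_c
```

For instance, `fn_fp_difference(0.90, 0.90, 0.75, 1.05)` uses a hypothetical comparator with 90% sensitivity and specificity; a positive first element means more missed CIN2+ cases and a negative second element means fewer false alarms for the index algorithm.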
Comparative evidence across alternative guideline-proposed algorithms was generally limited. PCR-based "≥ LSIL OR [hrHPV AND ASCUS]" was significantly more specific and less sensitive than "HPV16/18 OR
Meta-regression and sensitivity analyses. Due to data paucity, meta-regression was undertaken for only HC2, cytological testing, and their OR combination separately. Although high-income countries (vs. non-high-income countries) for sensitivity of HC2 and sample collection by physicians (vs. nonphysician collectors) for sensitivity and specificity of ≥ ASCUS were associated with higher estimates, these covariates were no longer associated with higher (or lower) sensitivity or specificity in their combination, HC2 OR ≥ ASCUS (Supplementary Fig. S8).
The sensitivity analysis using the model with a common correlation parameter across tests yielded results comparable to those of the main analysis based on the model with test-specific correlation parameters (Supplementary Table S12). Relaxing threshold constraints yielded results not compliant with the expected threshold effects for two specific thresholds for cytological testing (≥ LSIL and ≥ ASCH) and unstable results with wide CrIs for sensitivity in four combination algorithms (i.e., mRNA AND ≥ ASCUS, HPV16/18 AND ≥ ASCUS, HPV16/18 OR ≥ ASCUS, "≥ LSIL OR [PCR AND ASCUS]", and "≥ HSIL OR [HC2 AND ≥ ASCUS]") regardless of whether correlation parameters were separately assumed or not; all of these tests, except for ≥ LSIL, depended on only a few primary studies. With lower deviance information criterion estimates, the models with threshold constraints were deemed to be better-fitting than the models without threshold constraints; however, the differences were < 5, suggesting no definitively preferred model.

Discussion
To the best of our knowledge, this is the first network meta-analysis that has comprehensively compared and ranked the cross-sectional screening accuracy of standalone cytology or hrHPV testing with combination algorithms for detecting CIN2+. Importantly, this analysis is based on published accuracy estimates from fully paired-design comparative accuracy studies that addressed verification bias. First, our network meta-analysis confirmed and quantified the theoretically expected gain in and trade-off of screening performance when combining two tests 24 , that is, the combinations with the OR rule (i.e., either test positive) of hrHPV and cytological testing were most sensitive and least specific, whereas combinations with the AND rule (i.e., both tests positive) of hrHPV and cytological testing were most specific and least sensitive. Second, our network meta-analysis confirmed that the guideline-proposed combination algorithms, HC2-based "≥ LSIL OR [hrHPV AND ASCUS]" and PCR-based "HPV16/18 OR [hrHPV AND ≥ ASCUS]", appeared to compensate for the shortcomings of the two component tests if used as standalone, which, though expected theoretically, had never been quantitatively synthesized. Specifically, these proposed algorithms were not as sensitive as but more specific than the component standalone hrHPV testing. Similarly, these proposed algorithms appeared equally specific but more sensitive than standalone ≥ ASCUS, though definitive conclusions could not be made due to limited comparative data. Third, sparse, insufficient comparative evidence precluded reliable assessment of the comparative accuracy across these alternative guideline-proposed algorithms.
Effectiveness of screening should be assessed as a whole program consisting of a set of activities 71 . Since the ultimate goal is to maximize participant-relevant benefits and simultaneously minimize harms, accuracy of testing is, though an important measure, only an intermediate parameter.
As already elucidated in the previous meta-analyses 13,14 , which is congruent with our results, standalone testing for hrHPV using an assay other than HPV 16/18 genotyping, if all screen-positive women underwent colposcopy, would identify more women with CIN2+ than cytological testing alone but at the cost of more healthy women misclassified as CIN2+. The OR rule combinations, the most sensitive group of strategies found in our meta-analysis, if used for primary co-testing (i.e., performing both tests concurrently), would further increase the number of healthy women misclassified as CIN2+ while identifying only a few more women with CIN2+. The consequences of such FP results include unnecessary colposcopy, triage, or repeat testing with cytology, hrHPV, or other tests. Although infections with hrHPV, and HPV16/18 in particular, carry a higher risk of progression than positive cytology 72–75 , immediate incremental costs and psychological burden incurred due to increased false-positive results may not be justified in low-risk screening settings as only a fraction of the identified CIN2+ lesions detected through standalone hrHPV testing or its combinations progress to invasive cancer; the others actually carry a moderate chance of regression 76 . The AND rule combinations, the most specific group of strategies identified in our meta-analysis, may substantially minimize FPs and their negative consequences. However, their sensitivity is lower than that of cytology alone (≥ ASCUS), potentially leading to non-negligible numbers of FNs depending on the prevalence of CIN2+ in a screened population.
As interim recommendations, several protocols for triage and/or repeat testing followed by colposcopy for screen-positive women have been proposed by professional societies. "≥ LSIL OR [hrHPV AND ASCUS]" and "HPV16/18 OR [hrHPV AND ≥ ASCUS]" were cross-sectional representations for two such protocols, respectively, proposed for positive primary cytological testing 11 and primary hrHPV testing 9 . Our meta-analysis found that the accuracy of these combination algorithms was generally ranked in the middle, being more sensitive and less specific than standalone cytology (≥ ASCUS) and the AND rule combinations but more specific and less sensitive than standalone hrHPV testing and the OR rule combination. We also quantified how each combination algorithm increased or decreased the number of FNs and FPs relative to those of another specific standalone test or combination, which is a strength of our study results. However, any benefits and harms associated with specific screening tests or combinations should be formally assessed at the whole program level along with its necessary resources and costs 71 .
We focused on cross-sectional accuracy of initial screening tests or combinations and their immediate consequences. Our accuracy-based arguments necessarily lack long-term outcomes. Given the chance of regression 76 , the results based on our cross-sectional approach may be only relevant in populations with a low participation rate of follow-up testing. Additionally, the positive criteria we adopted for the estimation of accuracy do not necessarily represent the optimal indications of colposcopy in real-life practice; rather the criteria included the joint indications of any additional intervention; i.e., triage and/or repeat testing, colposcopy, and immediate direct treatments jointly. In this regard, a recent expert consensus statement proposed individualized risk-based management decisions based on the combinations of the available screening results 77 .
Colposcopy-directed biopsy is an imperfect test even for routine biopsies on normal-appearing sites 78 and more so for colposcopy and selective biopsy 79 . Despite the theoretical superiority of verification bias-corrected accuracy estimates over naïvely calculated estimates, these corrections are not error-free. Given the complex mechanisms of missing verification 80 and limitations in inverse probability weighting 81 , bias may not necessarily have been corrected in the right direction. In addition, the effect of the excluded observations due to unsatisfactory or missing test results, even though the reported proportions were not substantial, could be unpredictably large. Furthermore, our meta-analysis was based on aggregate data and thus only accounted for the dependence