Diagnostic accuracy of salivary gland ultrasonography with different scoring systems in Sjögren’s syndrome: a systematic review and meta-analysis

Noninvasive, objective salivary gland ultrasonography (SGU) has been widely used to evaluate major salivary gland involvement in primary Sjögren’s syndrome (pSS) and treatment responses. However, the scoring systems, diagnostic sensitivity, and diagnostic specificity varied significantly among clinical studies. We conducted this meta-analysis to assess the diagnostic accuracy of different SGU scoring systems using the American-European Consensus Group criteria. Of the 1301 articles retrieved from six databases, 24 met the criteria for quality assessment and 14 for meta-analysis. The pooled sensitivities were 75% (0–4) with I2 = 92.0%, 84% (0–16) with I2 = 63.6%, and 75% (0–48) with I2 = 90.9%; the pooled specificities were 93% (0–4) with I2 = 71.5%, 88% (0–16) with I2 = 65.4%, and 95% (0–48) with I2 = 83.9%; the pooled diagnostic odds ratios were 71.26 (0–4) with I2 = 0%, 46.3 (0–16) with I2 = 73.8%, and 66.07 (0–48) with I2 = 0%; the areas under the SROC curves were 0.95 (0–4), 0.93 (0–16), and 0.94 (0–48). These results indicated that the 0–4 scoring system has higher specificity and less heterogeneity than the other systems, and could be used as a universal SGU diagnostic standard.

among these clinical studies. Therefore, a meta-analysis of these existing clinical studies is needed to evaluate which scoring system has lower heterogeneity. To our knowledge, the meta-analysis conducted by Delli et al. is the only one that has assessed the diagnostic properties of SGU in the diagnosis of SS 2. It has been established that a single gold standard should be used in a meta-analysis. However, Delli et al. used multiple gold standards, i.e., FC, JDC, CC, TC, ECSG, AECG, and RJDC 2. In addition, Delli et al. did not perform subgroup analyses, which likely introduced bias. Therefore, a meta-analysis of the existing studies by subgroup, using a single gold standard, is urgently needed to determine whether SGU is a highly specific pSS diagnostic tool and which SGU scoring system can be used as a universal diagnostic standard. To that end, we used the American-European Consensus Group (AECG) criteria as the gold standard and performed subgroup analyses per SGU scoring system.

Results
Study identification and selection. A total of 1301 studies were identified in the six databases. One thousand one hundred and eighty-five studies were excluded based on titles and abstracts, and 92 based on the exclusion criteria. Quality assessment was performed using QUADAS-2 in the remaining 24 studies (Fig. 1), all of which used the AECG criteria for the diagnosis of SS. Of the 24 studies, one did not clearly report the scoring system 17, and one used a self-defined complex scoring system 21. Four scoring systems were used in the other 22 studies. Because the 0-12 scoring system was used in only two studies, the final meta-analysis focused on 14 included studies, with three scoring systems as subgroups (0-4, 0-16, and 0-48).

Study characteristics.
A total of 3360 subjects were enrolled in the 24 studies, including 1976 SS patients and 1384 control subjects (Table 1). Fifteen studies included only pSS patients, 3 studies included both pSS and secondary SS (sSS) patients, and 6 studies did not specify the type of disease (pSS or sSS). Four studies used sSS patients as controls, 18 studies used subjects with sicca symptoms but without SS as controls, and 12 studies used healthy controls. Overall, 12 studies had only one control group, 10 studies had more than one control group, and 2 studies had no control group.
Ultrasonography scoring systems and the subgroups. Fourteen studies used the 0-4 scoring system, including its 0-3 variant (Table 1) 34. The positive criterion was mild parenchymal inhomogeneity, seen as multiple hypoechogenic areas measuring <2 mm with blurred borders. Eight studies scored on a 0-4 scale and six on a 0-3 scale, with the same positive criterion. Seven studies were excluded: one used the AECG criteria only partly as the gold standard 24, two used a positive criterion lower than the one described above 12,15, two had no control group 14,26, one used a self-defined complicated scoring system 21, and one used the same patient population as another study 30. The remaining seven studies were included as the 0-4 scoring subgroup in the meta-analysis 10,11,25,27,31-33.
Two studies used the 0-12 scoring system 19,29. The scores ranged from 0 to 12, and a score ≥6 was considered positive. The 0-12 scoring system was excluded from the meta-analysis because of the small number of studies.
Six studies used the 0-16 scoring system, which was first reported by Salaffi et al. 35. The scores ranged from 0 to 16. One study considered ≥5 as the positive criterion 33; two studies ≥6 16,29; two studies ≥7 23,28; and one study ≥8 20. These six studies were included as the 0-16 scoring subgroup in the meta-analysis.
Five studies used the 0-48 scoring system, which was first reported by Hocevar et al. 13. The scores ranged from 0 to 48. Two studies considered ≥17 as the positive criterion 13,29; one study ≥15 28; and one study ≥19 18. One study did not describe the cut-off value and was excluded 22. The four studies that described their cut-off values were included as the 0-48 scoring subgroup in the meta-analysis.
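The effect of study-specific cut-off values can be illustrated with a minimal sketch. The cut-offs are those reported above for the four included 0-48 studies; the patient score of 16 and the function names are hypothetical and serve only to show that the same score can be classified differently from study to study.

```python
# Cut-off values reported for the 0-48 scoring system in the four included
# studies (two used >=17, one >=15, one >=19).
CUTOFFS_0_48 = [17, 17, 15, 19]


def is_positive(score: int, cutoff: int) -> bool:
    """Return True when a total SGU score meets or exceeds a study's cut-off."""
    return score >= cutoff


def classify_across_studies(score: int, cutoffs: list[int]) -> list[bool]:
    """Classify one score under each study's cut-off, in the order given."""
    return [is_positive(score, c) for c in cutoffs]


# Hypothetical patient with a total 0-48 score of 16 (illustrative value only).
print(classify_across_studies(16, CUTOFFS_0_48))
```

Under these assumptions the same patient is positive in only one of the four studies, which is one way in which varying cut-offs can contribute to between-study heterogeneity in sensitivity and specificity.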
The 0-4 scoring system had the least variation in specificity and diagnostic OR (0.90-0.95 and 42.29-120.09, respectively) when compared with the 0-16 and 0-48 scoring systems. In addition, the 0-4 scoring system had a universal cut-off value of 1 or 2, whereas the other two scoring systems did not. These results indicated that the 0-4 scoring system is the more consistent scoring system.

Diagnostic accuracy.
In the 0-4 scoring system, the pooled sensitivity was 75% (71-79%) with I2 = 92.0% (df = 6, p < 0.001) (Table 3); the pooled specificity was 93% (90-95%) with I2 = 71.5% (df = 6, p = 0.0018); and the pooled DOR was 71.26 (42.29-120.09) with I2 = 0% (df = 6, p = 0.643). In the 0-16 scoring system, the pooled sensitivity was 84% (81-87%) with I2 = 63.6% (df = 5, p = 0.0174); the pooled specificity was 88% (85-91%) with I2 = 65.4% (df = 5, p = 0.0129); and the pooled DOR was 46.3 (19.95-107.44) with I2 = 73.8% (df = 5, p = 0.0019). In the 0-48 scoring system, the pooled sensitivity was 75% (70-80%) with I2 = 90.9% (df = 3, p < 0.001); the pooled specificity was 95% (91-97%) with I2 = 83.9% (df = 3, p = 0.0003); and the pooled DOR was 66.07 (33.73-129.42) with I2 = 0% (df = 3, p = 0.551). In summary, the 0-16 scoring system had the highest sensitivity (84%) with a relatively small I2 (63.6%); the 0-48 scoring system had the highest specificity (95%), similar to that of the 0-4 system (93%); and the 0-4 and 0-48 scoring systems had the highest DORs (71.26 and 66.07, both with I2 = 0%).
Because the cut-off values differed among the scoring systems, SROC analyses were performed (Fig. 2). The summary operating sensitivities were 78% (65-88%) for the 0-4 scoring system, 85% (79-89%) for the 0-16 scoring system, and 78% (61-89%) for the 0-48 scoring system; the summary operating specificities were 95% (89-98%)
Taken together, the heterogeneity of the pooled DOR for the 0-4 and 0-48 scoring systems was 0% (no heterogeneity), suggesting that these two scoring systems are reliable. However, the 0-4 scoring system appeared to be the best of the three because (i) the cut-off value was pre-specified in the 0-
Quality assessment and risk of bias of the studies.
High risk of bias was observed in "patient selection" owing to variations in the inclusion and exclusion criteria (e.g., whether a case-control design was used) and in the patient selection criteria (i.e., whether patients were enrolled consecutively or randomly) (Fig. 3). High risk of bias was also observed in the "conduct and interpretation of index test" owing to the designs of the original studies (e.g., whether the SGU results were interpreted with knowledge of the SS diagnosis, and whether a threshold was pre-specified). Twenty-four studies were included in the QUADAS-2 quality assessment: 21 studies used only one scoring system, 2 studies used two scoring systems, and 1 study used three scoring systems (Table 1). The most frequent high risks of bias were those due to patient selection and index test. In particular, 14 studies (58.3%) and 10 studies (41.7%) were rated as "high risk" for patient selection and index test, respectively (Table 4). In contrast, all the studies were rated as "low risk" for reference standard and for flow and timing. More studies had low concerns in the applicability of patient selection (58.3%) than in the applicability of index test (41.7%) and the applicability of reference standard (0%). These results indicate that the applicability of SGU was high.
An ultrasound picture scored with different scoring systems. A direct comparison of the different scoring systems in the same patient was performed (Fig. 4). The 0-4 system is clearly distinguished from the other 3 systems, whereas those 3 systems are essentially proportional to one another (Fig. 4, lower right).

Discussion
Many clinical studies using different diagnostic criteria and scoring systems have indicated that SGU is sensitive and specific for pSS. In contrast, meta-analyses of these studies are scarce. Only two meta-analyses have been published on the diagnostic value of SGU in SS patients 2,36. One compared the diagnostic properties of ultrasonography and sialography in SS, demonstrating that ultrasonography was comparable with sialography 36. However, that meta-analysis included only six studies and could not establish the diagnostic value of ultrasonography in SS. In addition, its assessment of research methodology was less rigorous, with high risk of bias in all QUADAS-2 domains, raising concerns about the outcomes 2. The other study performed a good quality assessment 2. However, that meta-analysis did not distinguish among the diagnostic criteria or the scoring systems. In addition, the included studies were not rigorously designed and performed, as their results showed significant heterogeneity. Therefore, the quality of the pooled outcomes (sensitivity, specificity, and diagnostic odds ratio) was low. The likely source of this heterogeneity was the ultrasonography scoring systems. To our knowledge, the current study is the first meta-analysis to perform subgroup analyses of different scoring systems using only one diagnostic criterion.
The different cut-off values in the 0-16 and 0-48 systems resulted in relatively large heterogeneity of sensitivity and specificity. To reduce this heterogeneity, we conducted SROC curve analyses in the three subgroups. Our results indicated that all three systems are reliable diagnostic tools with similar accuracy (SROC AUC 0.95 (0-4), 0.93 (0-16), and 0.94 (0-48)).
In the 0-4 system, the sensitivity was 75%, the specificity was 93%, the heterogeneity of the DOR was 0%, and the cut-off value was pre-specified. In addition, the procedure was simple and took less time. These advantages allowed the 0-4 system to outperform the 0-16 and 0-48 systems. In contrast, although both the 0-16 and 0-48 systems were reliable scoring systems with similar AUCs, their cut-off values were not pre-specified, indicating that these scoring systems could not be used as an SGU diagnostic standard. Taken together, the 0-4 scoring system seems better suited for use as a universal SGU diagnostic standard, with higher specificity and less heterogeneity than the other scoring systems (0-16 and 0-48). Indeed, the 0-4 system is clearly distinguished from the other 3 systems (Fig. 4, lower right).
This study has several strengths. First, four main scoring systems are used clinically, each of which has its own advantages. It is of clinical significance to meta-analyze the different scoring systems as separate subgroups, both to decrease possible heterogeneity and to establish which scoring system is the best overall. Our results indicated that the 0-4 scoring system was the best of the three scoring systems as the diagnostic criterion. In particular, the heterogeneity of the pooled DOR for the 0-4 and 0-48 scoring systems was 0% (no heterogeneity), indicating that these scoring systems are reliable. Of the two, we consider the 0-4 scoring system better because its cut-off value is pre-specified, whereas the cut-off value of the 0-48 scoring system differs among studies. In addition, the heterogeneity of the pooled specificity was high in the 0-48 system. The heterogeneity of the pooled sensitivity was very high for both the 0-4 and 0-48 scoring systems. This may be related to the selection of patients and control groups.
April 15, 2018) were searched with the keywords ("salivary gland", "parotid gland", or "submandibular gland") and ("ultrasonography", "ultrasound", or "sonography"), and ("Sjögren's syndrome", "Sjögren syndrome", "sicca syndrome", or "sicca").

Study Selection.
Inclusion criteria were studies containing data on the diagnostic value of SGU for pSS, using the AECG criteria as the diagnostic criteria, and including more than 20 cases. Exclusion criteria for titles and abstracts were: case reports, case series with fewer than 20 cases, letters to the editor, experts' opinions, review articles, studies without data on the diagnostic value of SGU, and studies that used non-AECG criteria as diagnostic criteria. Studies were assessed in full if the title and abstract provided only limited information or in case of doubt. Two independent researchers (M.Z. and S.S.) initially evaluated the titles and abstracts for eligibility per the inclusion and exclusion criteria. Disagreements were resolved through consensus. The full texts of eligible studies were screened by the diagnostic criteria for SS. The studies using the AECG criteria as the gold standard were finally selected for this study.
Data Extraction. Two researchers (M.Z. and S.S.) extracted the data independently. Disagreements were resolved through consensus. Extracted information included the description of the population, publication year, study type, study design, diagnostic criteria for SS, the definition of the scoring systems in the studies (Supplemental Materials STable 1), and the ultrasonographic scoring system, as well as the numbers of true positives, true negatives, false positives, and false negatives.
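As an illustration of how the extracted counts translate into per-study accuracy measures, the sketch below computes sensitivity, specificity, and the diagnostic odds ratio from a single 2x2 table. The counts are hypothetical and are not taken from any included study; the 0.5 continuity correction applied when a cell is zero is a common convention assumed here, not a detail stated in the text.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Counts extracted from one diagnostic accuracy study."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives


def sensitivity(t: TwoByTwo) -> float:
    """Proportion of diseased subjects with a positive SGU result."""
    return t.tp / (t.tp + t.fn)


def specificity(t: TwoByTwo) -> float:
    """Proportion of non-diseased subjects with a negative SGU result."""
    return t.tn / (t.tn + t.fp)


def diagnostic_odds_ratio(t: TwoByTwo, correction: float = 0.5) -> float:
    """DOR = (TP * TN) / (FP * FN).

    Assumption: if any cell is zero, `correction` is added to every cell
    so that the ratio stays finite.
    """
    cells = [float(t.tp), float(t.fp), float(t.fn), float(t.tn)]
    if 0.0 in cells:
        cells = [c + correction for c in cells]
    tp, fp, fn, tn = cells
    return (tp * tn) / (fp * fn)


# Hypothetical study: 100 SS patients and 100 controls (illustrative only).
study = TwoByTwo(tp=75, fp=7, fn=25, tn=93)
print(sensitivity(study), specificity(study), diagnostic_odds_ratio(study))
```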
Quality Assessment. Two researchers (M.Z. and S.S.) assessed the quality of the studies using the QUADAS-2 (revised Quality Assessment of Diagnostic Accuracy Studies) tool. Disagreements were resolved by discussion.
Statistical Analysis. Selected studies were divided into three subgroups according to the ultrasonographic scoring system: 0-4, 0-16, and 0-48. The pooled diagnostic sensitivity, specificity, and odds ratio (DOR) were calculated for each subgroup. The heterogeneity of the pooled sensitivity, specificity, and DOR was measured by the inconsistency index (I2) and the Cochran Q test. Heterogeneity is a measure of between-study variation and was used to assess whether the studies in a meta-analysis represented a single population or several different populations. The percentage of heterogeneity among the enrolled articles was calculated as the I2 index. Small heterogeneity was defined as I2 < 25%, moderate heterogeneity as I2 of 25-50%, and obvious heterogeneity as I2 > 50%. Heterogeneity was considered statistically significant when the Cochran Q test gave P < 0.05. The random effects model was used for data analysis.
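The heterogeneity measures described above can be sketched as follows, working on the log DOR scale. This is a minimal illustration under stated assumptions, not the Meta-Disc implementation: the DerSimonian-Laird estimator is assumed as the random effects model (the text states only that a random effects model was used), a 0.5 correction is added to every cell, and the 2x2 counts are hypothetical.

```python
import math


def log_dor_and_variance(tp, fp, fn, tn, correction=0.5):
    """Log diagnostic odds ratio and its approximate variance for one study.

    Assumption: `correction` is added to every cell of every study.
    """
    a, b, c, d = (x + correction for x in (tp, fp, fn, tn))
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def cochran_q_and_i2(effects, variances):
    """Cochran Q, its degrees of freedom, and I2 (as a percentage)."""
    weights = [1 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, df, i2


def dersimonian_laird(effects, variances):
    """Random effects pooled estimate, its standard error, and tau^2."""
    weights = [1 / v for v in variances]
    q, df, _ = cochran_q_and_i2(effects, variances)
    c = sum(weights) - sum(w * w for w in weights) / sum(weights)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(w_star, effects)) / sum(w_star)
    return pooled, math.sqrt(1 / sum(w_star)), tau2


def heterogeneity_label(i2):
    """Categories as defined in the text: <25 small, 25-50 moderate, >50 obvious."""
    if i2 < 25:
        return "small"
    if i2 <= 50:
        return "moderate"
    return "obvious"


# Hypothetical (tp, fp, fn, tn) counts for three studies; illustrative only.
tables = [(75, 7, 25, 93), (60, 5, 30, 80), (88, 12, 22, 110)]
effects, variances = zip(*(log_dor_and_variance(*t) for t in tables))
q, df, i2 = cochran_q_and_i2(effects, variances)
pooled, se, tau2 = dersimonian_laird(effects, variances)
print(f"Q={q:.2f}, df={df}, I2={i2:.1f}% ({heterogeneity_label(i2)})")
print(f"pooled DOR={math.exp(pooled):.2f}, tau^2={tau2:.3f}")
```

When Q does not exceed its degrees of freedom, I2 is truncated to 0%, which is how a pooled DOR can show no measurable heterogeneity even when the individual study estimates differ.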
The risk of bias of the included studies was assessed with the QUADAS-2 tool 37. Quality assessment was performed with Review Manager software (version 5.3, The Nordic Cochrane Centre, The Cochrane Collaboration). Pooling of sensitivity, specificity, and DOR, as well as the heterogeneity tests, was performed with Meta-Disc software (version 1.4, Madrid, Spain). The summary receiver operating characteristic (SROC) curves were produced in STATA 13.0.
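For readers unfamiliar with SROC curves, the sketch below shows one simple way to construct a curve and its area from per-study 2x2 counts. It uses the Moses-Littenberg linear regression approach as a stand-in; the text does not specify which model was fitted in STATA, so this is not necessarily the method behind the reported AUCs. All counts and function names are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def expit(x):
    # Numerically stable inverse logit.
    if x >= 0:
        return 1 / (1 + math.exp(-x))
    e = math.exp(x)
    return e / (1 + e)


def fit_moses_littenberg(tables, correction=0.5):
    """Least-squares fit of D = a + b * S over studies.

    D = logit(TPR) - logit(FPR) is the log DOR and
    S = logit(TPR) + logit(FPR) is a proxy for the positivity threshold.
    `tables` is a list of (tp, fp, fn, tn); 0.5 is added to every cell.
    """
    d_vals, s_vals = [], []
    for tp, fp, fn, tn in tables:
        tp, fp, fn, tn = (x + correction for x in (tp, fp, fn, tn))
        tpr, fpr = tp / (tp + fn), fp / (fp + tn)
        d_vals.append(logit(tpr) - logit(fpr))
        s_vals.append(logit(tpr) + logit(fpr))
    n = len(tables)
    mean_s, mean_d = sum(s_vals) / n, sum(d_vals) / n
    sxx = sum((s - mean_s) ** 2 for s in s_vals)
    sxy = sum((s - mean_s) * (d - mean_d) for s, d in zip(s_vals, d_vals))
    b = sxy / sxx if sxx > 1e-12 else 0.0  # no threshold spread: flat slope
    return mean_d - b * mean_s, b


def sroc_tpr(fpr, a, b):
    """Sensitivity on the SROC curve at a given false positive rate (b != 1)."""
    return expit(a / (1 - b) + (1 + b) / (1 - b) * logit(fpr))


def sroc_auc(a, b, steps=10000):
    """Area under the SROC curve by the midpoint rule over FPR in (0, 1)."""
    h = 1 / steps
    return sum(sroc_tpr((i + 0.5) * h, a, b) for i in range(steps)) * h


# Hypothetical (tp, fp, fn, tn) counts for four studies; illustrative only.
tables = [(75, 7, 25, 93), (60, 5, 30, 80), (88, 12, 22, 110), (40, 3, 20, 57)]
a, b = fit_moses_littenberg(tables)
print(f"a={a:.3f}, b={b:.3f}, AUC={sroc_auc(a, b):.3f}")
```

Because the curve is fitted across studies with different thresholds, the AUC summarizes accuracy without requiring a common cut-off, which is why an SROC analysis is useful when cut-off values vary.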