Although ovarian tumours are common, most are not malignant (Menon et al, 2009). Correctly characterising ovarian tumours is critical, as this ensures appropriate referral of patients with cancer to specialised surgeons, which is crucial to optimise patient care and survival (Vergote et al, 2001; Earle et al, 2006; Engelen et al, 2006; Paulsen et al, 2006). Correctly recognising benign ovarian masses allows conservative management to be adopted, leading to reduced morbidity while facilitating fertility preservation (Carley et al, 2002; Tinelli et al, 2006).

The most accurate way to characterise adnexal pathology is subjective assessment of ultrasound findings by experienced examiners (Timmerman et al, 1999; Timmerman, 2004; Valentin et al, 2009). However, training and experience in performing transvaginal ultrasonography vary. To mirror the test performance of experienced examiners, several ultrasound-based prediction models have been developed to help operators accurately discriminate between benign and malignant adnexal masses (Jacobs et al, 1990; Timmerman et al, 2010a; Van Holsbeke et al, 2012). The Risk of Malignancy Index (RMI) includes serum CA125 levels, menopausal status and ultrasound findings (Jacobs et al, 1990). The International Ovarian Tumour Analysis (IOTA) group developed and validated a logistic regression model (LR2) with five ultrasound parameters, which has shown excellent discrimination between benign and malignant masses (Timmerman et al, 2005; Timmerman et al, 2010b; Van Holsbeke et al, 2012). Furthermore, the IOTA group has described simple rules based on five ultrasound features indicating malignancy (M-features) and five features suggesting a benign lesion (B-features) (Timmerman et al, 2008). These rules have shown good performance on temporal and external validation (Timmerman et al, 2010a). A criticism of these prediction models is that they were developed and validated by experts in characterising adnexal pathology (Timmerman, 2004; Timmerman et al, 2005; Timmerman et al, 2008; Timmerman et al, 2010a; Timmerman et al, 2010b; Van Holsbeke et al, 2012). Accordingly, we do not know whether these models maintain their performance in the hands of operators with different training backgrounds and experience levels.

The primary aim of this study was to examine the performance of the IOTA LR2 model, ultrasound-based Simple Rules (SR), RMI and subjective assessment (SA) by the examiner for the preoperative characterisation of ovarian masses, when ultrasonography is performed by examiners with a range of training backgrounds and experience. We aimed to validate the performance of these approaches to the diagnosis of adnexal pathology in everyday ‘real world’ clinical practice.

Materials and methods

Study design and setting

This was a prospective multicentre cross-sectional cohort study (IOTA Phase 4B). The patients were recruited from three hospitals: two tertiary referral centres for gynaecological oncology (Queen Charlotte’s and Chelsea Hospital, London (QCCH); Princess Anne Hospital, Southampton (PAH)) and one urban acute hospital partnered to Imperial College (West Middlesex University Hospital, London (WMUH)). The study was approved as an assessment of ‘service improvement’ by the local Joint Research Office at Imperial College Academic Health Science Center and the Research and Development Department at Southampton University Hospitals. Accordingly, no formal ethical approval was required. The guidelines of the STARD (Standards for Reporting of Diagnostic Accuracy) initiative were used (Bossuyt et al, 2003).

Patients were recruited consecutively from September 2010 to September 2012 at QCCH, February 2012 to September 2012 at WMUH, and May 2012 to September 2012 at PAH. All ultrasound examiners attended a half-day theoretical induction session where the ultrasound features of the rules and models used in the study were illustrated. None of the examiners were considered specialist ‘experts’ (level III) in performing ultrasound examinations of the ovary (EFSUMB, 2006; RCR, 2012).

Patient population and data collection

Patients presenting with at least one adnexal mass who underwent transvaginal ultrasonography at one of the participating centres were eligible for inclusion. In the event of bilateral adnexal masses, the mass with the most complex ultrasound morphology was included (Timmerman et al, 2000, 2010b). If both masses had similar ultrasound morphology, the largest mass, or the one most easily accessible by ultrasonography, was included (Timmerman et al, 2010b).

The exclusion criteria were (i) pregnancy, (ii) patients examined by a consultant with a special interest in gynaecological ultrasound, (iii) refusal of transvaginal ultrasonography, (iv) cytology rather than histology as an outcome, and (v) failure to undergo surgery within 120 days of the ultrasound examination.

At QCCH, a secure electronic data-collection system was developed for the study (Astraia Software, Munich, Germany). A unique identifier was generated automatically for each patient’s record. Dedicated data collection forms were used for WMUH and PAH. Data security was ensured following the NHS Caldicott report guidelines (The Caldicotte Committee, 1997). Recorded clinical variables included age, current pregnancy (yes, no), and menopausal status. Women aged ≥50 years who had undergone hysterectomy were defined as postmenopausal.

Transvaginal ultrasonography was performed in the standardised manner previously published by the IOTA collaboration (Timmerman et al, 2000; Timmerman et al, 2010b). Transabdominal ultrasonography was performed if a large mass could not be fully assessed transvaginally (Timmerman et al, 2010b). Subjective assessment of the ultrasound findings was used to classify the masses as malignant or benign. Borderline tumours were considered malignant. RMI, LR2 and SR were applied centrally and checked by statisticians at the end of the study.

Operator experience was quantified by four variables, using the operator’s first patient recruitment date as the reference point for time: number of years of gynaecological scanning, number of gynaecology scans performed, number of ovarian masses examined, and background training (sonographer or medical doctor (MD)).

Prediction models

The logistic regression model LR2 uses six variables: (1) patient age (years); (2) presence of ascites (yes=1, no=0); (3) presence of blood flow within a papillary projection (yes=1, no=0); (4) maximal diameter of the solid component (expressed in mm and truncated at 50 mm); (5) irregular internal cyst walls (yes=1, no=0); and (6) presence of acoustic shadows (yes=1, no=0). The logistic regression model LR2 estimates the probability of malignancy for an adnexal tumour as 1/(1+exp(−z)), where z=−5.3718+0.0354(1)+1.6159(2)+1.1768(3)+0.0697(4)+0.9586(5)−2.9486(6). A probability cutoff of 0.1 (10%) was used to classify patients as benign or malignant based on LR2 (Timmerman et al, 2005; Timmerman et al, 2010b; Van Holsbeke et al, 2012).
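
The LR2 calculation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the software used in the study: the function and argument names are hypothetical, and treating a probability exactly equal to the cutoff as malignant is an assumption, since the text does not state how ties are handled.

```python
import math

# Coefficients for z exactly as given in the text (intercept first).
INTERCEPT = -5.3718
COEF_AGE = 0.0354              # (1) patient age in years
COEF_ASCITES = 1.6159          # (2) ascites, yes=1 / no=0
COEF_PAPILLARY_FLOW = 1.1768   # (3) blood flow in a papillary projection, yes=1 / no=0
COEF_SOLID_MM = 0.0697         # (4) maximal diameter of solid component, mm, truncated at 50
COEF_IRREGULAR_WALLS = 0.9586  # (5) irregular internal cyst walls, yes=1 / no=0
COEF_SHADOWS = -2.9486         # (6) acoustic shadows, yes=1 / no=0


def lr2_probability(age_years, ascites, papillary_flow,
                    max_solid_mm, irregular_walls, acoustic_shadows):
    """Return the LR2 estimated probability of malignancy, 1/(1+exp(-z))."""
    solid = min(max_solid_mm, 50.0)  # truncate the solid component at 50 mm
    z = (INTERCEPT
         + COEF_AGE * age_years
         + COEF_ASCITES * ascites
         + COEF_PAPILLARY_FLOW * papillary_flow
         + COEF_SOLID_MM * solid
         + COEF_IRREGULAR_WALLS * irregular_walls
         + COEF_SHADOWS * acoustic_shadows)
    return 1.0 / (1.0 + math.exp(-z))


def lr2_classify(probability, cutoff=0.1):
    """Classify using the 10% cutoff; ties counted as malignant (assumption)."""
    return "malignant" if probability >= cutoff else "benign"


# Hypothetical patient: 60 years old, ascites present, 60 mm solid component.
p = lr2_probability(60, 1, 0, 60, 0, 0)
print(round(p, 3), lr2_classify(p))
```

For a hypothetical 50-year-old woman with none of the five ultrasound features, the same functions give a probability below the 10% cutoff, so the mass would be classified as benign.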

The SR are based on five ultrasound features of malignancy (M-features) and five ultrasound features suggestive of a benign lesion (B-features) (Timmerman et al, 2008; Timmerman et al, 2010a). An ovarian mass is classified as malignant if at least one M-feature and no B-features are present and vice versa (Timmerman et al, 2008; Timmerman et al, 2010a). When no B- or M-features are present or if both B- and M-features are present, then SR are considered inconclusive (uncertain) and a different diagnostic method should be used (Timmerman et al, 2008; Timmerman et al, 2010a). For SR, two approaches were used: one where all inconclusive cases were classified as malignant to limit the number of missed cancers (SR+MA), and another where inconclusive cases were classified as benign or malignant using SA by the examiner (SR+SA).
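
The decision logic of SR and the two strategies for inconclusive cases can be summarised as follows. This is a minimal sketch under the assumption that the examiner has already counted how many M-features and B-features are present; the function names are hypothetical and the individual features are not modelled.

```python
def simple_rules(n_m_features, n_b_features):
    """Apply the Simple Rules to counts of M- and B-features."""
    if n_m_features > 0 and n_b_features == 0:
        return "malignant"
    if n_b_features > 0 and n_m_features == 0:
        return "benign"
    # No features at all, or both kinds present.
    return "inconclusive"


def sr_plus_ma(n_m_features, n_b_features):
    """SR+MA: inconclusive cases are classified as malignant."""
    result = simple_rules(n_m_features, n_b_features)
    return "malignant" if result == "inconclusive" else result


def sr_plus_sa(n_m_features, n_b_features, subjective_assessment):
    """SR+SA: inconclusive cases fall back on the examiner's SA
    ('benign' or 'malignant')."""
    result = simple_rules(n_m_features, n_b_features)
    return subjective_assessment if result == "inconclusive" else result


# A mass with one M-feature and one B-feature is inconclusive under SR.
print(simple_rules(1, 1))          # inconclusive
print(sr_plus_ma(1, 1))            # malignant
print(sr_plus_sa(1, 1, "benign"))  # benign
```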

Measurements of serum CA125 were carried out according to each centre’s normal practice, using the Abbott Architect CA125 II immunoassay kit (Abbott Park, IL, USA) at QCCH, the Advia Centaur XP Immunoassay System (Siemens Healthcare Diagnostics Inc., Deerfield, IL, USA) at WMUH and the UniCel DxI Immunoassay System (Beckman Coulter Inc., Brea, CA, USA) at PAH.

For the RMI, five features were incorporated into the ultrasound score (U): multilocularity, solid areas, bilateral masses, ascites and evidence of metastases. U was assigned a value of 0 when none of these features was present, 1 if one feature was present and 3 if two or more features were present. A score (M) of 1 was assigned to premenopausal women and 3 to postmenopausal women. The RMI was defined as U × M × serum CA125 (U ml−1). An RMI score of 200 was used as the cutoff value to indicate cancer (Jacobs et al, 1990).
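
The RMI arithmetic can be illustrated with a short Python sketch. The function names are hypothetical, and counting a score exactly equal to 200 as malignant is an assumption, because the text gives the cutoff value but not how a tie is handled.

```python
def ultrasound_score(n_features):
    """U: 0 for no features, 1 for one feature, 3 for two or more."""
    if n_features == 0:
        return 0
    if n_features == 1:
        return 1
    return 3


def rmi(n_features, postmenopausal, ca125_u_per_ml):
    """RMI = U x M x serum CA125 (U/ml)."""
    u = ultrasound_score(n_features)
    m = 3 if postmenopausal else 1
    return u * m * ca125_u_per_ml


def rmi_classify(rmi_value, cutoff=200):
    """Classify using the cutoff of 200; ties counted as malignant (assumption)."""
    return "malignant" if rmi_value >= cutoff else "benign"


# Hypothetical postmenopausal woman, two ultrasound features, CA125 of 50 U/ml.
value = rmi(2, True, 50)
print(value, rmi_classify(value))  # 450 malignant
```

Because the index is a product, a mass with none of the five ultrasound features has U = 0 and therefore an RMI of zero whatever the CA125 level.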

Reference standard

The final outcome was the surgical findings and histological diagnosis of removed tissues, and the classification of these as benign or malignant. Borderline tumours were classified as malignant tumours. Surgery was performed by laparoscopy or laparotomy, according to the surgeon’s judgment. Excised tissues underwent histological examination at the local Department of Pathology. Tumours were classified using the criteria recommended by the International Federation of Gynecology and Obstetrics (Heintz et al, 2006).

Statistical analysis

For LR2 and RMI, receiver-operating characteristic curves were derived and summarised using the area under the curve (AUC) with 95% confidence interval (CI) using the logit transform method (Pepe, 2003). Since 70% of the patients were recruited at one hospital, we report AUCs computed on the whole sample instead of performing a random effects meta-analysis of hospital-specific AUCs, the results of which were nearly identical. For SR, the classification has essentially three ordinal levels: benign, inconclusive (uncertain), or malignant. A receiver-operating characteristic curve for this classification was computed, which has two points (one for benign vs inconclusive/malignant, and one for benign/inconclusive vs malignant). This was done to allow a visual comparison of the performance of SR with RMI and LR2. However, an AUC was not derived because this would not be comparable to AUCs of models that give continuous results. In addition, we explored whether the performance of LR2 and SR differed from the performance of RMI. We examined the performance in pre- and postmenopausal women separately. For differences in AUC, the method of DeLong et al (1988) was used to generate the 95% CI.
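
As background to the AUC summaries, the sketch below shows one standard way to compute an empirical AUC: the proportion of malignant-benign pairs in which the malignant case receives the higher score, with ties counted as one half. It is not the analysis code used in the study (which was run in SAS), it does not implement the logit-transform CI or the DeLong method, and the function name and example scores are hypothetical.

```python
def empirical_auc(scores, labels):
    """Empirical AUC via pairwise comparison (labels: 1=malignant, 0=benign)."""
    malignant = [s for s, y in zip(scores, labels) if y == 1]
    benign = [s for s, y in zip(scores, labels) if y == 0]
    if not malignant or not benign:
        raise ValueError("need at least one malignant and one benign case")
    wins = 0.0
    for m in malignant:
        for b in benign:
            if m > b:
                wins += 1.0   # malignant case ranked above benign case
            elif m == b:
                wins += 0.5   # tie counts as half
    return wins / (len(malignant) * len(benign))


# Hypothetical model outputs for two benign and two malignant masses.
print(empirical_auc([0.10, 0.40, 0.35, 0.80], [0, 0, 1, 1]))  # 0.75
```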

Diagnostic performance measures were computed for the classification as benign or malignant based on RMI, LR2, SR and SA. Reported diagnostic performance measures were sensitivity, specificity, positive and negative likelihood ratios (LR+ and LR−), and the diagnostic odds ratio.
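
The reported measures all follow from a 2×2 table of test result against the reference standard. The sketch below uses the standard definitions; the function name and the counts are hypothetical and are not taken from the study data, and no correction is applied for zero cells, which would cause a division by zero.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Compute standard accuracy measures from a 2x2 table.

    tp: malignant, test positive   fn: malignant, test negative
    fp: benign, test positive      tn: benign, test negative
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    lr_pos = sensitivity / (1 - specificity)   # LR+
    lr_neg = (1 - sensitivity) / specificity   # LR-
    dor = lr_pos / lr_neg                      # diagnostic odds ratio
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
        "dor": dor,
    }


# Hypothetical counts for illustration only.
for name, value in diagnostic_measures(tp=90, fp=20, fn=10, tn=80).items():
    print(f"{name}: {value:.3f}")
```

The diagnostic odds ratio computed this way equals (tp × tn)/(fp × fn), so it combines sensitivity and specificity into a single number.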

Missing CA125 levels (n=19) were statistically imputed using predictive mean matching regression. Owing to heavy skewness, the double log of CA125 was predicted using the variables used in the prediction models, tumour pathology groups (Van Calster et al, 2011) and hospital (QCCH, WMUH, PAH). In all 19 of these cases, the RMI value was zero irrespective of the CA125 level, as the ultrasound score was zero.

We conducted an exploratory analysis of the influence of experience on the performance of subjective impression and the prediction models. A regression model for accuracy of subjective impression or a model was fitted using the number of ovarian mass scans (7 ordinal categories; <100, 100–200, 200–500, 500–1000, 1000–2000, 2000–5000 and 5000–10 000), background training (sonographer or MD), and tumour outcome (benign or malignant) as predictors. Outcome was added to adjust the effects of the predictors. A mixed effects model was used to account for the clustering of patients within operators. All analyses were performed using SAS 9.3 (SAS Institute, Cary, NC, USA).

Results

During the study period, 962 women with an adnexal mass underwent ultrasonography and 282 of these patients were managed surgically. Twenty-seven cases were excluded: five because of pregnancy, ten were examined by a senior consultant (level III scan), five had cytology rather than histology as a final outcome, one died before surgery, five patients declined surgery and one patient had surgery >120 days after the index ultrasound scan (Figure 1). Five cases were included where histology was not available. In two of these, ovarian torsion was confirmed at laparoscopy and the ovary was de-torted; the ovaries were normal in size and morphology on two follow-up ultrasound scans 3 and 6 months after the procedure. In the other three cases, an abscess was diagnosed surgically and confirmed by microscopy and culture. The mean age of the patients was 46 years (95% CI: 34–57). One hundred and sixty-five patients (65%) were premenopausal. The prevalence of malignancy was 29% (74 malignancies vs 181 benign ovarian tumours). The 74 malignancies included 49 primary invasive epithelial ovarian cancers, 18 borderline ovarian tumours, and 7 metastatic tumours (Table 1).

Figure 1 A flow chart illustrating the final sample size and the numbers of excluded cases.

Table 1 Different histological outcomes of ovarian lesions in the study

For the whole study population, the diagnostic odds ratios for LR2, RMI, SR+SA, SR+MA and SA were 62 (95% CI: 27–142), 43 (95% CI: 19–97), 109 (95% CI: 44–274), 66 (95% CI: 27–158) and 70 (95% CI: 30–163), respectively (Table 2). Overall, our data suggested a significantly higher AUC for LR2 compared with RMI: 0.94 and 0.90, respectively, with an LR2−RMI difference of 0.04 (95% CI: 0.01–0.07) (Table 2 and Figure 2). The difference in AUC between LR2 and RMI was greatest in premenopausal women (AUCs of 0.92 and 0.83 for LR2 and RMI, respectively, with a difference of 0.09, 95% CI: 0.03–0.15), but little difference was observed in postmenopausal patients (0.90 and 0.92, respectively, difference −0.02, 95% CI: −0.08 to 0.04; Table 2, Supplementary Figures S1 and S2). The AUCs for discrimination between benign and borderline tumours were 0.86 (95% CI: 0.75–0.97) for LR2 and 0.77 (95% CI: 0.64–0.89) for RMI. The AUCs for discrimination between benign tumours and stage I invasive cancers were 0.94 (95% CI: 0.88–1.00) for LR2 and 0.91 (95% CI: 0.84–0.99) for RMI (Supplementary Table A).

Table 2 Sensitivity, specificity, LR+, LR−, DOR and AUC for diagnostic models in the whole sample, premenopausal group, and postmenopausal group

Figure 2 Receiver-operating characteristic (ROC) plot for all masses. Abbreviations: LR2=Logistic Regression model 2; RMI=Risk of Malignancy Index; SA=subjective assessment; SR=Simple Rules, which have three levels (benign, inconclusive and malignant) and are represented by a ROC curve with two points; SR+SA=SR with SA by the examiner used when SR were inconclusive.

The SR were able to classify 83.9% (n=214) of the masses as benign or malignant. Of the 41 tumours where the IOTA SR were uncertain, 20 were benign and 21 malignant. When SR were able to characterise the ovarian mass, sensitivity was 87% (95% CI: 75–93%) and specificity was 98% (95% CI: 95–99%).

A strategy classifying all SR inconclusive tumours as malignant (SR+MA) yielded a significantly higher sensitivity (91%) than using the RMI (72%) (difference in sensitivity 0.19, 95% CI: 0.07–0.31). However, the specificity of this strategy was lower (87% vs 94% for SR+MA and RMI, respectively) (difference −0.07, 95% CI: −0.13 to 0.01). When examiners used their own SA as a second-stage test when SR were inconclusive, sensitivity was significantly higher than for RMI: 86% and 72%, respectively (difference 0.15, 95% CI: 0.02–0.27) with no difference in specificity.

In all, 62.9% of the operators had performed <1000 ultrasound scans (Table 3); 24% of the operators were MDs, whereas 76% were sonographers. The exploratory analysis of the influence of operator experience and training on diagnostic performance suggested that MDs were more able to subjectively assess the correct diagnosis than sonographers (odds ratio 2.59, 95% CI: 0.77–8.74) (Figure 3). When using SR+MA and LR2 to classify masses as benign or malignant, the odds ratios were 1.10 (95% CI: 0.32–3.81) and 0.68 (95% CI: 0.18–2.61), respectively, suggesting similar performance of these models in the hands of MDs and sonographers. When using the RMI, the odds ratio was 0.32 (95% CI: 0.06–1.70), suggesting slightly better performance for sonographers. The number of previous ovarian mass scans had little effect, with odds ratios between 0.85 and 1.01 for each category increase on the ordinal measurement scale. Adding hospital as a fixed effect in the mixed effects model had no influence on the final results (Supplementary Table B).

Table 3 The number of ovarian mass scans performed by operators

Figure 3 Plot of odds ratios (OR) with 95% CI of MD vs sonographer for each of the models. Dot: OR; line segment: 95% CI; dashed line: OR of 1 (no accuracy difference between sonographer and MD). Abbreviations: OR=odds ratio; SA=subjective assessment; SR+MA=Simple Rules and malignancy assumption when simple rules are not applicable; LR2=Logistic Regression model 2; RMI=Risk of Malignancy Index.

Discussion

We have shown that the IOTA LR2 model and SR perform well in the hands of examiners with different background training or relatively little experience using ultrasonography. Criticism of papers describing the external validation of IOTA and other models has focused on the fact that they were developed and tested by examiners with a specific expertise in imaging of adnexal pathology (Timmerman et al, 2005; Timmerman et al, 2008; Timmerman et al, 2010a; Timmerman et al, 2010b; Van Holsbeke et al, 2012; Kaijser et al, 2013). In contrast, in the current study the ultrasound scans were performed by examiners with different training (sonographers and doctors) and level II experience. Our findings agree with the IOTA group external validation for LR2, where the AUC for LR2 was 0.94 compared with 0.90 for RMI for the whole study population (Van Holsbeke et al, 2012). Despite sample size limitations when stratifying for menopausal status, our results were similar to the IOTA external validation study, with LR2 offering a clear diagnostic advantage over RMI for premenopausal patients, although not in the postmenopausal group.

To our knowledge, this study represents the first external validation of the IOTA LR2 and SR by examiners with a range of experience and training; furthermore, the patients were seen in different centres. As most ovarian pathology is probably examined by sonographers or doctors who do not have a special interest in gynaecological ultrasonography (level II), it seems reasonable to suggest that our findings offer clinicians a clearer idea of the performance of the different adnexal mass risk models in daily practice. Nunes et al (2012) externally validated the IOTA LR2 model on 124 women examined by a single relatively inexperienced gynaecologist (level II). They reported an AUC of 0.93 for LR2 but did not compare RMI, LR2 and SR, nor did they stratify the AUCs according to menopausal status.

A strength of our study is that it adhered to a strict prospective protocol, took place in three units and drew on a relatively large number of examiners. A weakness common to other studies is the difficulty encountered in classifying operator experience. Similarly, when the Royal College of Radiologists in the United Kingdom published recommendations for ultrasound training for medical and surgical specialties, it found it difficult to define boundaries between the three levels of ultrasound scanning experience proposed (RCR, 2012). In our study, 67 (25.97%) patients were examined by sonographers, who are not considered in the Royal College of Radiologists recommendations (RCR, 2012). Interestingly, we found that subjective impression of the nature of an adnexal mass tended to be more accurate among medically trained examiners than among sonographers. However, this difference was not seen when doctors and sonographers were asked to enter ultrasound findings into the prediction models LR2 and RMI or when they used SR (Figure 3). This is likely to reflect variations in training, as sonographers are in general taught to identify and report the structures they see. Hence, they are skilled at accurately entering the presence or absence of the structures required for use in prediction models but are less likely to offer an opinion on the final diagnosis. This is an important observation, as the original aim of the IOTA study was to develop tools that could be used by all examiners to enhance their diagnostic performance.

In our study, there was a variation in the CA125 kits used in each centre. This slight variation was previously assessed and found to have very limited impact on the variation in diagnostic accuracy of these kits (Davelaar et al, 1998). Moreover, it has been suggested that the use of different CA125 assay kits reflects ‘real world’ clinical practice and will produce more generally applicable results (Van Calster et al, 2011).

An advantage of using LR2 is that it provides clinicians with the absolute risk of a patient having ovarian cancer, which may contribute to patient counselling and shared decision making. In clinical practice, calculating LR2 may seem more difficult than applying SR. To facilitate its use, the LR2 formula can easily be made available online and incorporated into mobile applications or computer software (Van Belle et al, 2012). Our data show that overall diagnostic performance is better with LR2 than with RMI, and also suggest that LR2 misses fewer borderline tumours (AUC of 0.86 for LR2 vs 0.77 for RMI) and stage I invasive ovarian cancers (AUC of 0.94 for LR2 vs 0.91 for RMI).

In our study, SR could be applied to 83.9% of the study population compared with 77% in the original IOTA external validation (Timmerman et al, 2010a). The sensitivity and specificity for SR in the hands of the examiners in our study were 87% and 98%, compared with 92% and 96%, respectively, in the original IOTA study (Timmerman et al, 2010a). The utility of SR is supported by Fathallah et al (2011), who conducted a single-centre external validation study on 122 ovarian tumours over 4 years. They found SR were applicable in 89.3% of the study population, with a sensitivity of 73% and specificity of 97%. However, they did not evaluate different strategies for second-stage tests in the event of SR being inconclusive (Fathallah et al, 2011). Ideally, when the SR are inconclusive, the patient should be referred to an expert in gynaecological scanning (level III) for further assessment as an optimal second-stage test. In the absence of level III ultrasonography, our data suggest that if SR are inconclusive, an acceptable second-stage test for level II doctor ultrasound examiners is the subjective impression of the scan findings. For sonographers, however, a reasonable strategy would be to classify all such lesions as malignant. When SR are inconclusive, another alternative to be considered for these more difficult masses, especially when an experienced level III ultrasound examiner is not available, is to refer the patient for MRI (Bernardin et al, 2012). However, further studies are needed before adopting this as a protocol.

Correctly classifying the nature of ovarian pathology is a common diagnostic problem in gynaecology, and correctly identifying the presence of cancer in these cases is the key to ensuring that patients access appropriate treatment. This study shows that the IOTA LR2 model and SR perform well in the hands of both relatively inexperienced doctors and sonographers. Furthermore, although not the primary aim of this study, our data suggest the performance of both LR2 and SR may be better than that of the RMI. These findings suggest that LR2 or SR may replace the RMI in protocols designed to evaluate suspected adnexal pathology, particularly when dealing with premenopausal women.