Introduction

Ovarian cancer is the eighth most common malignancy in women worldwide1. It is notoriously difficult to detect and diagnose, with ineffective screening2 and non-specific symptoms similar to those caused by menopause3. Encompassing primary malignant tumours of the ovaries, fallopian tubes, and peritoneum, the disease has often started to spread within the abdomen at the time of diagnosis (FIGO4 Stage 3). This typical late stage at diagnosis makes ovarian cancer a particularly deadly disease, with the 314,000 new cases diagnosed each year translating to 207,000 deaths per year globally1.

Most ovarian cancers are carcinomas (cancers of epithelial origin) which predominantly fall into five histological subtypes: high-grade serous, low-grade serous, clear cell, endometrioid, and mucinous. Non-epithelial ovarian cancers are much less common and include germ cell, sex cord-stromal, and mesenchymal tumours. Ovarian cancer subtypes differ morphologically and prognostically and have varying treatment options5. High-grade serous carcinoma is the most common form of ovarian cancer, accounting for approximately 70% of all cases6.

Histopathology, the examination of tissue specimens at the cellular level, is the gold standard for ovarian cancer diagnosis. Pathologists typically interpret tissue stained with haematoxylin and eosin (H&E), though interpretation can be a subjective, time-consuming process, with some tasks having a high level of inter-observer variation7,8,9. In the assessment of difficult cases, general pathologists may seek assistance from subspecialty gynaecological pathology experts, and/or use ancillary tests, such as immunohistochemistry (IHC). Referrals and ancillary testing can be essential to the accuracy of the diagnostic process but come at the cost of making it longer and more expensive. Worldwide, pathologists are in much greater demand than supply, with significant disparities in the number of pathologists between countries10, and with better-supplied countries still unable to meet demand11.

Traditionally, pathologists have analysed glass slides using a light microscope. However, the implementation of a digital workflow, where pathologists review scanned whole slide images (WSIs) using a computer, is becoming more common. While digital pathology uptake has likely been driven by efficiency benefits12, it has created an opportunity for the development of automated tools to assist pathologists. These tools often aim to improve the accuracy, efficiency, objectivity, and consistency of diagnosis. Such tools could help to alleviate the global workforce shortage of pathologists, increasing diagnostic throughput and reducing the demand for referrals and ancillary tests. This is an increasingly active area of research13 and, for some malignancies, these systems are starting to achieve clinical utility14.

In this study, we systematically reviewed all literature in which artificial intelligence (AI) techniques (comprising both traditional machine learning (ML) and deep learning methods) were applied to digital pathology images for the diagnosis or prognosis of ovarian cancer. This included research that focused on a single diagnostic factor such as histological subtype and studies that performed computer-aided diagnostic tasks such as tumour segmentation. The review characterises the state of the field, describing which diagnostic and prognostic tasks have been addressed, and assessing factors relevant to the clinical utility of these methods, such as the risks of bias. Despite ovarian cancer being a particularly difficult disease to detect and diagnose, and the shortage of available pathologists, AI models have not yet been implemented in clinical practice for this disease. This review aims to provide insights and recommendations based on published literature to improve the clinical utility of future research, including reducing risks of bias, improving reproducibility, and increasing generalisability.

Results

As shown in Fig. 1, the literature searches returned a total of 1573 records, of which 557 were duplicates. Nine hundred and thirty records were excluded during the screening of titles and abstracts, and 41 were excluded based on full paper screening, including 3 records for which full articles could not be obtained. The remaining 45 studies were included in the review, of which 11 were conference papers and 34 were journal papers. All accepted studies were originally identified through searches of research databases, with no records from trial registries meeting the inclusion criteria. While the searches returned literature from as early as 1949, all of the research which met the inclusion criteria was published since 2010, with over 70% of the included literature published since 2020. Study characteristics are shown in Table 1. The 45 accepted articles contained 80 models of interest, details of which are shown in Table 2.

Fig. 1: PRISMA 2020 flowchart.
figure 1

PRISMA 2020 flowchart of the study identification and selection process for the systematic review. Records were screened on titles and abstracts alone, and reports were assessed based on the full-text content. CENTRAL Cochrane Central Register of Controlled Trials. WHO-ICTRP World Health Organisation International Clinical Trial Registry Platform.

Table 1 Characteristics of the 45 studies included in this systematic review.
Table 2 Characteristics of the 80 models of interest from the 45 papers included in this systematic review, grouped by model outcome.

Risk of bias assessment

The results of the PROBAST assessments are shown in Table 3. While some studies contained multiple models of interest, no study contained models with differing risk of bias scores in any domain of the PROBAST assessment, so one risk of bias analysis is presented per paper. Every paper showed either a high overall risk of bias (37/45) or an unclear overall risk of bias (8/45). Every high-risk paper had a high-risk score in the analysis domain (37/45), with several also being at high risk in the participants (6/45), predictors (11/45), or outcomes (13/45) domains. Less than half of the papers achieved a low risk of bias in any domain (21/45), with most low risks found in the outcomes (16/45) and predictors (9/45) domains. Nearly all of the papers had an unclear risk of bias in at least one domain, most commonly participants (36/45) and predictors (25/45). Qualitative summaries are presented in Fig. 2.

Fig. 2: PROBAST risk of bias results.
figure 2

PROBAST risk of bias results summarised for the 45 papers included in this review.

Table 3 PROBAST risk of bias assessment results for the 45 papers included in this review.

Data synthesis results

Data in included literature

The number of participants in internal datasets varied by orders of magnitude, with each study including 1–776 ovarian cancer patients, and one study including over 10,000 total patients across a range of 32 malignancies15. Most research only used data from the five most common subtypes of ovarian carcinoma, though one recent study included the use of sex cord-stromal tumours16. Only one study explicitly included any prospective data collection, and this was only for a small subset which was not used for external validation17.

As shown in Fig. 3, the number of pathology slides used was often much greater than the number of patients included, with three studies using over 1000 slides from ovarian cancer patients18,19,20. In most of the studies, model development samples were WSIs containing resected or biopsied tissue (34/45), with others using individual tissue microarray (TMA) core images (5/45) or pre-cropped digital pathology images (3/45). Most studies used H&E-stained tissue (33/45) and others used a variety of IHC stains (11/45), with no two papers reporting the use of the same IHC stains. Some studies included multi-modal approaches, using genomics17,21,22,23,24, proteomics21,24, transcriptomics24, and radiomics17 data alongside histopathological data.

Fig. 3: Number of patients and slides per model.
figure 3

Histograms showing the number of a ovarian cancer patients and b ovarian cancer histopathology slides used in model development. Many of these values are uncertain due to incomplete reporting, as reflected in Table 2.

The most commonly used data source was The Cancer Genome Atlas (TCGA) (18/45), a project from which over 30,000 digital pathology images from 33 malignancies are publicly available. The ovarian cancer subset, TCGA-OV25, contains 1481 WSIs from 590 cases of ovarian serous carcinoma (mostly, but not exclusively, high-grade), with corresponding genomic, transcriptomic, and clinical data. This includes slides from eight data centres in the United States, with most slides containing frozen tissue sections (1374/1481) rather than formalin-fixed, paraffin-embedded (FFPE) sections. Other recurring data sources were the University of British Columbia Ovarian Cancer Research Program (OVCARE) repository26,27,28, the Transcanadian study29,30, and clinical records at the Mayo Clinic31,32, Tri-Service General Hospital33,34,35, and Memorial Sloan Kettering Cancer Center17,36. All other researchers either used a unique data source (12/45) or did not report the provenance of their data (8/45). TCGA-OV, OVCARE, and the Transcanadian study are all multi-centre datasets. Aside from these, few studies reported the use of multi-centre data17,24,28,37,38,39. Only two studies reported the use of multiple slide scanners, with every slide scanned on one of two available scanners27,28. The countries from which data were sourced included Canada, China, Finland, France, Germany, Italy, Japan, the Netherlands, South Korea, Taiwan, the United Kingdom, and the United States of America.

Methods in included literature

There was a total of 80 models of interest in the 45 included papers, with each paper containing 1–6 such models. There were 37 diagnostic models, 22 prognostic models, and 21 other models predicting diagnostically relevant information. Diagnostic model outcomes included the classification of malignancy status (10/37), histological subtype (7/37), primary cancer type (5/37), genetic mutation status (4/37), tumour-stroma reaction level (3/37), grade (2/37), transcriptomic subtype (2/37), stage (1/37), microsatellite instability status (1/37), epithelial-mesenchymal transition status (1/37), and homologous recombination deficiency status (1/37). Prognostic models included the prediction of treatment response (11/22), overall survival (6/22), progression-free survival (3/22), and recurrence (2/22). The other models performed tasks that could be used to assist pathologists in analysing pathology images, including measuring the quantity/intensity of staining, generating segmentation masks, and classifying tissue/cell types.

A variety of models were used, with the most common types being convolutional neural network (CNN) (41/80), support vector machine (SVM) (10/80), and random forest (6/80). CNN architectures included GoogLeNet40, VGG1619,32, VGG1926,28, InceptionV333,34,35,38, ResNet1817,27,28,39,41,42, ResNet3443, ResNet5016,44,45, ResNet18236, and MaskRCNN32. Novel CNNs typically used multiple standardised blocks involving convolutional, normalisation, activation, and/or pooling layers22,46,47, with two studies also including attention modules20,35. One study generated their novel architecture by using a topology optimisation approach on a standard VGG1623.

Most researchers split their original images into patches to be separately processed, with patch sizes ranging from 60×60 to 2048×2048 pixels, the most common being 512×512 pixels (19/56) and 256×256 pixels (12/56). A range of feature extraction techniques were employed, including both hand-crafted/pre-defined features (23/80) and features that were automatically learned by the model (51/80). Hand-crafted features included a wide range of textural, chromatic, and cellular and nuclear morphological features, and were commonly used as inputs to classical ML methods, such as SVM and random forest models. Learned features were typically extracted using a CNN, which was often also used for classification.
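As an illustration of this patch-based processing, the following is a minimal sketch (not drawn from any particular included study) of tiling an RGB image array into fixed-size patches and discarding mostly-background tiles using a simple saturation threshold; the patch size, threshold, and minimum tissue fraction are arbitrary illustrative choices.

```python
import numpy as np

def tile_image(image, patch_size=512, sat_threshold=0.07, min_tissue_fraction=0.25):
    """Split an RGB image array (H, W, 3) into non-overlapping patches,
    keeping only patches with sufficient tissue (non-background) content."""
    patches, coords = [], []
    h, w, _ = image.shape
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            patch = image[y:y + patch_size, x:x + patch_size]
            # Crude tissue detection: background pixels on H&E slides are
            # near-white, i.e. max(RGB) - min(RGB) (a saturation proxy) is small.
            rgb = patch.astype(np.float32) / 255.0
            saturation = rgb.max(axis=2) - rgb.min(axis=2)
            if (saturation > sat_threshold).mean() >= min_tissue_fraction:
                patches.append(patch)
                coords.append((x, y))
    return patches, coords

# Illustrative usage on a random image in place of a real slide region.
patches, coords = tile_image(np.random.randint(0, 256, (2048, 2048, 3), dtype=np.uint8))
```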

Despite the common use of patches, most models made predictions at the WSI level (29/80), TMA core level (18/80), or patient level (6/80), requiring aggregation of patch-level information. Two distinct aggregation approaches were used: aggregating before modelling and aggregating after modelling. The former requires the generation of slide-level features before modelling; the latter requires the aggregation of patch-level model outputs to make slide-level predictions. Slide-level features were generated using summation16, averaging21,24,36, attention-based weighted averaging20,41,42,44,45, concatenation15,30, as well as more complex embedding approaches using Fisher vector encoding29 and k-means clustering48. Patch-level model outputs were aggregated to generate slide-level predictions by taking the maximum22,35, median43, or average23, using voting strategies27,34, or using a random forest classifier28. These approaches are all examples of multiple instance learning (MIL), though few models of interest were reported using this terminology22,41,42,44.
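The two aggregation strategies can be sketched as follows. This is an illustrative outline only, not a reproduction of any included method: attention_pool builds a slide-level feature from patch embeddings before modelling, while aggregate_predictions combines patch-level model outputs into a slide-level prediction after modelling.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(patch_features, attention_scores):
    """Aggregate patch embeddings (N, D) into one slide-level feature (D,)
    by attention-weighted averaging; in practice the scores would come from
    a small learned attention network."""
    weights = softmax(attention_scores)   # (N,)
    return weights @ patch_features       # (D,)

def aggregate_predictions(patch_probs, method="mean"):
    """Combine patch-level positive-class probabilities (N,) into a
    slide-level prediction."""
    if method == "max":
        return patch_probs.max()
    if method == "mean":
        return patch_probs.mean()
    if method == "vote":                  # majority vote over thresholded patches
        return float((patch_probs > 0.5).mean() > 0.5)
    raise ValueError(f"Unknown aggregation method: {method}")
```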

Most studies included segmentation at some stage, with many of these analysing tumour/stain segmentation as a model outcome32,36,37,47,49,50,51,52,53,54. Some other studies used segmentation to determine regions of interest for further modelling, either simply separating tissue from background15,18,44,45, or using tumour segmentation to select the most relevant tissue regions33,34,35,55,56. One study also used segmentation to detect individual cells for classification57. Some studies also used segmentation in determining hand-crafted features relating to the quantity and morphology of different tissues, cells, and nuclei17,18,21,24,30,31.

While attention-based approaches have been applied to other malignancies for several years58,59, they were only seen in the most recent ovarian cancer studies20,28,33,34,35,41,42,44,45, and none of the methods included self-attention, an increasingly popular method for other malignancies60. Most models were deterministic, though hidden Markov trees51, probabilistic boosting trees52, and Gaussian mixture models61 were also used. Aside from the common use of low-resolution images to detect and remove non-tissue areas, images were typically analysed at a single resolution, with only six papers including multi-magnification techniques in their models of interest. Four of these combined features from different resolutions for modelling29,30,36,48, and the other two used different magnifications for selecting informative tissue regions and for modelling33,34. Out of the papers for which it could be determined, the most common modelling magnifications were ×20 (35/41) and ×40 (7/41). Few models integrated histopathology data with other modalities (6/80). Multi-modal approaches included the concatenation of separately extracted uni-modal features before modelling21,23,24, the amalgamation of uni-modal predictions from separate models17, and a teacher–student approach where multiple modalities were used in model training but only histopathology data was used for prediction22.
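As a schematic example of the feature-concatenation style of multi-modal fusion described above (a sketch with hypothetical feature dimensions and randomly generated data, rather than a reproduction of any included method):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical pre-extracted feature vectors for 100 patients:
# slide-level histopathology embeddings and genomic profiles.
histology_features = rng.normal(size=(100, 256))
genomic_features = rng.normal(size=(100, 50))
labels = rng.integers(0, 2, size=100)

# Late fusion: concatenate uni-modal features before a single classifier.
fused = np.concatenate([histology_features, genomic_features], axis=1)
clf = LogisticRegression(max_iter=1000).fit(fused, labels)
```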

Analysis in included literature

Analyses were limited, with less than half of the model outcomes being evaluated with cross-validation (39/80) and very few externally validated using independent ovarian cancer data (7/80), despite small internal cohort sizes. Cross-validation methods included k-fold (22/39) with 3–10 folds, Monte Carlo (12/39) with 3–15 repeats, and leave-one-patient-out cross-validation (5/39). Some other papers included cross-validation on the training set to select hyperparameters but used only a small unseen test set from the same data source for evaluation. Externally validated models were all trained with WSIs, with validations performed either on TMA cores (2/7) or on WSIs from independent data sources (5/7), with two of these explicitly using different scanners to digitise internal and external data27,28. Some reported methods were externally validated with data from non-ovarian malignancies, but none of these included ovarian cancer data in any capacity and so were not included in the review. However, one method was trained with only gastrointestinal tumour data and externally validated with ovarian tumour data16.
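The internal validation strategies described above can be implemented with standard tooling. The sketch below (an illustration under assumed data structures, not code from any included study) shows patient-grouped k-fold and Monte Carlo cross-validation, grouping by patient so that slides from the same patient never appear in both training and test sets:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

# Hypothetical slide-level feature matrix, labels, and patient identifiers.
X = np.random.rand(200, 64)
y = np.random.randint(0, 2, size=200)
patient_ids = np.random.randint(0, 80, size=200)

# Patient-grouped 5-fold cross-validation.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=patient_ids):
    pass  # fit and evaluate a model on this split

# Monte Carlo cross-validation: 10 random 80/20 splits, still grouped by patient.
mc = GroupShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
for train_idx, test_idx in mc.split(X, y, groups=patient_ids):
    pass  # fit and evaluate a model on this split
```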

Most classification models were evaluated using accuracy, balanced accuracy, and/or area under the receiver operating characteristic curve (AUC), with one exception where only a p-value was reported measuring the association between histological features and transcriptomic subtypes based on a Kruskal–Wallis test19. Some models were also evaluated using the F1-score, which we chose not to tabulate (in Table 2) as the other metrics were reported more consistently. Survival model performance was typically reported using AUC, with other metrics including p-values, accuracy, hazard ratios, and the C-index, which is similar to AUC but can account for censoring. Segmentation models were almost all evaluated differently from each other, with different studies reporting AUC, accuracy, Dice coefficient, intersection over union, sensitivity, specificity, and qualitative evaluations. Regression models were all evaluated using the coefficient of determination (R2-statistic). For some models, performance was broken down per patient39,61, per subtype16, or per class15,24,32,57, without an aggregated, holistic measure of model performance.
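For reference, the commonly reported metrics above can be computed as in the following minimal sketch, which uses scikit-learn for the classification metrics, the third-party lifelines package as one option for the C-index, and a manual Dice coefficient; the toy data are purely illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, balanced_accuracy_score, f1_score
from lifelines.utils import concordance_index

# Toy classification results: true labels, predicted probabilities, and classes.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.2, 0.4, 0.9, 0.7, 0.6, 0.3])
y_pred = (y_prob > 0.5).astype(int)

print(roc_auc_score(y_true, y_prob))            # AUC
print(balanced_accuracy_score(y_true, y_pred))  # balanced accuracy
print(f1_score(y_true, y_pred))                 # F1-score

# C-index for survival predictions (accounts for censoring). Higher risk should
# mean shorter survival, so risk scores are negated before computing concordance.
times = np.array([5.0, 12.0, 3.0, 9.0, 7.0, 15.0])
events = np.array([1, 0, 1, 1, 0, 1])           # 1 = event observed, 0 = censored
risk = np.array([0.8, 0.2, 0.9, 0.4, 0.5, 0.1])
print(concordance_index(times, -risk, events))

# Dice coefficient for binary segmentation masks.
def dice(pred_mask, true_mask):
    intersection = np.logical_and(pred_mask, true_mask).sum()
    return 2 * intersection / (pred_mask.sum() + true_mask.sum())
```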

The variability of model performance was infrequently reported (33/94), and when it was reported it was often incomplete, either because it was unclear what the intervals represented (95% confidence interval, one standard deviation, variance, etc.) or because the exact bounds of the interval were not stated and could only be inferred from plots. Within the entire review, there were only three examples in which variability was reported during external validation27,38,39, only one of which clearly reported both the bounds and the type of the interval38. No studies performed any Bayesian form of uncertainty quantification. Reported results are shown in Table 2, though direct comparisons between the performance of different models should be treated with caution due to the diversity of data and validation methods used to evaluate different models, the lack of variability measures, the consistently high risks of bias, and the heterogeneity in reported metrics.

Discussion

The vast majority of published research on AI for diagnostic or prognostic purposes in ovarian cancer histopathology was found to be at a high risk of bias due to issues within the analyses performed. Researchers often used a limited quantity of data and conducted analyses on a single train-test data split without using any methods to account for overfitting and model optimism (cross-validation, bootstrapping, external validation). These limitations are common in gynaecological AI research using other data types, with recent reviews pointing to poor clinical utility caused by predominantly retrospective studies using limited data62,63 and limited methodologies with weak validation, which risk model performance being overestimated64,65.

The more robust analyses included one study in which several relevant metrics were evaluated using 10 repeats of Monte Carlo cross-validation on a set of 406 WSIs, with standard deviations reported for each metric26. Other positive examples included the use of both internal cross-validation and external validation for the same outcome, giving a more rigorous analysis28,34,39. While external validations were uncommon, those which were conducted offered real insight into model generalisability, with a clear reduction in performance on all external validation sets except one28. The only study which demonstrated high generalisability had the largest training set of all externally validated approaches, used more extensive data labelling than many similar studies, and implemented a combination of three colour normalisation approaches, indicating that these factors may benefit generalisability.

Studies frequently had an unclear risk of bias within the participants and predictors domains of PROBAST due to incomplete reporting. Frequently missing information included where patients were recruited, how many patients were included, how many samples/images were used, whether any patients/images were excluded, and the methods by which tissue was processed and digitised. Reporting was often poor regarding open-access datasets. Only three papers were found to be at low risk of bias for participants, with these including clear and reasonable patient recruitment strategies and selection criteria, serving as positive examples for other researchers37,43,56. Information about the predictors (histopathology images and features derived thereof) was generally better reported, but still often missed key details, making it unclear whether all tissue samples were processed similarly to avoid risks of bias from visual heterogeneity. When patient characteristics were reported, they often indicated a high risk of bias. Many studies included very small numbers of patients who differed from the majority in specific ways (e.g. fewer than 20 patients with a different cancer subtype to the majority), risking spurious correlations and results which do not generalise to the wider population.

Reporting was particularly sparse in studies which used openly accessible data, possibly indicating that AI-focused researchers were not taking sufficient time to understand these datasets and ensure their research was clinically relevant. For example, many of the researchers who used TCGA data included frozen tissue sections without commenting on whether this was appropriate, despite the fact that pathologists do not consider them to be of optimal diagnostic quality. One paper handled TCGA data more appropriately, with a clear explanation of the positives and negatives of the dataset, and entirely separate models for FFPE and frozen tissue slides15.

Sharing code can help to mitigate the effects of incomplete reporting and drastically improve reproducibility, but only 19 of the 45 papers did this, with some of these appearing to be incomplete or inaccessible. The better code repositories included detailed documentation to aid reproducibility, including environment set-up information16,19, overviews of included functions17,36,42, and code examples used to generate reported results57.

Two papers were found to have major discrepancies between the reported data and the study design, indicating much greater risks of bias than those seen in any other research46,54. In one paper46, it was reported that TCGA-OV data was used for subtyping with 5 classes, despite this dataset only including high-grade serous and low-grade serous carcinomas. In the other paper54, it was reported that TCGA-OV data was used for slide-level classification into ovarian cancer and non-ovarian cancer classes using PAS-stained tissue, despite TCGA-OV only containing H&E-stained ovarian cancer slides.

Limitations of the review

While the review protocol was designed to reduce biases and maximise the quantity of relevant research included, there were some limitations. This review is restricted to published literature in the English language; however, AI research may be published in other languages or made available as pre-prints without publication in peer-reviewed journals, making this review incomplete. While most of the review process was completed by multiple independent researchers, the duplicate detection was performed by a single researcher, raising the possibility of errors at this step resulting in incorrect exclusions. Due to the significant time gap between the initial and final literature searches (approximately 12 months), there may have been inconsistencies in interpretations, both for data extraction and risk of bias assessments. Finally, this review focused only on light microscopy images of human histopathology samples relating to ovarian cancer, so may have overlooked useful literature outside of this domain.

Development of the field

The field of AI in ovarian cancer histopathology diagnosis is rapidly growing, with more research published since the start of 2020 than in all preceding years combined. The earliest research, published between 2010 and 2013, used hand-crafted features to train classical ML methods such as SVMs. These models were used for segmentation49,50,51,53, malignancy classification18,52, grading21, and overall survival prediction21. Most of these early studies focused on IHC-stained tissue (5/7), which would be much less commonly used in subsequent research (6/38).

The field was relatively dormant in the following years, with only six papers published between 2014 and 2019, half of which had the same primary author29,30,48. These models still used traditional ML classifiers, though some used learned features rather than hand-crafted features. The models developed were used for histological subtyping29,30,48 and cellular/tissue classification40,57,66.

Since 2020, there has been a much greater volume of research published, most of which has involved the use of deep neural networks for automatic feature extraction and classification, with a minority using traditional machine learning models17,24,31,61,67. Recent research has investigated a broader array of diagnostic outcomes, including the classification of primary cancer type15,54, mutation status24,36,44, homologous recombination deficiency status42, tumour–stroma reaction level32, transcriptomic subtypes19,24, microsatellite instability24, and epithelial-mesenchymal transition status41. Three additional prognostic outcomes have been predicted in more recent literature—progression-free survival17,43,56, relapse43,44, and treatment response19,22,33,34,35.

Despite progress within a few specific outcomes, there was no obvious overall trend in the sizes of datasets used over time, either in terms of the number of slides or the number of participants. Similarly, there was no evidence that recent research included more rigorous internal validations, though external validations have been increasing in frequency—no research before 2021 included any external validation with ovarian cancer data, but seven studies published more recently did16,24,27,28,34,38,39. While these external validations were typically limited to small quantities of data, the inclusion of any external validation demonstrates progress from previous research. Such validations are essential to the clinical utility of these models as real-world implementation will require robustness to different sources of visual heterogeneity, with variation occurring across different data centres and within data centres over time. As this field continues to mature, we hope to see more studies conduct thorough validations with larger, high-quality independent datasets, including clearly reported protocols for patient recruitment and selection, pathology slide creation, and digitisation. This will help to reduce the biases, limited reproducibility, and limited generalisability identified in most of the existing research in this domain.

Current limitations and future recommendations

A large proportion of published work did not provide sufficient clinical and pathological information to assess the risk of bias. It is important that AI researchers thoroughly report data provenance to understand the extent of heterogeneity in the dataset, and to understand whether this has been appropriately accounted for in the study design. Modelling and analysis methods must also be thoroughly reported to improve reliability and reproducibility. Researchers may find it useful to refer to reporting checklists, such as transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD)68, to ensure that they have understood and reported all relevant details of their studies. In many studies, it is not clear how AI would fit in the clinical workflow, or whether there are limitations in how these methods could be applied. AI researchers should ensure they understand the clinical context of their data and potential models before undertaking research to reduce bias and increase utility. Ideally, this will involve regular interactions with expert clinicians, including histopathologists and oncologists.

To further improve reproducibility, we recommend that researchers make code and data available where possible. It is relatively easy to publish code and generate documentation to enhance usability, and there are few drawbacks to doing so when publishing research. Making data available is often more difficult due to data security requirements and potential storage costs, but it can provide benefits beyond the primary research of the original authors. Digital pathology research in ovarian cancer is currently limited by the lack of openly accessible data, leading to over-dependence on TCGA, and causing many researchers to painstakingly collate similar but distinct datasets. These datasets often contain little of the heterogeneity seen in multi-centre, multi-scanner data, making it difficult for researchers to train robust models or assess generalisability. Where heterogeneous data are included, they often comprise small quantities that differ from the majority, introducing risks of bias and confounding rather than helping to overcome these issues. TCGA-based studies are prone to this, with significant differences between TCGA slides originating from different data centres69, but with many of these centres only providing small quantities of data. Many researchers are reliant on open-access data, but there is a severe shortage of suitable open-access ovarian cancer histopathology data. Making such data available, with detailed protocols describing data creation, would allow researchers to conduct more thorough analyses and significantly improve model generalisability and clinical implementability.

For AI to achieve clinical utility, it is essential that more robust validations are performed, especially considering the limitations of the available datasets. We recommend that researchers always conduct thorough analyses, using cross-validation, bootstrapping, and/or external validation to ensure that results are robust and truly reflect the ability of their model(s) to generalise to unseen data, rather than arising by chance. This should include reporting the variability of results (typically as a 95% confidence interval), especially when comparing multiple models, to help distinguish whether one model is genuinely better than another or whether the difference is due to chance. Statistical tests can also be beneficial for these evaluations. Another option for capturing variability is Bayesian uncertainty quantification, which can be used to separate aleatoric (inherent) and epistemic (modelling) uncertainty.
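One simple way to report such variability is a percentile bootstrap confidence interval over the test set, sketched below. This is an illustration only, and patient-level resampling would be preferable when patients contribute multiple slides.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_prob, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for test-set AUC."""
    rng = np.random.default_rng(seed)
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        if len(np.unique(y_true[idx])) < 2:              # AUC needs both classes
            continue
        scores.append(roc_auc_score(y_true[idx], y_prob[idx]))
    lower, upper = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper
```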

Current literature in this field can be largely characterised as model prototyping with homogeneous retrospective data. Researchers rarely consider the reality of human-machine interaction, perhaps believing that these models are a drop-in replacement for pathologists. However, these models perform narrow tasks within the pathology pipeline and do not take into consideration the clinical context beyond their limited training datasets and siloed tasks. We believe these models would be more beneficial (and more realistic to implement) as assistive tools for pathologists, providing secondary opinions or novel ancillary information. While current research is typically focused on assessing model accuracy without any pathologist input, different study designs could be employed to better assess the real-world utility of these models as assistive tools. For example, usability studies could investigate which models are most accessible and most informative to pathologists in practice, and prospective studies could quantify any benefits to diagnostic efficiency and patient outcomes, and investigate the robustness of models in practice. Understanding the effects of AI on the efficiency of diagnosis is particularly important given the limited supply of pathologists worldwide. As such, this type of research will significantly benefit clinical translation.

Summary of recommendations

To improve clinical utility, researchers should understand their data and ensure planned research is clinically relevant before any modelling, ideally involving clinicians throughout the project. They should also consider different study designs, including usability studies and/or prospective studies. When evaluating models, researchers should conduct thorough analyses using cross-validation, external validation, and/or bootstrapping. When reporting research, researchers should clearly report the context of any histopathology data, including how patients were recruited/selected, and how tissue specimens were processed to generate digital pathology images. Finally, researchers should make all code openly accessible, and make data available where possible.

Methods

Literature search

Searches were conducted in three research databases, PubMed, Scopus and Web of Science, and two trial registries, Cochrane Central Register of Controlled Trials (CENTRAL) and the World Health Organisation International Clinical Trial Registry Platform (WHO-ICTRP). The research databases only include journals and conference proceedings which have undergone peer review, ensuring the integrity of included research. The initial searches were performed on 25/04/2022 and were most recently repeated on 19/05/2023. The search strategy was composed of three distinct aspects—artificial intelligence, ovarian cancer, and histopathology. For each aspect, multiple relevant terms were combined using the OR operator (e.g. “artificial intelligence” OR “machine learning”), and then these were combined using the AND operator to ensure that retrieved research met all three aspects. The widest possible set of search fields was used for each search engine except for Scopus, where restrictions were imposed to avoid searching within the citation list of each article, which is not an available field in the other search engines. The terms “ML” and “AI” were restricted to specific fields due to the diversity of their possible meanings. To ensure the most rigorous literature search possible, no restrictions were placed on the publication date or article type during searching.
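As a purely hypothetical illustration of this three-aspect structure (the actual terms and field restrictions used for each database are given in Supplementary Note 1):

```python
# Hypothetical search terms for illustration only; the real strategies are in
# Supplementary Note 1.
ai_terms = ["artificial intelligence", "machine learning", "deep learning", "neural network"]
cancer_terms = ["ovarian cancer", "ovarian carcinoma", "ovarian neoplasm"]
path_terms = ["histopathology", "histology", "whole slide image", "digital pathology"]

def any_of(terms):
    """Combine the terms for one aspect with the OR operator."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

# Combine the three aspects with the AND operator.
query = " AND ".join(any_of(terms) for terms in (ai_terms, cancer_terms, path_terms))
print(query)
```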

Many AI approaches build on statistical models, such as logistic regression, which can blur the lines between disciplines. When conducting searches, a previously reported methodology was adopted70 whereby typical AI approaches were searched by name (e.g. neural networks), and other methods were searched by whether the authors described their work as artificial intelligence. Full details of the search implementation for each database are provided in Supplementary Note 1. The review protocol was registered with PROSPERO before the search results were screened for inclusion (CRD42022334730).

Literature selection

One researcher (J.B.) manually removed duplicate papers with the assistance of the referencing software EndNote X9. Two researchers (J.B., K.A.) then independently screened articles for inclusion in two stages, the first based on title and abstract, the second based on full text. Disagreements were discussed and arbitrated by a third researcher (N.R. or N.M.O.). Trials in WHO-ICTRP do not have associated abstracts, so for these studies, only titles were available for initial screening.

The inclusion criteria required that research evaluated the use of at least one AI approach to make diagnostic or prognostic inferences on human histopathology images from suspected or confirmed cases of ovarian cancer. Studies were only included where AI methods were applied directly to the digital pathology images, or to features which were automatically extracted from the images. Fundamental tasks, such as segmentation and cell counting, were included as these could be used by pathologists for computer-aided diagnosis. Only conventional light microscopy images were considered, with other imaging modalities, such as fluorescence and hyperspectral imaging, excluded. Publications which did not include primary research were excluded (such as review papers). Non-English language articles and research where a full version of the manuscript was not accessible were excluded.

A model in an included study was considered to be a model of interest if it met the same inclusion criteria. Where multiple models were compared against the same outcome, the model of interest was taken to be the newly proposed model, or the best-performing model during validation if this was unclear. If multiple model outcomes were assessed in the same study, a model of interest was taken for each model outcome, regardless of any similarity in modelling approaches. The same model outcome at different levels of precision (e.g. patch-level, slide-level, patient-level) was not considered to constitute different model outcomes. Models did not need to be entirely independent; for example, the output of one model of interest could have been used as the input of another model of interest on the condition that model performance was separately evaluated for each model.

Risk of bias assessment

The risk of bias was assessed for models of interest using the Prediction model Risk Of Bias ASsessment Tool (PROBAST)71, where risk of bias is the chance of reported results being distorted by limitations within the study design, conduct, and analysis. It includes 20 guiding questions which are categorised into four domains (participants, predictors, outcome, and analysis), each of which is summarised as high risk, low risk, or unclear in cases where there is insufficient information to make a comprehensive assessment and none of the available information indicates a high risk of bias. As such, an unclear risk of bias does not indicate methodological flaws, but incomplete reporting.

The participants domain covers the recruitment and selection of participants to ensure the study population is consistent and representative of the target population. Relevant details include the participant recruitment strategy (when and where participants were recruited), the inclusion criteria, and how many participants were recruited.

The predictors domain covers the consistent definition and measurement of predictors, which in this field typically refers to the generation of digital pathology images. This includes methods for fixing, staining, scanning, and digitally processing tissue before modelling.

The outcome domain covers the appropriate definition and consistent determination of ground-truth labels. This includes the criteria used to determine diagnosis/prognosis, the expertise of any persons determining these labels, and whether labels are determined independently of any model outputs.

The analysis domain covers statistical considerations in the evaluation of model performance to ensure valid and not unduly optimistic results. This includes many factors, such as the number of participants in the test set with each outcome, the validation approaches used (cross-validation, external validation, bootstrapping, etc.), the metrics used to assess performance, and the methods used to overcome the effects of censoring, competing risks/confounders, and missing data. The risks caused by some of these factors are interrelated; for example, the risk of bias from using a small dataset is somewhat mitigated by cross-validation, which increases the effective size of the test set and can be used to assess variability, reducing optimism in the results. Further, the risk caused by using a small dataset depends on the type of outcome being predicted; for example, more data are required for a robust analysis of 5-class classification than for binary classification. There must also be sufficient data within all relevant patient subgroups; for example, if multiple subtypes of ovarian cancer are included, there must not be a subtype that is represented by only a few patients. Due to these interrelated factors, there are no strict criteria to determine the appropriate size of a dataset, though fewer than 50 samples per class or fewer than 100 samples overall is likely to be considered high-risk, and more than 1000 samples overall is likely to be considered low-risk.

Risks of bias often arise due to inconsistent methodologies. Inconsistency in the participants and predictors domains may cause heterogeneity in the visual properties of digital pathology slides which may lead to spurious correlations, either through random chance or systematic differences between subgroups in the dataset. Varied data may be beneficial during training to improve model generalisability when using large datasets, though this must be closely controlled to avoid introducing systematic confounding. Inconsistent determination of the outcome can mean that the results of a study are unreliable due to spurious correlations in the ground truth labels, or invalid due to incorrect determination of labels.

While PROBAST provides a framework to assess risks of bias, there is some level of subjectivity in the interpretation of signalling questions. As such, each model was analysed by three independent researchers (any of J.B., K.A., N.R., K.Z., N.M.O.), with at least one computer scientist and one clinician involved in the risk of bias assessment for each model. The PROBAST applicability of research analysis was not implemented as it is unsuitable for such a diverse array of possible research questions.

Data synthesis

Data extraction was performed independently by two researchers (J.B., K.A.) using a form containing 81 fields within the categories Overview, Data, Methods, Results, and Miscellaneous. Several of these fields were added or clarified during data extraction with the agreement of both researchers and retroactively applied to all accepted literature. The final data extraction form is available at www.github.com/scjjb/OvCaReview, and is summarised in Supplementary Table 1.

Information was sought from full-text articles, as well as references and supplementary materials where appropriate. Inferences were made only when both researchers were confident that this gave the correct information, with disagreements resolved through discussion. Fields which could not be confidently completed were labelled as being unclear.

All extracted data were summarised in two tables, one each for study-level and model-level characteristics. Only models of interest were included in these tables. The term model outcome refers to the model output, whether this was a clinical outcome (diagnosis/prognosis), or a diagnostically relevant outcome that could be used for computer-aided diagnosis, such as tumour segmentation. The data synthesis did not include any meta-analysis due to the diversity of included methods and model outcomes. The PRISMA 2020 guidelines for reporting systematic reviews were followed, with checklists provided in Supplementary Tables 2 and 3.