Germline BRCA 1-2 status prediction through ovarian ultrasound images radiogenomics: a hypothesis generating study (PROBE study)

Radiogenomics is a specific application of radiomics where imaging features are linked to genomic profiles. We aim to develop a radiogenomics model based on ovarian US images for predicting germline BRCA1/2 gene status in women with healthy ovaries. From January 2013 to December 2017 a total of 255 patients addressed to germline BRCA1/2 testing and pelvic US documenting normal ovaries, were retrospectively included. Feature selection for univariate analysis was carried out via correlation analysis. Multivariable analysis for classification of germline BRCA1/2 status was then carried out via logistic regression, support vector machine, ensemble of decision trees and automated machine learning pipelines. Data were split into a training (75%) and a testing (25%) set. The four strategies obtained a similar performance in terms of accuracy on the testing set (from 0.54 of logistic regression to 0.64 of the auto-machine learning pipeline). Data coming from one of the tested US machine showed generally higher performances, particularly with the auto-machine learning pipeline (testing set specificity 0.87, negative predictive value 0.73, accuracy value 0.72 and 0.79 on training set). The study shows that a radiogenomics model on machine learning techniques is feasible and potentially useful for predicting gBRCA1/2 status in women with healthy ovaries.


Methods
Study design. This is an observational, single center study with patients retrospectively enrolled at the Fondazione Policlinico Universitario Agostino Gemelli IRCCS of Rome, Italy from January 2013 to December 2017.
A radiomics machine learning approach, based on features extracted from US images of the ovaries, was proposed to predict gBRCA1/2 carrier status.
The retrospective data on gBRCA1/2 testing performed on patients with NGS technique was considered as the reference standard of the radiomics analysis.
Transvaginal pelvic US was performed by dedicated gynecologic sonographers according to current international indications 32,33 .
This study was conducted in accordance with the declaration of Helsinki and was approved by the the Fondazione Policlinico Universitario Agostino Gemelli IRCCS ethics committee (Prot. 50543/19; ID: 2907), with the requirement for informed consent.
Study population. The inclusion criteria were defined as follows: (i) patients recommended for BRCA testing according to international guidelines 7,17 and (ii) patients who underwent gBRCA1/2 testing in our institution (iii) and patients who underwent transvaginal pelvic US performed in our institution providing at least one picture of one healthy ovary and (iv) patient's images had to be stored in .dicom format.
The exclusion criteria were (i) personal diagnosis of OC or (ii) ovarian abnormal findings (eg. ovarian endometriomas, dermoid cyst, ovarian cystoadenofibroma, ovarian borderline tumour…) but functional cysts at pelvic US or (iii) unavailable results of gBRCA1/2 testing or (iv) refusal to provide written informed consent. gBRCA1/2 testing data acquisition. Recommendations for genetic counseling were provided according to international guidelines 7,17 .
All patients included in the study received oncogenetic counseling before BRCA testing and signed written informed consent for BRCA analysis. Standardized procedures according to previously published workflows were observed 34,35 .
DNA for genotyping was isolated from whole blood samples by a non-automated method based on a commercial kit (Roche Diagnostics, Basel, Switzerland), and spectral analysis was performed using a NanoPhotometer (Implen, München, Germany). In order to achieve the highest efficiency in multiplex polymerase chain reaction (PCR), DNA was extracted just before the amplification step.
In addition, horizon reference material was also used 36 and the bioinformatics pipeline was validated using reference materials (EMQN scheme) 37 .
As previously reported, the bioinformatic "Amplicon Suite" CE-IVD tool (SmartSeq s.r.l., Novara, Italy) was used to determine depth of region covered, copy number status and variant calling 38 .
Variants with disrupting effects, such as frameshift, nonsense, canonical-splice site and such missense, along with large rearrangements, which were overall unequivocally reported as "pathogenic (class 5)", within the main databases based on multiple lines of evidence and expert reviewing board (like ENIGMA Consortium) and the International Agency for research on Cancer (IARC) annotation, were assigned to the gBRCA1/2 PV group. www.nature.com/scientificreports/ All DNA samples resulting as negative for gBRCA1/2 alterations at the massive parallel sequencing, were further analyzed by multiplex ligation dependent probe amplification (MLPA) or multiplex amplicon quantification assays 39 .
All amplicons with coverage under 38x, and all gBRCA1/2 variants that were classified as damaging (class 5), probably damaging (class 4) and variants of uncertain significance (class 3, VUS) were confirmed by targeted Sanger sequencing. Finally, all positive results detected by MLPA on two independent DNA samples were confirmed by long-range PCR (Expand Long Template PCR System, Roche Applied Science) 40 .
Pelvic US imaging protocol. All pelvic US examinations were performed by dedicated gynecologic oncologists with more than 10 years of experience in gynecological ultrasound. The examinations were carried out using 4 different high quality ultrasound equipment (GE Healthcare Voluson E10, Canon Toshiba Aplio-i900, Samsung Elite, Esaote-My LabTM Twice). Only transvaginal US probes were used to acquire images (7-10 MHz) using a standardized examination technique 41 .
When performing ovarian examination with real-time 2D-US, the patient was positioned in lithotomy position, with empty bladder and the operator scanned the ovary in both axial and coronal planes, to identify the best image.
The ovary had to occupy at least 50% of the screen along its largest axis. Three orthogonal maximum diameters of the ovary (total length; antero-posterior measurement on the same scan; transverse length in a perpendicular image) for a maximum of 2 images per ovary were required, for a maximum of 2 images per ovary if possible. Current international indications were followed by all sonographers 32,33 . Other ultrasound features described were: morphology, echostructure and eventually abnormal vascularitation. The presence of functional cysts or the absence of one ovary, were was not considered an exclusion criterion. The presence of ovarian tumors, benign, uncertain or suspicious for malignant, according to the IOTA classification led to the exclusion from the current analysis 41 . In case of artifact, such as bowel gas shadowing, the patients were excluded if the images were not clear for analysis. Image postprocessing. Code for the project is available at https ://githu b.com/kbola b/PROBE . The repository contains the scripts for all the different part of the analysis, from ROI extraction, to feature computation, to statistical analysis. Everything regarding radiomics processing is a direct porting of the R library Moddicom.
The regions of interest (ROI) were manually delineated on the US image via the software Aliza version 1.48 by a trained gynecologist (Fig. 1) 42 .
Guidelines describing how to define the boundaries of the ovary were provided before starting ROI delineation and two senior dedicated gynecologic sonographers with at least 10 years' experience in pelvic US imaging were responsible for the evaluation and independent check of ovarian segmentation.
Since the BRCA status outcome was associated to the patient, and each patient could be associated with a number of images varying from 1 to 4, the corresponding DICOM folder structure was organized in a hierarchical way so that a single ROI was labeled with the following set of tags: US machine, patient ID, left or right ovary, longitudinal or coronal plane. Each leaf folder structure contained therefore two files: the original DICOM and the ROI mask DICOM.
All the image files had been previously anonymized according to the existing privacy regulations. www.nature.com/scientificreports/ US radiomics feature extraction and validation. The original DICOM file and the corresponding ROI segmentation masks were uploaded on the MODDICOM platform, an in-house developed software for radiomics analysis 43 . The software extracted the pixel values contained inside the ROI boundaries and computed 232 radiomics features. Prior to feature extraction, all the ROIs were rescaled to pixel value histograms between 0 and 255.
After that, four groups of imaging features were extracted from each normalized US image with manually segmented ROIs: 20 first order statistics, 14 morphological, 183 texture and 15 fractal features. First order statistics features describe the properties of the image grey level histogram considered as a whole. Morphological features describe the properties of the ROI in space. Texture features describe the properties of the local distribution of grey levels inside the ROI. Fractal features compute the fractal dimension through the box-counting algorithm. The software implementation, was entirely validated within the Image Biomarker Standardization Initiative 44 to ensure reproducibility and methodological robustness of the radiomics features extraction pipeline.
Statistical analysis. Feature-outcome association in univariate analysis. Radiomics features are often strongly correlated with each other, so that one can identify clusters of similarly distributed features through the pair-wise Pearson correlation test.
This approach is useful to reduce features dimensionality; the higher the number of features tested against an outcome, the higher the probability of over-fitting the data or getting a significant result just by chance will be 45 .
The feature selection for univariate analysis was carried out via correlation analysis to remove all but one feature among groups of highly correlated features by setting a Pearson correlation threshold of 0.9. The remaining features were tested for association with the outcome with the Wilcoxon-Mann-Whitney test. A p-value of 0.05 was considered as statistically significant.
The number of significant test results was then compared to the expected number of type I errors to account for multiple testing 46 .
Machine learning-based models. Multivariable analysis for classification of gBRCA1/2 status was carried out via four strategies involving machine learning techniques: logistic regression (A), support vector machine (B), ensemble of decision trees (C) and automated machine learning pipelines (D). The corresponding models were trained on the 75% of the data, while the remaining 25% of data was left out for testing purposes.
A first feature selection was carried out for each strategy with correlation analysis to remove all but one highly correlated feature by setting a Pearson correlation threshold of 0.9, as performed in the univariate analysis.
Strategies A and B involved logistic regression and radial kernel support vector machine, respectively: both strategies had a forward feature selection based Cohen Kappa values at the 0.75 splitted training set ROC Youden index.
We used Cohen's Kappa as confusion matrix evaluation metrics to take into account the class imbalance and the related fact that expected accuracy is "skewed" towards values higher than 50%. Also, for the same reason, this metric allows for direct comparison between results obtained on datasets with different class distribution, so that for future work on external validation or prospective data the classification performances will be directly comparable with the ones reported in this paper.
Detailed steps of these iterative strategies, starting from n equals to 1, are: i. All the possible different models with n features are trained using all the different groups of n features; ii. The ROC curve on the training set is computed for every n feature model; iii. The optimal cut-off point according to Youden index will be found on every ROC curve of ii) and the Cohen Kappa value of the classification matrix at the optimal point will be computed; iv. The model with the highest value of the Cohen Kappa computed in iii) is then held for the next iteration, while all the others are discarded; v. All the possible different models with n + 1 features are trained adding to the n features selected in the previous step, all the features left available, one by one; vi. The process is then repeated until the Cohen's Kappa value does not increase by adding a new feature to the model. 46 with extensive grid search on the hyperparameters space and fivefold cross validation error on the 0.75 split training set as the metric to be minimized. The selected model was then tested on the remaining 25% of data.
Auto Machine Learning for classification with TPOT Python library 47 has been used in strategy D, testing the spectrum of classification models and their hyperparameters with a heuristic search conducted through a genetic algorithm with the fivefold cross validation error on the 0.75 splitted training set as the metric to be minimized.
The genetic algorithm was initialized with TPOT default parameters. The selected model was then tested on the remaining 25% of data.
All statistical analyses were conducted with R version 3.4.4 48 and Python version 3.7.4 49 .

Results
Clinical characteristics of the study cohort. Moreover, 684 patients out of 1756 underwent pelvic US at our institution. Pelvic US was part of the annual gynecologic assessment worldwide recommended for the optimal promotion of women's health 50 .
Patients who did not have US images stored as .dicom files and those with abnormal findings regarding the ovaries at US imaging were excluded from the study. Even in cases of unilateral lesion, we preferred not to use for analysis the contra lateral ovary in order to reduce possible bias.
A total of 890 images was finally analyzed in the current study. Figure 2 illustrates the general flowchart of the study. Table 1 reports the clinical characteristics of the 255 included women; no significant differences were found in most clinico-pathologic factors between the testing and training sets.
Overall, the mean age was 45.5 years but the majority of women (62%) were in menopause (both iatrogenic or natural). This is consistent with the fact that the indication to the test was mainly related to personal history of BC (80%). The mean number of images per patient was 3.5, ensuring quality to US approach.
No significant differences were observed in the two image sets.
Correlation and univariate analysis. After the segmentation of the ROIs and the following feature extraction, the final feature set was composed by 232 radiomics features. The outcome was binarized into gBRCA1/2 WT patients (558 images) and gBRCA1/2-ve patients (332 images). gBRCA 1/2 VUS were considered among gBRCA 1/2 PVs because although the pathogenetic potential of this group of mutations is unknown, these patients deserve to be tested and followed up.
A specific dataset for each US machine was also produced for the analysis with the following number of images: Voluson (350 images), Toshiba (190 images), Samsung (221 images), Esaote (134 images).
The proportion of g-BRCA 1-2 pathogenetic cases was comparable in all the datasets. A first feature selection was carried out via correlation analysis on each dataset to reduce the feature space at a Pearson correlation threshold of 0.9, resulting in 61 features for the full dataset, 45 for Voluson, 42 for Toshiba, 44 for Samsung and 40 for Esaote.
The overlapping features identified from each machine after Pearson correlation analysis are shown in Supplementary Table 1.
The radiomics features in these sorted datasets were tested individually in univariate analysis using Wilcoxon-Mann-Whitney test with the binary outcome.
The features which resulted in a statistically significant test were 6 for the full dataset, 16 for the Voluson dataset, 3 for the Toshiba dataset, 3 for the Samsung dataset and 2 for the Esaote dataset. As a percentage of the tested features, these figures represent the 9.8%, 35.0%, 7.1%, 6.8% and 5.0% respectively (as shown in Table 2).
The overlapping statistically significant features identified from each machine after Wilcoxon-Mann-Whitney test, are shown in Supplementary Table 2. www.nature.com/scientificreports/ Since the significance threshold was set at 0.05, we observed that the number of significant features was above the expected value of type I errors in four out five datasets, with a particularly high rate of statistically significant associations for the features coming from Voluson images.
The p-values of the Wilcoxon-Mann-Whitney test for the significant features are reported in Table 3. The vast majority of statistically significant features were texture features: 15 out of 16 for Voluson; 2 out of 2 for Esaote; 2 out of 3 for Samsung; 2 out of 3 for Toshiba. Among these, the most selected texture type of features were co-occurrence matrix and run-length matrix based features.
Performance of machine learning models. Multivariable analysis for gBRCA pathogenetic variant classification was carried out with the four described machine learning strategies (A), (B), (C), and (D). Classifica- www.nature.com/scientificreports/ tion metrics for the different models on the five datasets are reported in Supplementary Table 3 and summarized in Fig. 3. The train-test split was per patient so to prevent information leakage from training to testing set. After the split, the model was trained and tested per image, and its diagnostic performance is reported on an image basis.
Considering the full dataset, the four strategies obtained a similar performance in terms of classification accuracy on the testing set, varying from 0.54 of logistic regression of strategy A to 0.64 of strategy D with the auto-machine learning pipeline.
Strategy D showed also the highest value of specificity on the testing set (0.91) and a negative predictive value of 0.65, comparable to the one obtained with strategy C but lower than the corresponding values obtained with strategies A and B, which were 0.74 and 0.80, respectively.
Considering the Voluson data only, performances were generally higher: in particular, testing set specificity and negative predictive value reached 0.77 and 0.81, respectively, with the radial kernel support vector machine of strategy B; 0.82 and 0.70 with the extreme gradient boosting of strategy C, and 0.87 and 0.73 with the automachine learning pipeline of strategy D.
The latter combination, strategy D on Voluson data, showed also the best consistency between training and testing set accuracy (0.79 and 0.72 respectively) suggesting that the adopted strategy performance was particularly robust to avoid data overfitting, which might improve the classification performances on larger image numbers to obtain both a higher specificity and a higher negative predictive value.

Discussion
In this single center study, we developed an automated machine learning pipeline model with encouraging performances to identify gBRCA1/2 status based on US images of healthy ovaries, acquired on different US machines.
This result can be considered as an instance of "real world data" from the perspective of radiomics analysis and paves the way to possible developments of this model with direct consequences on patients diagnostic workflow and prognostic stratification, in the frame of the most modern personalized medicine paradigms. Table 3. Radiomics features whose association test with the binary outcome was statistically significative with corresponding p-value, for the different datasets. www.nature.com/scientificreports/ More specifically, the described model, as a large-scale screening test, would assign only 13 out of 100 gBRCA1/2 WT patients (with the current value of specificity at 0.87) to unnecessary genetic screening.
In our series, 157 out of 255 patients (61.5%) were assigned to genetic screening based on current clinical selection criteria, but resulted as gBRCA1/2 WT. Not many data have been published on the performance of the current clinical criteria selection in detecting BRCA1/BRCA2 mutations. Nevertheless, as reported by Manchanda et al. 18 , despite 25 years of BRCA testing and effective mechanisms for prevention, current guidelines and access to testing/treatment pathways remain complex and associated with a massive under-utilisation of genetic testing. Only 20% of eligible US women have accessed/ undergone genetic testing. A UK study showed the large majority (> 97%) of BRCA carriers in the population remain unidentified 51 . In addition, current approaches use established clinical-criteria/family-history based a priori BRCA probability thresholds to identify high-risk individuals eligible for BRCA testing. The estimation of likelihood of being carrier of BRCA1/2 PV is based on different tools, with possible selection biases related to different algorithms performing the risk assessment: in such countries the threshold is 20% while in others 10%. Over 50% BRCA carriers do not fulfil clinical criteria and are therefore missed 52 .
Our model, when tested and validated on the general population, could be an extremely cost-effective strategy for a population-based BRCA mutation screening. Nevertheless, the current described negative predictive value of 0.73 implies that 27 out of 100 gBRCA1/2 PV carriers would not be assigned to the needed genetic test, probably representing the most significant limit of this hypothesis generating radiogenomics approach.
Compared to CT or MRI, radiomics on US images is a relatively new and less explored type of analysis. Nonetheless, it is gaining gound [53][54][55][56] .
Most significant features for the screening tool both in univariate analysis and in the machine learning models are texture features, which are often not quite easy to visualize. Nevertheless, among the most influential features (especially considering the Voluson dataset) those related to clusters of high-intensity grey levels are predominant. In particular, the "short run high grey level emphasis" feature is the leading feature in the xgboost model with roughly twice the "gain" value than the second most important feature. Also, the univariate statistical association of this feature with the binary outcome through Wilcoxon-Mann-Whitney test is significative with a www.nature.com/scientificreports/ p-value of order 10 −5 , the sign of the effect being that BRCA mutated group have an higher feature median value (i.e. BRCA mutated images have more short runs, or small clusters, of high grey level intensities). As previously discussed, Voluson machine showed more significant features and higher performances compared with the other machines. The explanation may lie in the fact that Voluson dataset has 60% more records than the biggest among Samsung, Esaote, and Toshiba datasets (which is the Samsung dataset with 260 records). This can cause stronger statistical associations if the effect is there. However, more data with a balanced number of images among various US machines are needed to clarify this aspect.
Based on this analysis, authors believe that this model deserves further investigation to increase its performance and to better describe its translational application in daily clinical practice.
Several issues and limitations have been encountered during the analysis. First, the retrospective nature of the study represents an unavoidable source of selection bias and imaging data inhomogeneity. In particular, patients were not divided in subgroups according to their hormonal status and for fertile age women (36%) the time on which the ultrasound was performed was not established. All these aspects may have impacted on textural radiomic features.
Second, the monocentric nature of the study and the qualification of our institution as a referral center for US in gynecological malignancies may not fully reflect real word data, especially in terms of US image quality.
Third, the absence of an external validation does not allow us to draw any definitive conclusion on the replicability of the model, even if the impact of this limit is reduced thanks to the heterogeneity of the tested population and the use of different US scanners.
Fourth, our patients were only screened for BRCA1/2: no other genes involved in the same homologous recombination repair pathways were included in the analysis. Therefore, we cannot exclude the presence of PVs in the other genes whose mutational status could correlate with US patterns, limiting the clinical impact of the proposed approach. Fifth, the model was developed on women with high risk factors for gBRCA1/2 PVs therefore could be applied only to this clinical setting.
These aspects will be taken into account at best to improve the power of our model in following investigations. Overall, the model could help patients who may have unnecessary genetic tests when identified by the clinical screening method. However, its actual performance, in terms of negative predictive values, does not allow applying it as a reliable screening tool to optimize patient selection for genetic screening of BRCA1/2 genes.
However, the novelty of this radiogenomics application, its potentiality as a screening tool and the encouraging preliminary results support the reliability and the value of our findings.
For these reasons, we are planning to expand the series including patients enrolled in 2018 and 2019, validate the model on external cohorts of patients and finally test it in the general population as a screening tool.
In conclusion, the authors believe that in the next future a radiogenomics approach based on ovarian US images can become a successful screening tool to identify women who are highly likely to have gBRCA1/2 PV and therefore should be assigned to genetic testing and counselling pursuing healthcare cost-containment policies.
This radiogenomic model could represent a cost-effective, highly reproducible, large-scale extendable and time saving screening tool although further investigations are necessary to improve sensitivity and specificity before its effective clinical release.