High throughput proteomics identifies a high-accuracy 11 plasma protein biomarker signature for ovarian cancer

Ovarian cancer is usually detected at a late stage and the overall 5-year survival is only 30–40%. Additional means for early detection and improved diagnosis are acutely needed. To search for novel biomarkers, we compared circulating plasma levels of 593 proteins in three cohorts of patients with ovarian cancer and benign tumors, using the proximity extension assay (PEA). A combinatorial strategy was developed for identification of different multivariate biomarker signatures. A final model consisting of 11 biomarkers plus age was developed into a multiplex PEA test reporting in absolute concentrations. The final model was evaluated in a fourth independent cohort and has an AUC = 0.94, PPV = 0.92, sensitivity = 0.85 and specificity = 0.93 for detection of ovarian cancer stages I–IV. The novel plasma protein signature could be used to improve the diagnosis of women with adnexal ovarian mass or in screening to identify women that should be referred to specialized examination.

O varian cancer is currently the 7th most common cancer across the world with estimated incidences from 4.1 to 11.4 cases per 100,000 women 1 . Since ovarian cancer is commonly caught late, the overall 5-year survival rate is only 30-40%. MUCIN-16 (also known as Cancer antigen 125, CA-125) was introduced as a biomarker for ovarian cancer in 1983 2 and is currently the most important single biomarker for epithelial ovarian cancer managment 3 . MUCIN-16 alone, however, has low sensitivity for early-stage cancer (50-62%) at a specificity of 94-98.5% 3 . The difficulties in establishing highly accurate early diagnoses with non-invasive methods, combined with the low survival rate, justifies that women with a transvaginal ultrasound (TVU) indication of adnexal ovarian mass are commonly diagnosed by surgical sampling. However, the degree of surgical overdiagnosis is high. Among women that were diagnosed by surgical sampling, only 21-30% have OC stage I-IV, while 58% have been reported to have benign tumors and the remaining 15% borderline tumors [4][5][6] . According to the Swedish GynOpRegistry statistics for 2017, 13% of the women with adnexal ovarian mass that underwent surgery developed complications related to the procedure 7 . A non-invasive diagnostic test with higher sensitivity and retained specificity that distinguishes between women with malignant and benign ovarian adnexal mass could be used to avoid over-diagnostic surgery. Application of MUCIN-16 and other biomarkers, including WFDC2 (WAP Four-Disulfide Core Domain 2, also known as HE4-human epididymal protein 4), such as in the ROMA Score (Ovarian Malignancy Risk Algorithm), can increase the sensitivity to 94.8% at a specificity of 75% 8 in patient cohorts with predominantly (74.6%) late stage (III and IV) ovarian cancers. However, the low sensitivity for detection of early-stage ovarian cancer still prohibits population screening using current biomarker tests. A recent study in the UK suggests that multi-modal tests are approaching sufficient accuracy to justify screening from a health-economic stand-point 9 . However, tests with low specificity have a high false positive rate, which will result in unnecessary anxiety and examinations and also an additional cost for the health-care system.
The presently available biomarkers are mainly used to improve diagnosis of women that experience symptoms or when imaging such as TVU or computer tomography (CT) indicate adnexal ovarian mass. The tests/algorithms then triage patients in need of surgery at tertiary cancer centers. Even in this context, identification of clinically useful biomarkers based on single or combination of proteins is challenging. Recent developments of high-throughput technologies for detection and quantification of proteins has made it possible to study thousands of biomarker candidates in a single sample. Skates and colleagues 10 have presented a statistical framework for study design, sample size calculation in discovery and replication stages and for identification of single biomarkers that can distinguish between cases and controls, with special reference to ovarian cancer. They recommend selection of the highest ranking 50 biomarkers from a discovery stage, which are then examined in a replication stage. A smaller set of replicated markers is then used to build a classifier that is tested in clinical validation studies. We have previously shown 11 that plasma protein levels for several protein biomarkers are highly correlated. This implies that sets of proteins can be identified in a discovery stage whose combined predictive power is not greater than their individual contribution. Also, biomarkers that are not significant on their own can increase the predictive power in combination with other, individually significant or nonsignificant, biomarkers. An alternative approach to the framework presented by Skates 10 is to use multivariate methods from the start, searching for combinations of biomarkers that separate cases from controls. Sample size estimates based on statistical power in relation to prediction models with linear regression is however not straightforward, and several suggestions have been presented [12][13][14][15] . All these methods rely on a range of assumptions on underlying distributions of the variables and outcome, the number of variables and expected correlation between the predicted outcomes and the actual outcomes. These factors are commonly unknown a priori, making such calculations difficult before the discovery stage.
One approach for finding optimal combinations of highly predictive biomarkers is to use exhaustive searches, such as the approach taken by Han and colleagues 16 , where 165 combinations of MUCIN-16 and a selection of three out of 11 additional biomarkers were examined for their ability to separate highgrade serous ovarian carcinoma from benign conditions. Such exhaustive approaches quickly become computationally unfeasible when the number of candidate proteins is high. For instance, choosing 4 from 1000 proteins can be done in over 40 billion ways. Another strategy is to use feature selection with machine learning frameworks to select subsets of informative markers from a larger set. Such approaches have previously been used to construct a classifier with 9 proteins selected from 299 in cystfluid separating Type 1 and Type 2 ovarian cancers 17 , or to build a classifier with 12 biomarkers selected from 92 in sera, separating ovarian cancer from healthy controls or benign conditions 18 . This is achieved by splitting the samples into a training and a test set, but with fairly small sample sizes different models are usually generated depending on the subset of samples used for training. To overcome these limitations, we developed a novel analysis strategy based on building models separating ovarian cancer from benign tumors, where we first identify smaller sets of proteins that are robustly selected across several splits, so-called cores. In the second step, we build a model by extending a core with additional proteins that have high predictive power in combination with the specific core.
Here, we aim to identify multiple mutually exclusive biomarker signatures differentiating benign conditions from ovarian cancers at different stages, grades and all histological subtypes. The signatures should be practically useful and therefore contain up to 20 proteins selected from a total of 593 characterized plasma proteins in one discovery cohort and two replication cohorts. We finally identify one model based on 11 biomarkers and age that we implement as a custom multiplex PEA assay reporting in absolute concentrations, and validate its performance in a third independent cohort.

Results
Characterization of plasma proteins. A total of 552 proteins were characterized in the discovery cohort (n = 169, Table 1) and two replication cohorts (n = 248, Table 1) using the proximity extension assays (PEA) with 6 of the Olink Proseek panels (Cardiometabolic, Cell Regulation, Development, Immune Response, Metabolism, and Organ Damage) (Methods). These measurements were combined with a previous study 19 containing data from 5 PEA panels, 460 proteins, in the discovery cohort, bringing the total number of unique proteins included in the analysis to 981. Forty-two of the 460 proteins have also been quantified in the replication cohorts using the proximity extension assay in two custom 21-plex panels as previously described 19 . Following quality controls and normalization (Methods), a common set of 593 proteins (42 proteins from the previous 5 panels and 551 from the additional 6 panels) characterized in all samples were used.
484 distinct predictive models for ovarian cancer. Models were generated using only the discovery data, according to our twostage strategy. First, mutually exclusive protein cores, consisting of a smaller set of proteins, were selected by repeatedly splitting the data into training and test sets and retaining proteins that were present in at least 70% of the models (Methods, Fig. 1a, b). Additional biomarkers were subsequently added to each core using a stepwise forward selection approach (Methods, Fig. 1c). The addition of proteins was terminated when the total model size was 20 proteins, or the next protein to be added did not substantially increase the performance of the model (Methods). Using this strategy, we generated models to distinguish benign tumors from ovarian cancer stages I-II, III-IV, and I-IV with focus on either sensitivity, specificity or both (Methods). This analysis resulted in 484 unique models with at least one protein not overlapping between each pair of models (mutually exclusive protein signatures). The individual performance in the testpartition of the discovery data for the highest ranking 50 models is shown in Fig. 2a. MUCIN-16, which is the clinically most useful single biomarker today, was the most common protein across cores in the 50 highest ranking models by sorting on their average sensitivity and specificity in the test set from the discovery data (Fig. 2b). Our search strategy specifically excludes sets of protein, and 448 of the detected cores did not contain MUCIN-16. In general, when MUCIN-16 was not included, the models contained a higher number of proteins (9-20) than when it was included (8)(9)(10)(11)(12)(13)(14)(15)(16)(17). In total, 371 proteins were included in a core, or as an additional protein in at least one model. Among the top-ranking 50 cores and models, 19 proteins made up the core-set and additional 115 proteins were selected in the addition phase (Fig. 2b, c). The performance of the 484 models in the test data is listed in Table 2 and a complete account of the models and their performances are listed in Supplementary Data 2.
Model performances in the replication cohorts. The performance of each model created from the discovery data was then evaluated in two replication cohorts (see Methods). The performance ranges of the models are shown in Table 2. The topranking models all contained MUCIN-16, but overall, the average performance of models with MUCIN-16 did not display any       Top-ranking model. The top-ranking of the 484 models included a three-protein core with MUCIN-16, TACSTD2, and SPINT1. This core was extended with 11 additional proteins (FCGR3B, TRAF2, GKN1, CST6, SEMA4C, NID2, CEACAM1, CLEC6A, MILR1, CA3, and CDH3). The distribution of abundance levels for the core proteins in the 1st replication in patients with ovarian cancer stages III-IV and those with benign tumors are shown in Fig. 3a. The core proteins have clearly deviating levels between the cancer cases and controls and this is further illustrated by a principal component analysis (PCA) based on the three core proteins (Fig. 3b). The additional proteins were then selected based on explained variance in the decision after adjustment for the variance explained by the proteins in the core (Methods). Therefore, some of the additional proteins ( Fig. 3c) do not differ in abundance between cases and controls when examined separately, but contribute to the separation when examined in combination with the previously included proteins. The separation between benign tumors and ovarian cancer stages III-IV for the top-ranked 14-protein model is shown in the PCA in Fig Table 3). At a sensitivity above 0.93 in the 1st and 2nd replication cohorts, the model achieved a specificity of 0.27 and 0.28, respectively, and at a specificity above 0.93 a sensitivity of 0.86 and 0.80. Performance measures for the discovery and replication cohorts for all the different stages investigated are listed in Table 3.
Proof-of-concept model for practical use. Several factors in addition to the ability to separate cases and controls may influence the choice of the proteins included in a multiplex test, such as comparison with established tests, measurable concentration range, and sensitivity of proteins to hemolysis of red blood cells causing leakage of proteins into the plasma. Taking these limitations into account, we again started from the top-ranking core of the 484 models and allowed additional selection but restricted the search to proteins present in models with the highest performance in the discovery cohort. This list of possible additions was filtered by removing proteins sensitive to exposure to hemolysate 20 and proteins that occur in much higher concentrations in human plasma than those in the selected core, and therefore would need to be diluted before assayed with PEA 20 . Here, we removed proteins required less than 7.5 mg/ml hemolysate, or that required dilution of 1:2025 and this filtering process retained 414 proteins. We then performed model selection as before based solely on the discovery data (benign tumors versus ovarian cancer stages III-IV) and identified a model consisted of 8 proteins. We finally added three proteins (WFDC2, KRT19, and FR-alpha) based on their previous association with ovarian cancer stages I-II in our modeling, or in the previous literature 18,21,22 .  The selected 11-protein panel consisted of the three core proteins  MUCIN-16, SPINT1, TACSTD2, and the additional proteins  CLEC6A, ICOSLG, MSMB, PROK1, CDH3, WFDC2, KRT19, and FR-alpha. The performance of this 11-protein panel was evaluated in the two replication cohorts (Table 3). In the 1st replication cohort the AUC = 0.90, PPV = 0.94, sensitivity = 0.91 and specificity = 0.95 to distinguish benign tumors from ovarian cancer stage III-IV.
Validation of proof-of-concept model. In order to validate the performance of the 11-protein proof-of-concept model we then developed a custom PEA-assay 23 that measured the 11 proteins and used this to characterize protein abundance levels in a third replication cohort (Tables 1 and 3). Here, calibration samples (see Methods for details) were included in the custom assay in order to have the final readout in absolute protein concentrations rather than NPX. Concentration ranges of the custom assay and performance measures are given in Supplementary Data 4. The third replication cohort was first split into two equal parts, a training set, and a validation set, in terms of size and proportion of benign and malignant (stages I-IV) tumors. A linear regression model was then trained, employing fivefold cross-validation using the training part only. In the training-set this model achieved an AUC of 0.93 (%95 CI 0.88-0.98) in separating benign from stages I-IV (malignant), and a similar performance was observed in the validation set (AUC = 0.95, %95 CI 0.91-1.00, Fig. 4a). Since the performance in the validation set was highly similar to the training set with no statistical difference (DeLong's test, p-value = 0.53), a final model was generated using fivefold crossvalidation with the entire third replication cohort in order to capture as much variation as possible. This model (Supplementary Data 4) achieved an AUC of 0.94 (%95 CI 0.91-0.98) with a sensitivity of 0.86 at a specificity of 0.93 at the point closest to perfect classification (Supplementary Data 5). Next, we trained a model using the 11 proteins and age at diagnosis (Supplementary Data 4, Fig. 4a). As before, there was no difference in AUC for the training and validation sets (DeLong's test, p-value = 0.62) and using the whole cohort, this model achieved an AUC of 0.94 (%95 CI 0.91-0.98) with a sensitivity of 0.85 at a specificity of 0.93 at the point closest to perfect classification. This was determined at a cut-off of 0.3937. We also recorded cut-offs for focus on sensitivity or specificity over 0.98. With this focus, the model achieved sensitivity and specificity of 0.99/0.31 or 0.77/0.98 at cut-offs of 0.2501 and 0.5474, respectively (Table 4). We also trained a model based on WFDC2, Mucin-16 and age at diagnose for comparison and a model based on age and 7 biomarkers (MUCIN-16, TACSTD2, MSMB, PROK1, WFDC2, KRT19, and FR-alpha) that excluded the proteins with the highest technical variation in our custom-assay ( Supplementary Fig. 3). In both these models there was no difference in performance between the training and validation proportions of the data (DeLong's test pvalues = 0.60 and 0.34, respectively) and again, final models were created based on the entire cohort. Performance measures for all 4 models based on the custom assay are available in Supplementary Data 5. In general, the models trained on benign vs malignant (stages I-IV) tumors are better at separating late stages (stages III-IV, AUC-range 0.95-0.98) than early (stages I-II, AUC-range 0.79-0.88) and has lower performance separating stages I-II from stages III-IV (AUC-range 0.74-0.77, Fig. 4b, Supplementary Data 5).
Finally, we included also samples from the third replication cohort that had been diagnosed with borderline ovarian cancer and plotted the prediction scores from the 11 proteins plus age  (Fig. 4c). From Fig. 4c left panel it is clear that only samples with stages II or higher have prediction scores above 0.9 while only benign or borderline samples have a score lower than 0.15. As compared to the use of the WFDC2, MUCIN-16 plus age model (Fig. 4d, left  panel), there is a more than 2-fold increase in the number of women that fall in these prediction score categories, i.e., above 0.9 (n = 34 vs 5) or below 0.15 (n = 15 vs 5). This is also illustrated in Figure 4c, d, right panels, where the distribution of prediction scores for each diagnosis is shown. The cut-offs used for "best point", high sensitivity or high specificity are also illustrated by horizontal lines. The prediction scores from the 11 protein plus age model in late stage ovarian cancers (Stages IV) are significantly higher than that of the 2 protein plus age model, while the predictions scores in the Benign group are significantly lower (Wilcox-test, Bonferroni adjusted cut-off: 0.05/6 = 8.3 × 10 -3 , p-values = 5.5 × 10 −3 (Stage IV) and 2.0 × 10 −6 (Benign), respectively). The prediction scores for the borderline samples fall between the benign and stage I samples (Fig. 4c, right panel) and there is not obvious cut-off that for distinguishing these from either the benign or malignant tumor samples.

Discussion
The current study was designed to identify mutually exclusive predictive biomarker signatures containing up to 20 plasma proteins differentiating benign tumors from different stages of ovarian cancers. We started from a large number of plasma proteins, not selected based on prior association with ovarian cancer, utilizing high-throughput multiplexed proteomics assays. The models were developed using a discovery cohort, and the performance of the models was then evaluated using two replication cohorts. In addition to the 484 biomarker signatures obtained using our computerized strategy, we developed one model considering protein-specific criteria such as abundance range and sensitivity to hemolysis. Finding combinations of predictive, robust, biomarkers is computationally intensive, and with many hundreds of proteins, exhaustive searches of combinations of up to 20 proteins is not feasible. To this end, we developed a strategy for identification of highly predictive unique signatures using hierarchical exclusion of individual proteins. By design, this led to the discovery of many signatures that did not contain MUCIN-16, although this protein was the strongest univariate biomarker among the ones we studied. Overall, the signatures without MUCIN-16 contained a higher number of different proteins than signatures with MUCIN-16, but there was no clear difference in prediction performance of the group with and without MUCIN-16. Our top-ranking model achieved a sensitivity of 0.99 and specificity of 1.0 in the test proportion of the discovery data for separating benign tumors from ovarian cancer stage III-IV. A recent study by Boylan and colleagues 18 reports perfect classification (AUC = 1.0 and AUC = 1.0) of benign tumors and late-stage ovarian cancer and very high performances (AUC = 0.98 and AUC = 0.85) using either MUCIN-16 or WFDC2 alone, by analysis of a single cohort with proteins measured using the same PEA technology as in our study. In our 1st replication cohort, MUCIN-16 alone had lower AUCs of 0.70, 0.65, and 0.51 for separating benign tumors from ovarian cancer stages III-IV, I-IV, and I-II, respectively (Fig. 3f, g). The difference in performance between our study and that by Boylan and colleagues 18 could be due to geographic origin of the cohorts (USA and Sweden), biological nature of the sample (i.e., serum versus plasma), or differences in sample sizes and model evaluations. Boylan and colleagues 18 used 21 women with benign conditions and 21 with late-stage ovarian cancer, as compared to 71 and 56 in our study. Another study by Han and colleagues 16 reported a sensitivity of 0.87 at a specificity of 1.0 for separating benign tumors from ovarian cancer stage I-IV, using the four proteins MUCIN-16, E-CAD, WFDC2, and IL-6. Our top-ranked model had a sensitivity of 0.85 and specificity of 0.91 under the same conditions. Similar to the results of these previous studies 16, 18 , the performance of our models in the test-proportion of the discovery data is very good, with some models showing perfect classification. We also evaluated the selected models in two replication cohorts and found the performance similar, while somewhat lower than in the discovery set. This either implies that there are underlying differences between the cohorts, such as in pre-analytical conditions, or that the models are over-trained with respect to the samples in the discovery cohort. The performance in the test-proportion of the discovery cohort should, therefore, be considered less certain than the results obtained in the replication cohorts. In our study, the benign tumors and the cancer samples from the 2nd replication cohort differ in pre-analytical context, which could explain part of the lower performance as    Fig. 4 Final models' performance in the 3rd replication cohort. a ROC-curves for the test/training (gray), validation (black) and final model (red) for each of the 4 models. The AUC is taken from the performance in the validation partition. All models were trained on benign vs malign (stages I-IV) samples. b ROC-curves for the 4 final models when evaluated on subsets of stages. c Distribution of outcomes in ranges of prediction scores (left) for the '11-plex + Age' model and distribution of prediction scores for each outcome (right). In the right panel, the three cut-offs for 'best-point (BP)', 'focus on sensitivity (FSE, sensitivity ≥0.98)' and 'focus on specificity (FSP, specificity ≥0.98)' are illustrated by horizontal dashed lines. The solid black lines indicate the mean prediction score in each outcome group. d As (c) but for the '2-plex + Age' model compared to using the 1st replication cohort. We then implemented our final, proof-of-concept, model into a custom assay reporting in absolute protein concentrations. Since the readout for the custom assay differ from the standard PEA-readout used in the discovery and first two replication cohorts the model coefficients needed to be retrained. This could lead to overfitting of the model to the investigated cohort. To try and overcome this, we employed a 5-fold cross-validation schema to train the final model using 50% of the third replication cohort and kept a withheld validation proportion to estimate the performance on samples not used in the training. As the performance in these two subsets did not essentially differ, the retraining of the models does not seem to be overfitted with respect to the samples used. We, therefore, used a 5-fold cross-validation schema using the entire third replication cohort for the final model fitting. This does however not necessarily guarantee that the performance of the model will remain the same in additional cohorts with e.g., different ethnic compositions or when applied to samples with large differences in pre-analytical handling. We also noted that the performance of our model is slightly better in the third replication cohort, where the AUC was 0.95 (%95 CI 0.91-1.00) in the validation-proportion, as compared to the performance in the test-proportion of the discovery and first two replication cohorts, where the model had AUCs ranging from 0.82-0.88. This could be due to the wider dynamic range of the custom assay, but indicates that the performance of the model is robust. A second contributing factor could be that cases are compared to a group diagnosed with benign tumors, representing heterogenous conditions. In the third replication cohort used here, the most common benign tumors were diagnosed as 'Serous cyst' (29.2%), followed by 'Mucinous cyst' (20.8%), 'Mesonephric or inclusion cyst' (17.0%), 'Stromal cyst' (11.3%), 'Teratoma' (11.3%), 'Endometrioma' (6.6%) and finally, 'Myoma' (3.8%). This highlights the importance of understanding the context in which a biomarker test is to be used as compared to the setting used for development of the model.
We note that there is a group (n = 3) of late=stage cancers that gets low prediction scores using our model (Fig. 4c, right panel, stage IV). Comparing this group with low scores to the others diagnosed with the same stage (n = 21), we found no statistical difference (wilcox.test, two-sided) in either age (p = 0.50), BMI (p = 0.93). These 3 samples do however have lower clinical Mucin-16 values (p = 0.011) with a median value of 49.5 U/ml compared to the group with high prediction scores that have a median value of 1312.5 U/ml. From this observation, it is clear that there are samples that will still be hard to find or distinguish using the biomarker model presented here.
Some of the proteins in the 11-protein panel, aside from MUCIN-16 and WFDC2 (HE4), have previously been associated to ovarian cancer. TACSTD2 (tumor-associated calcium signal transducer 2) expression has been associated with decreased survival of ovarian cancer and proposed as a prognostic factor 24 , and a biomarker for targeted therapy 25 . SPINT1 (matriptase, HAI-2) is a type II transmembrane serine protease expressed on epithelial ovarian tumor cells. In advanced stage ovarian tumors, matriptase is expressed in the absence of HAI-1, its inhibitor, indicating that an imbalance between matriptase and HAI-1 is important in the development of ovarian disease 26 . Matriptase has also been proposed as an adjuvant therapeutic target for inhibiting ovarian cancer metastasis 27 . Analysis of circulating tumor cell RNA has seen an increased expression of KRT19 (keratin, type I cytoskeletal 19), but no studies of the plasma protein level have been performed 28 . FR-alpha (folate receptor alpha, FR-alpha) is a GPI-anchored glycoprotein and serum levels has been found to be elevated in ovarian cancer patients 29,30 and correlated to both clinical stage and histological type 31,32 . Finally, decreased expression of MSMB (beta-microseminoprotein) has been shown to correlate with reduced survival of invasive ovarian cancer 33 .
In order to study the potential of using the protein panels in diagnosis or screening, we determined their performance after tuning the models prioritizing either specificity or sensitivity. A  (Table 4). Further studies are needed using samples collected at different time-points prior to diagnosis to evaluate the potential of using the panel in population screening. In screening, the aim is not to distinguish between benign tumors and cancer, but between healthy women and cancer, and it is likely that there will be more pronounced differences when comparing to a healthy population. In support of this notion, we have shown in a previous study 19 that the sensitivity to distinguish population controls from stage I-IV cancer was 0.62 and stage III-IV was 0.78. Future studies including age-matched population controls have to be conducted to determine the performance of the final 11-protein biomarker set in population screening.
In summary, we have developed a strategy for the identification of protein cores that resulted in mutually exclusive combinations of protein signatures that can separate between benign tumors and ovarian cancers. The results demonstrate the ability to achieve high performance characteristics without including MUCIN-16. We also show that broad searches for novel combinations of protein biomarkers that on their own are not necessarily good predictors is a powerful approach for finding relevant biomarkers for disease.

Methods
Samples. Plasma samples of women with benign and malignant ovarian tumors were collected from either the U-CAN collection 35 at Uppsala Biobank, Uppsala University, Sweden or the Gynaecology tumor biobank 36 at Sahlgrenska University Hospital, Göteborg, Sweden, as previously described 19 (Table 1). All samples from the biobanks were included based on ovarian cancer diagnosis or patients that had been surgically diagnosed with benign conditions based on suspicion of ovarian cancer. Since the distribution of individual protein abundance levels in the samples was unknown and all available samples were to be extracted from the biobanks, no univariate power calculations were carried out prior sample collection. Patients that had received neoadjuvant treatment prior to surgery were excluded from the analysis. The discovery cohort consisted of 90 patients diagnosed with benign tumors and 79 patients with ovarian cancer stages I-IV. Samples were collected at time for primary surgery under full anesthesia but before incision. All women had at least 6 h fasting before sample collection. The first replication cohort consisted of 71 patients diagnosed with benign tumors and 100 patients with ovarian cancer stages I-IV and were collected under the same conditions as the discovery cohort. The second replication cohort consisted of 77 patients with ovarian cancer stages I-IV. The second replication samples were collected at time of diagnosis, from awake patients, by a trained nurse. The third replication cohort consisted of 106 patients with benign conditions, 28 with borderline diagnosis and 93 with ovarian cancer stages I-IV. All samples from the third replication cohort were collected at time of diagnosis, from awake patients, by a trained nurse.
Ethical compliance. Written consent was obtained from all participants before the samples used here were deposited in the biobanks. One woman with her sample in the Gynaecology tumor biobank at Sahlgrenska University Hospital withdrew her consent after the samples had been acquired from the biobank and this sample was excluded from the analyses conducted here. The study was approved by the Regional Ethics Committee in Uppsala (Dnr: 2016/145) and Göteborg (Dnr: 201-15).
Protein measurements. We have previously quantified 460 proteins from the Olink Multiplex Cardiovascular II, Cardiovascular III, Inflammation, Neurology and Oncology panels in the discovery cohort using the proximity extension assay (PEA) 19 . The PEA is an affinity-based assay which characterizes abundance levels of a pre-determined sets of proteins. For each measured protein, a pair of oligonucleotide-labeled antibodies probes target the protein and if both probes are in close proximity, a PCR target sequence is formed by a proximity-dependent DNA polymerization event. The resulting sequence is then detected and quantified using standard real-time PCR. Forty-two of these have also been quantified in the first and second replication cohorts using PEA in two custom-design 21-plex panels 19,23 . The 42 proteins in the custom-design 21-plex proteins were selected from the 460 based on their relationship with gynecological cancers as described earlier 19 . Here, an additional 552 proteins were analyzed using 6 additional PEApanels (the Olink Multiplex Cardiometabolic, Cell Regulation, Development, Immune Response, Metabolism and Organ Damage) and real-time PCR using the Fluidigm BioMark™ HD real-time PCR platform 37 in the discovery and replication cohorts. A complete list of the 1012 assays corresponding to 981 unique proteins are listed in Supplementary Data 1. The samples were randomized across plates and normalized for any plate effects using the built-in inter-plate controls according to manufacturers' recommendations. The PEA gives abundance levels in NPX (Normalized Protein eXpression) that is on log2-scale. Each assay has an experimentally determined lower limit of detection (LOD) defined as three standard deviation above noise level. Here, all assay values below LOD were replaced with the defined LOD-value. Samples and proteins that did not pass the quality control were removed. After quality control, 42 proteins from the custom panels and 551 from the additional 6 panels were kept. Assay characteristics including detection limits calculations, assay performance and validations are available from the manufacturer (www.olink.com).
The proteins from the proof-of-concept model were quantified using a custom 11-plex assay in the analysis of the third replication cohort. Description of the development process for combining protein assays into custom multiplexed reactions and the technology behind having final readout in absolute concentrations have been published earlier in a white-paper 23 . In brief, normal and disease state occurring protein concentration ranges in circulating plasma are accounted for and the dynamic range for each individual assay optimized to take this into account. In addition, standard curves for all individual proteins have to be established by analysis of a wide range of recombinant antigen concentrations. In the final test, triplicate measures of calibrators at 4 known concentrations (blank, low, mid and high concentrations) of each protein were included in each run. These were used for normalization and the normal PEA-readout (NPX) and to estimate absolute concentrations by comparing to established standard curves. Here, each sample in the third replication cohort was run in duplicates or triplicates and a mean value over the replicates was used in the analyses. If all readouts were below or above the limits of detections, no mean-value was calculated but instead replaced with the LOD-values as described above. The data were then transformed to log2-scale.
Model generation. In order to identify non-overlapping models, we devised a twostage strategy. In brief, the strategy first identifies a small set of proteins, a 'core', typically consisting of 2-6 proteins with relatively high performance. This core is then extended with additional proteins creating a full model of up to 20 proteins. The whole process is then repeated excluding one protein at a time from the core ensuring that the "next" core does not overlap with a previously detected core. In detail, the strategy breaks down into the following steps. First, the discovery set was randomly split into a training and a test set with 50% of the samples in each. A linear regression model was generated on the training set using the R-package 'glmnet' 38 with 'alpha' = 0.9 and optimized using 10-fold cross-validation in the training-set as implemented by the 'cv.glmnet'-function. The training/test-split was then repeated 50 times and a 'core' was then defined as consisting of the proteins present in at least 70% of the generated models from the 50 runs. In order to find mutually exclusive cores, the core-generating process was repeated in a recursive manner, excluding one protein at a time from the previous core from the available protein pool presented to 'cv.glmnet'. For each newly generated core, the process was then repeated unless the core contained more than a specific number of proteins or had a sensitivity or specificity below a specified cut-off. For each new search, all previously excluded proteins were made unavailable to the current selection. For computational reasons, the searches were cancelled if more than 20 proteins had been excluded. The process of defining the cores is outlined in Fig. 1a, b. In the second step of the strategy, each core was extended to a full model by adding additional proteins. These additional proteins were selected from the pool of available proteins in a stepwise forward selection. First, the variance in the decision explained by the core was removed by keeping the residuals from a linear model generated with the protein values in the core as input and the decision as output. Then, the variance explained by any other available protein in this adjusted outcome was calculated and the protein explaining the most remaining variance in the decision was added to the model and the contribution of that protein to the explained variance in the decision was adjusted for as described above. The addition of proteins was repeated until the best candidate protein did not explain more than 1% of remaining variance or the total number of proteins in the model exceeded 20 proteins (Fig. 1c). Different parameter settings were used to identify models that could have either high sensitivity or high specificity or both, depending on the final application and to account for the fact that it is much more difficult to separate stages I and II from benign tumors compared to stages III and IV. For benign tumors versus stages I-II, the core had to be 2-6 proteins in length and have a sensitivity of at least 0.8, or a sensitivity and specificity above 0.6. For stages III-IV, the allowed core size was 2-5, and had to have a sensitivity above 0.8 or a sensitivity and specificity above 0.7. Finally, for stages I-IV the allowed core size was 2-6 proteins, and the models were required to have a sensitivity above 0.8 or a sensitivity and specificity above 0.7.
Performance estimates in the first two replication cohorts. As the second replication cohorts lacked patients with benign tumors, the benign tumors from the first replication cohort was used in both replication cohorts. Due to the relativeness of the NPX-scale and that the data in the discovery and replication sets were generated in different laboratory analysis runs, including parts of the data that were generated using a custom-panel 19 , the replication cohorts were split into a test and training set (50-50) and model coefficients were re-determined with the 'lm' function in R. The performance of the models was then estimated in the test proportion of each replication cohort separately. This was repeated 50 times for each model and the mean and standard deviation of sensitivity, specificity, positive and negative predictive values (PPV/NPV) and AUCs were recorded. PPV was calculated as TP/(TP + FP) where TP is the number of true positives and FP the number of false positives. NPV was calculated as TN/ (TN + FN) where TN is the number of true negatives and FN is the number of false negatives. The sensitivity and specificity were calculated at three different points on the ROC curve. The 'best point,' defined as the closest (Euclidean distance) point to perfect classification, and by selecting a minimum sensitivity or specificity of 0.93.
Fixation of the models in absolute concentration. The third replication cohort and was first split into two equal parts, a training set and a validation set, in terms of size and proportion of benign and malign (stages I-IV) samples. A linear regression model was then trained employing fivefold cross-validation using the training part only. The models were trained using the 'cv.glmnet' with alpha = 0. The performance of the model was then evaluated on the validation set. Difference in performance (AUC) from the training and validation was evaluated by a DeLong-test as implemented in the R-package 'pRoc' 39 . When no difference in performance was detected between the training and validation sets, a final model was generated fivefold cross-validation as above using all samples with benign or malign (stages I-IV) status. Model coefficients was extracted from the cv-stage at a λ within one standard error of the minimum ('lambda.1se' in 'cv.glmnet'). Rpackage 'caret' 40 was used to train the final model. In the modeling, the raw output from the linear regression was transformed to lie between 0 and 1 by a linkfunction, f x ð Þ ¼ e x ðe x þ1Þ .
Statistics and reproducibility. All calculations were done using R 41 (version 3.4.2). Models were fitted using functions from the R-package 'glmnet' 38 (version 2.0-16), 'caret' 40 (version 6.0-80) and 'pROC' 39 (version 1.12.1). Performance measures for the PEA-panels are available from the manufacturer's webpage, www.olink.com. A full description of the custom PEA assay including assay performance measures, concentration ranges of all ingoing variables, all model coefficients, cut-offs, and prediction performance measures are given in Supplementary Data 4 and 5.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The complete data including replicate measurements in absolute concentrations for the 11-