A non-invasive method for concurrent detection of early-stage women-specific cancers

We integrated untargeted serum metabolomics, using high-resolution mass spectrometry, with machine learning-based data analysis to accurately detect early stages of the women-specific cancers of the breast, endometrium, cervix, and ovary across diverse age groups and ethnicities. A two-step approach was employed wherein cancer-positive samples were first identified as a group. A second, multi-class algorithm then distinguished between the individual cancers of the group. The approach yielded high detection sensitivity and specificity, highlighting its utility for the development of multi-cancer detection tests, especially for early-stage cancers.

Data generation and analysis. Positive ion mode UPLC-MS/MS interrogation of the serum metabolome of the samples described in Table 1 resulted in the detection of > 20,000 spectral features ($R_t$, m/z pairs). The untargeted metabolomics approach (Fig. 1) generated a large list of metabolites, which was further divided into subsets for normal control, endometrial cancer, breast cancer, cervical cancer, and ovarian cancer, with 5895, 5971, 5982, 6300 and 6336 metabolites, respectively. In total, 7596 unique metabolites were identified in our study. The distribution of the number of unique metabolites identified in samples from the normal controls and the individual cancer types, as a function of age group, is shown in Fig. 2. Subsequent processing of this data …

$$\text{y\_score} = x_0 + x_1 I_1 + x_2 I_2 + x_3 I_3 + \cdots + x_{2823} I_{2823}$$

Here, $x_0$ is a constant and $I_i$ (1 ≤ i ≤ 2823) is the intensity of metabolite i in the respective sample. Supplementary Figure S1 gives the value of the coefficient $x_i$ (1 ≤ i ≤ 2823) for each metabolite. The evaluation of the trained model on the test set, for a single partition of the data, is shown in Fig. 5B. The scatter plot shows the model score for normal controls and for BECO (breast, endometrial, cervical and ovarian) cancer cases. The model scores clearly differ between the two groups, and applying a threshold of 5 to separate them yields the confusion matrix shown in Fig. 5B. Sensitivity, specificity and accuracy were calculated from the formulae given in "Methods", giving a sensitivity of 98%, a specificity of 98.3%, and an accuracy of 98%.

Differentiating endometrial, breast, cervical and ovarian cancers from each other. In the second step, another multiclass AI model was layered on top of the first model. It acted on the samples predicted as cancer by the first model (breast, endometrial, cervical or ovarian) and gave a multiclass score to each sample: one score for each disease class, denoting the probability of the sample belonging to that class. … control samples were added to the test set. Then, a one-versus-rest (OVR) multiclass classification model was built using the training samples to give AI model-2. Following this, a two-layered modeling scheme was applied to the test set. First, AI model-1, which differentiates BECO from normal samples, was applied to the test set. Then, AI model-2 was applied to the resulting predicted BECO samples. This resulted in four scores for each sample, with each score giving the probability of the sample belonging to one of the four classes.
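The two-layered scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: the coefficients are toy values for three metabolites, the function names are hypothetical, the direction of the threshold (scores at or above 5 are called cancer) is an assumption, and the use of a sigmoid to turn each one-versus-rest linear score into a probability-like value is the usual logistic regression convention rather than something stated in the text.

```python
import math

def linear_score(intercept, coefs, intensities):
    """Intercept plus the weighted sum of metabolite intensities."""
    return intercept + sum(c * i for c, i in zip(coefs, intensities))

def sigmoid(z):
    """Map a linear score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def two_layer_predict(intensities, model1, model2, threshold=5.0):
    """Layer 1: cancer versus normal; layer 2: one-versus-rest among cancers.

    model1 is (intercept, coefs); model2 maps class name -> (intercept, coefs).
    Assumption: a layer-1 score at or above the threshold means "cancer".
    """
    score = linear_score(model1[0], model1[1], intensities)
    if score < threshold:
        return "normal", score, {}
    class_scores = {
        name: sigmoid(linear_score(b, w, intensities))
        for name, (b, w) in model2.items()
    }
    best = max(class_scores, key=class_scores.get)
    return best, score, class_scores

# Toy coefficients for three metabolites (illustrative only, not from the paper).
MODEL1 = (0.0, [1.0, 2.0, 0.0])
MODEL2 = {
    "breast": (0.0, [0.0, 0.0, 1.0]),
    "endometrial": (0.0, [0.0, 0.0, -1.0]),
    "cervical": (-1.0, [0.0, 0.0, 0.0]),
    "ovarian": (-2.0, [0.0, 0.0, 0.0]),
}

label_a, score_a, _ = two_layer_predict([1.0, 1.0, 0.0], MODEL1, MODEL2)
label_b, score_b, probs_b = two_layer_predict([2.0, 2.0, 1.0], MODEL1, MODEL2)
print(label_a, score_a)  # normal 3.0
print(label_b, score_b)  # breast 6.0
```

Only samples that pass the first layer receive the four class scores, which mirrors the order in which the two models are applied to the test set.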
For the multiclass model (AI model-2), a one-versus-rest (OVR) classifier was built using the training samples. The trained algorithm computes four scores for each sample according to the formulae below:

$$\text{y\_score1} = y_0 + y_1 I_1 + y_2 I_2 + y_3 I_3 + \cdots + y_{2823} I_{2823}$$
$$\text{y\_score2} = z_0 + z_1 I_1 + z_2 I_2 + z_3 I_3 + \cdots + z_{2823} I_{2823}$$
$$\text{y\_score3} = a_0 + a_1 I_1 + a_2 I_2 + a_3 I_3 + \cdots + a_{2823} I_{2823}$$
$$\text{y\_score4} = b_0 + b_1 I_1 + b_2 I_2 + b_3 I_3 + \cdots + b_{2823} I_{2823}$$

Here, $y_0$, $z_0$, $a_0$ and $b_0$ are constants and $I_i$ (1 ≤ i ≤ 2823) is the intensity of metabolite i in the respective sample. Supplementary Figure S2 gives the values of the coefficients $y_i$, $z_i$, $a_i$ and $b_i$ (1 ≤ i ≤ 2823) for each metabolite.

To determine how well our multiclass model differentiates between the individual disease categories of the BECO group of samples, as well as from normal controls, we plotted the scores obtained from the multiclass model. Figure 7A shows the multiclass model Endometrial Score for the Endometrial Cancer samples and for the combined set of Breast, Cervical and Ovarian (BCO) Cancer samples. The model scores clearly differ between the Endometrial and BCO Cancer samples, and applying a threshold to separate the two sets yields the confusion matrix shown in Fig. 7A. Here, the normal samples were also included in the control group to obtain the sensitivity and specificity values for Endometrial cancer versus the rest of the groups. Sensitivity, specificity and accuracy were calculated from the formulae given in "Methods", giving a sensitivity of 87%, a specificity of 93%, and an accuracy of 91.6%.

We next plotted the multiclass model Breast Cancer Score for the Breast Cancer samples versus the scores from the combined set of Endometrial, Cervical and Ovarian (ECO) Cancer samples (Fig. 7B). The model scores clearly differ between the Breast Cancer and ECO Cancer samples, and applying a threshold to separate the two sets yields the confusion matrix shown in Fig. 7B. Here, the normal samples were also included in the control group to obtain the sensitivity and specificity values for Breast cancer versus the remaining groups. Sensitivity, specificity and accuracy were calculated from the formulae given in "Methods" (AI modeling of the data), giving a sensitivity of 93%, a specificity of 95%, and an accuracy of 94.4%.

Figure 7C shows a plot of the multiclass model in which the Cervical Score for the Cervical Cancer samples is compared with the scores from the combined set of Endometrial, Breast and Ovarian (EBO) Cancer samples. The model scores clearly differ between the Cervical and EBO Cancer samples, and applying a threshold to separate the two sets yields the confusion matrix shown. Here, the normal control samples were also added to the control group to obtain the sensitivity and specificity values for Cervical cancer versus the rest. Sensitivity, specificity and accuracy were calculated from the formulae given in "Methods", giving a sensitivity of 87%, a specificity of 90%, and an accuracy of 87.6%.

Figure 7D shows a plot of the multiclass model Ovarian Score for the Ovarian Cancer samples versus the scores from the combined set of Endometrial, Breast and Cervical (EBC) Cancer samples. The model scores clearly differ between the Ovarian Cancer and EBC Cancer samples, and applying a threshold to separate the two sets yields the confusion matrix shown. Here, the normal samples were also added to the control group to obtain the sensitivity and specificity values for Ovarian cancer versus the rest. Sensitivity, specificity and accuracy were calculated from the formulae given in "Methods", giving a sensitivity of 86%, a specificity of 93%, and an accuracy of 92%.
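Each one-versus-rest comparison above reduces to thresholding one class score and counting outcomes against a "rest" group that also contains the normal controls. The sketch below shows that bookkeeping under stated assumptions: the scores, labels and the threshold of 0.5 are invented for illustration, and the function name is hypothetical.

```python
def one_vs_rest_counts(scores, labels, target, threshold):
    """Confusion counts when samples scoring >= threshold are called `target`.

    Every label other than `target` (other cancers and normal controls)
    is treated as part of the "rest" group.
    """
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        actual = label == target
        if predicted and actual:
            counts["TP"] += 1
        elif predicted and not actual:
            counts["FP"] += 1
        elif not predicted and actual:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts

# Invented endometrial scores for seven samples (not data from the study).
scores = [0.9, 0.8, 0.4, 0.3, 0.7, 0.1, 0.2]
labels = ["endometrial", "endometrial", "endometrial",
          "breast", "cervical", "ovarian", "normal"]

print(one_vs_rest_counts(scores, labels, "endometrial", 0.5))
# {'TP': 2, 'FP': 1, 'TN': 3, 'FN': 1}
```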
Feature ranking and literature validation of select features. Since the features in our matrix correspond to named metabolites from the HMDB database, the resulting model is explainable, in that mechanistic and other relevant insights related to the women-specific cancers can be extracted from it. To enable this, we performed feature ranking, in which the weights of the individual features from the model were first sorted. Features were then added one at a time until the desired sensitivity and specificity were obtained. The top 100 metabolites obtained for both AI models, AI model-1 and AI model-2, are presented in Supplementary Table S3. Table 2 lists twenty-five of the top 100 metabolites identified in AI model-1, which contribute to distinguishing the women-specific cancer (BECO) group from normal controls. The list comprises diverse classes of metabolites, including lipids, nucleosides/nucleotides, amino acids and modified amino acids, acylcarnitines, steroids and dipeptides. While all of these metabolites have been implicated in either tumor growth or progression (Table 2), they span a multiplicity of both anabolic and catabolic pathways (Table 3). This is consistent with emerging evidence that tumor cells have highly complex metabolic requirements, and that numerous pathways are required to complement glucose- and glutamine-dependent biomass production 38. Representative metabolites that comprise the signature distinguishing the individual cancers (i.e. breast, endometrial, cervical, or ovarian cancer) are listed in Supplementary Tables 4-7.
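The ranking procedure described above (sort the model weights, then add one feature at a time until the desired performance is reached) can be sketched as follows. This is an illustration only: sorting by absolute weight is an assumption, the metabolite names and weights are placeholders, and `toy_performance` is a hypothetical stand-in for refitting and evaluating a model on the selected features.

```python
def rank_features(weights):
    """Feature names sorted by descending absolute model weight (assumed criterion)."""
    return sorted(weights, key=lambda name: abs(weights[name]), reverse=True)

def select_top_features(weights, evaluate, target):
    """Add ranked features one at a time until evaluate(selected) >= target."""
    selected = []
    for name in rank_features(weights):
        selected.append(name)
        if evaluate(selected) >= target:
            break
    return selected

# Placeholder weights for four features (not coefficients from the paper).
weights = {"m1": 0.2, "m2": -1.5, "m3": 0.9, "m4": 0.05}

# Hypothetical performance as a function of how many features are used;
# in practice this would retrain and score a model on a validation set.
PERFORMANCE_BY_COUNT = {1: 0.70, 2: 0.85, 3: 0.95, 4: 0.96}

def toy_performance(selected):
    return PERFORMANCE_BY_COUNT[len(selected)]

print(rank_features(weights))                               # ['m2', 'm3', 'm1', 'm4']
print(select_top_features(weights, toy_performance, 0.90))  # ['m2', 'm3', 'm1']
```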

Discussion
With the increasing burden of cancer mortality in women 39,40, early detection to improve treatment outcomes has become a priority. Unfortunately, reliable and accurate methods for early detection of most cancers are still not available. The problem is further exacerbated, particularly in middle- to low-income settings, by the relatively high cost of cancer screening, especially because current methods largely allow for diagnosing only one cancer at a time. In contrast, a method that can simultaneously detect multiple cancers at an early stage would find greater applicability simply because fewer diagnostic procedures would need to be undertaken, with an associated reduction in cost. The utility of such a multi-cancer detection test would be further enhanced if it involved a non-invasive procedure and if its accuracy were high.
In the present report we describe an integrated method that can simultaneously detect Stage 0/I of all four of the women-specific cancers with high sensitivity and specificity. The cancers detected are breast, endometrial, cervical, and ovarian cancers. Our approach combined an untargeted UHPLC-MS/MS analysis of the serum metabolome with subsequent interrogation of the data using machine learning algorithms. A key aspect of our data analysis pipeline was the generation of a matrix wherein spectral features from the mass spectrometry profiles of the samples were translated into known metabolites identified using the HMDB database. A PLS-DA plot revealed that the information content in the matrix was sufficient to clearly distinguish the cancer groups from the normal control group, and also to achieve at least a reasonable degree of resolution between the individual cancer subsets. Consequently, this matrix provided the basis for developing an AI algorithm for early-stage cancer detection. For this, we employed a two-step strategy. In the first step we developed an algorithm (AI model-1) for distinguishing between the cancer samples and the normal control samples. As our results show, we were successful in identifying the BECO group of samples with high sensitivity and specificity. Subsequently, a second AI model (AI model-2) was developed to distinguish between the individual cancers of the BECO group. For this, a one-versus-rest (OVR) multiclass classification model was developed and, as shown here, this model yielded a reasonably high accuracy in identifying the tissue of origin of the cancer in samples of the BECO group. Efforts are currently underway to further improve the accuracy of AI model-2.
Thus, our studies show that combining untargeted metabolomics with machine learning approaches for data analysis provides an attractive way forward for developing highly accurate multi-cancer detection approaches. Especially noteworthy is the high detection accuracy obtained for early-stage cancers, which is significantly superior to that of the other approaches being explored to date 42–45. It will be of interest to determine whether the scope of this approach can be expanded to include simultaneous detection of additional cancers. In addition, a more extensive sampling of patients across a wider diversity of racial and ethnic groups would help in determining the robustness of the approach.

Methods
Study design and sample collection. All samples used in this study were purchased from three sepa- … and HER2 markers in the donors. To serve as controls in our assay, we also procured additional sera from normal volunteers. The total number of samples across all groups was 1369, and they were stored at −20 °C for the short term prior to use.

Ultrahigh performance liquid chromatography-tandem mass spectrometry (UHPLC-MS/MS). Untargeted metabolomics was performed using a Dionex LC system (Ultimate 3000) coupled online with a QExactive Plus (Thermo Scientific). Each extracted metabolite sample was injected (10 µL for positive ESI ionization) onto an Acquity UPLC HSS T3 column from Waters (1.8 micron, 2.1 × 100 mm, Part No. 186003539), which was heated to 37 °C. The flow rate was 0.3 ml/min. Mobile phase A was water + 0.1% formic acid, and mobile phase B was methanol + 0.1% formic acid. The mobile phase was held isocratic at 5% B for 1 min, increased to 95% B in 7 min, held at 95% B for another 2 min, and returned to 5% B in 14 min. The ESI voltage was 4 kV. The mass accuracy of the QExactive mass spectrometer was less than 5 ppm, and the instrument was calibrated on the recommended schedule prior to each batch run. The mass scan range was 66.7 to 1000 Da, and the resolution was set to 70,000. The maximum injection time for the Orbitrap was 100 ms, and the AGC target was set to 1e6.
Table 3. Diverse biological processes are influenced by the perturbed metabolites specific for the BECO cancer group. MetaboAnalyst software (https://www.metaboanalyst.ca/) was used to extract the list of biological processes associated with the metabolites of interest. Briefly, the set of 25 metabolite names was entered into MetaboAnalyst, which cross-referenced them against the KEGG pathway database and the HMDB database to obtain the KEGG and HMDB IDs, respectively, of the input metabolites. Following this, the associated biological processes were extracted.

Quality assessment and quality control. Several types of controls were analyzed in concert with the experimental samples. Blank gradient runs alternated with the sample runs. A pooled QC sample, generated by taking a small volume of each experimental sample, served as a technical replicate throughout the data set; it also allowed instrument performance monitoring and aided chromatographic alignment. Mass accuracy of the instrument was checked every third day using the vendor-specific calibrant (Thermo Fisher Scientific, Breda, The Netherlands). Overall process variability was determined by calculating the median RSD for all endogenous metabolites (i.e., non-instrument standards) present in 100% of the pooled matrix samples. Experimental samples were randomized across the platform run, with QC samples spaced evenly (every 50th sample) among the injections (Fig. 8).
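The median RSD used under quality control to gauge overall process variability can be computed as in the sketch below. It assumes the RSD is the sample standard deviation divided by the mean, expressed in percent, and that a missing detection is recorded as `None`; the metabolite names and intensities are invented for illustration.

```python
from statistics import mean, median, stdev

def rsd_percent(values):
    """Relative standard deviation: sample SD divided by the mean, in percent."""
    return 100.0 * stdev(values) / mean(values)

def median_rsd(qc_intensities):
    """Median RSD over metabolites detected in 100% of the pooled QC injections.

    qc_intensities maps metabolite name -> list of intensities, one per QC
    injection, with None where the metabolite was not detected.
    """
    rsds = [
        rsd_percent(values)
        for values in qc_intensities.values()
        if all(v is not None for v in values)
    ]
    return median(rsds)

# Invented QC intensities for three metabolites across three QC injections.
qc = {
    "metabolite_a": [100.0, 110.0, 90.0],   # RSD of 10%
    "metabolite_b": [200.0, 200.0, 200.0],  # RSD of 0%
    "metabolite_c": [50.0, None, 55.0],     # excluded: missing in one QC run
}

print(median_rsd(qc))
```

Restricting the calculation to metabolites found in every QC injection keeps missing detections from inflating the variability estimate.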
Data processing. The mass spectrometry data were first preprocessed as shown schematically in Fig. 3. The individual steps were as follows:

Incorporating mass errors in the data. Mass errors are known to be present in metabolomics data 35. This means that the same identified metabolite can have a slightly different mass in different samples, which creates problems when the intensity of the same metabolite has to be compared across samples, as required in the downstream AI-based analysis. Usually, a mass window of fixed size is used to align the samples; here, we instead used a parts-per-million (ppm) error-based approach. Briefly, we adapted the virtual lock mass (VLM) approach 35, which is based on the principle that mass errors are known to increase with mass 35. We adapted this approach to our datasets by combining the traditional VLM approach with metabolite identification from the HMDB database. Specifically, the VLM boxes were defined using the masses of metabolites identified by HMDB database search across multiple samples. This resulted in an initial matrix of 6893 metabolites or features. From this, we next removed all features that corresponded to plant products or to drugs and their metabolites, which reduced the number of features in the resultant matrix to 5558.
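The difference between a fixed mass window and a ppm-based window can be shown with a short sketch. This is a simplified stand-in rather than the VLM algorithm itself: the reference masses are hypothetical round numbers, the function names are illustrative, and the 5 ppm tolerance is an assumption borrowed from the stated instrument mass accuracy.

```python
def ppm_error(observed_mz, reference_mz):
    """Signed mass error of an observed m/z relative to a reference, in ppm."""
    return (observed_mz - reference_mz) / reference_mz * 1e6

def match_to_reference(observed_mz, reference_masses, tol_ppm=5.0):
    """Name of the closest reference mass within tol_ppm, or None if none fits."""
    best_name, best_err = None, None
    for name, ref_mz in reference_masses.items():
        err = abs(ppm_error(observed_mz, ref_mz))
        if err <= tol_ppm and (best_err is None or err < best_err):
            best_name, best_err = name, err
    return best_name

# Hypothetical reference masses (not real HMDB entries).
REFERENCES = {"metabolite_A": 200.0, "metabolite_B": 800.0}

print(match_to_reference(200.0008, REFERENCES))  # metabolite_A (about 4 ppm)
print(match_to_reference(800.003, REFERENCES))   # metabolite_B (about 3.75 ppm)
print(match_to_reference(200.003, REFERENCES))   # None (about 15 ppm)

# A fixed window of 0.001 Da would have rejected the 800.003 observation,
# even though its relative error is smaller than that of 200.0008.
print(abs(800.003 - 800.0) > 0.001)              # True
```

Because a ppm tolerance scales with mass, the acceptance window widens for heavier ions, which matches the principle that absolute mass errors grow with mass.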
Metabolite ion filtering. The metabolite ions were filtered based on how frequently they were present in the individual samples. A 20% cutoff was used, wherein metabolites that were present in less than 20% of the samples were excluded. This resulted in a final matrix size of 2764 features, which was then taken forward for our subsequent analysis.
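The frequency filter can be expressed in a few lines. The sketch below assumes a metabolite-by-sample layout with `None` marking a missing detection; the metabolite names and intensities are invented, and a metabolite present in exactly 20% of samples is kept, since only those below the cutoff are excluded.

```python
def filter_by_presence(matrix, min_fraction=0.20):
    """Keep metabolites detected in at least min_fraction of the samples.

    matrix maps metabolite name -> list of intensities (None = not detected).
    """
    kept = {}
    for name, values in matrix.items():
        present = sum(v is not None for v in values)
        if present / len(values) >= min_fraction:
            kept[name] = values
    return kept

# Invented intensities for three metabolites across ten samples.
matrix = {
    "m_common": [1.0] * 10,              # present in 100% of samples
    "m_edge": [1.0, 2.0] + [None] * 8,   # present in exactly 20%
    "m_rare": [1.0] + [None] * 9,        # present in 10%, excluded
}

print(sorted(filter_by_presence(matrix)))  # ['m_common', 'm_edge']
```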
Data normalization. Because metabolomic data vary with the operating conditions of the mass spectrometer, normalization methods are needed to minimize these variations 36. Several normalization methods were tried, including Quantile Normalization, Variance Stabilization Normalization, Best Normalization and Probabilistic Quotient Normalization. Quantile Normalization (QN) was selected because it performed best across the various conditions of the experiment. The QN method was further adapted to our datasets to enable normalization of new samples with respect to the training datasets, so that one sample can be tested at a time.
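One way to normalize a single new sample against the training data is to freeze a reference distribution from the training set and map the new sample's ranks onto it. The sketch below illustrates that idea only; it is an assumption about how such an adaptation could work, not the authors' procedure, and it ignores ties and missing values for brevity.

```python
def reference_distribution(training_samples):
    """Mean of the k-th smallest intensity across training samples, per rank k."""
    sorted_rows = [sorted(sample) for sample in training_samples]
    n = len(sorted_rows)
    return [sum(column) / n for column in zip(*sorted_rows)]

def quantile_normalize(sample, reference):
    """Replace each value by the reference value that has the same rank."""
    order = sorted(range(len(sample)), key=lambda i: sample[i])
    normalized = [0.0] * len(sample)
    for rank, index in enumerate(order):
        normalized[index] = reference[rank]
    return normalized

# Two invented training samples with three metabolites each.
training = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
reference = reference_distribution(training)
print(reference)                                        # [1.5, 3.5, 5.5]

# A new sample is normalized on its own against the frozen reference.
print(quantile_normalize([10.0, 30.0, 20.0], reference))  # [1.5, 5.5, 3.5]
```

Because the reference is computed once from the training samples, a test sample can be processed without access to any other test sample.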
Missing value imputation. Missing values in untargeted metabolomics data are known to be problematic 37. A k-nearest neighbors (KNN) approach was applied to impute the missing values, making the data more homogeneous and amenable to AI-based analysis.
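A minimal KNN imputation can be sketched as follows. The details are assumptions, since the text does not specify them: neighbors are taken among samples (not metabolites), distance is a root-mean-square difference over the metabolites observed in both samples, and the imputed value is the unweighted mean of the k nearest donors. The data are invented.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature in the k nearest samples.

    rows is a list of samples; each sample is a list of intensities or None.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other samples in which feature j was observed.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for _, v in donors[:k]]
            if nearest:  # leave the gap if no donor exists
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed

# Invented data: the first sample is missing its third metabolite.
rows = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 5.0],
    [9.0, 9.0, 50.0],
]
print(knn_impute(rows, k=2)[0])  # [1.0, 2.0, 4.0]
```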
Figure 8. Preparation and scheduling of QC and samples for UHPLC-MS/MS. A small aliquot of each sample (coloured cylinders) was pooled to create a QC sample (multi-coloured cylinder), which was then injected periodically (every 50th injection) throughout the batch run. Variability among consistently detected metabolites was used to estimate overall process and batch variability. Every sample injection was followed by a blank injection to prevent carryover between the sample runs.

AI modeling of the data. With the data prepared as above, AI models were built to differentiate cancers from normal controls and then to differentiate between the individual cancers. Keeping in mind the clinical applications of our AI model, a layered approach was used in which an AI model was first developed to differentiate BECO cancers from normal controls, followed by a model to differentiate the individual cancers. Briefly, logistic regression models were fitted on the training data and evaluated on the test data to give accuracy, sensitivity and specificity values according to the standard definitions below, where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively:

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
$$\text{Specificity} = \frac{TN}{TN + FP}$$
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
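The accuracy, sensitivity and specificity values reported for each model follow from the confusion matrix counts. The sketch below uses the standard definitions of these metrics; the counts in the example are made up and do not come from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and accuracy from confusion matrix counts."""
    sensitivity = tp / (tp + fn)                # fraction of true cases detected
    specificity = tn / (tn + fp)                # fraction of controls correctly cleared
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # fraction of all calls that are correct
    return sensitivity, specificity, accuracy

# Made-up counts: 50 cancer samples and 100 controls.
sens, spec, acc = classification_metrics(tp=45, fp=10, tn=90, fn=5)
print(f"sensitivity={sens:.1%} specificity={spec:.1%} accuracy={acc:.1%}")
```

Accuracy is a weighted average of sensitivity and specificity, weighted by the number of cases and controls, so it always lies between the two.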