A non-invasive method for concurrent detection of early-stage women-specific cancers

Gupta, Ankur; Sagar, Ganga; Siddiqui, Zaved; Rao, Kanury V. S.; Nayak, Sujata; Saquib, Najmuddin; Anand, Rajat

doi:10.1038/s41598-022-06274-9

Article
Open access
Published: 10 February 2022

Scientific Reports volume 12, Article number: 2301 (2022)

Abstract

We integrated untargeted serum metabolomics using high-resolution mass spectrometry with data analysis using machine learning algorithms to accurately detect early stages of the women-specific cancers of the breast, endometrium, cervix, and ovary across diverse age groups and ethnicities. A two-step approach was employed wherein cancer-positive samples were first identified as a group. A second multi-class algorithm then helped to distinguish between the individual cancers of the group. The approach yielded high detection sensitivity and specificity, highlighting its utility for the development of multi-cancer detection tests, especially for early-stage cancers.

Introduction

Cancer remains one of the most pervasive causes of death worldwide, and the incidence continues to rise globally^1,2,3. In females, cancer is the second leading cause of death, with about 7 million new cases, and over 3.5 million deaths, being recorded each year⁴. The leading female-specific cancers are breast cancer, cervical cancer, uterine or endometrial cancer, and ovarian cancer^4,5. Of these, breast cancer is the most frequently diagnosed and accounts for 25% of cancer cases, along with 15% of cancer-related deaths, among women across the world⁶. In comparison, cervical cancer is the fourth most frequently diagnosed cancer in women, with an estimate of over 500,000 cases worldwide⁷. Endometrial cancer accounts for about 5% and 2% of global cancer incidence and mortality among women, respectively⁸, whereas ovarian cancer accounts for about 4% of cancers in women⁹.

Poor prognosis in female-specific cancers is often a result of late-stage presentation, with additional factors such as diagnostic uncertainties and/or diagnostic errors also contributing^10,11,12. Significant improvements in treatment outcome can, however, be accomplished if the cancer is accurately detected at the earliest possible stage, when cures are more achievable and treatment is less morbid¹³. Unfortunately, effective screening paradigms exist only for a restricted subset of cancers; these include colonoscopy¹⁴, prostate-specific antigen¹⁵, mammography¹⁶, and cervical cytology¹⁷. Even so, the impact of these tests has been limited because the efficacy of some of them remains questionable¹⁸ and because many patients fail to comply with screening guidelines¹⁹. Similarly, biomarkers based on either DNA or proteins have not yet yielded accurate tests for early-stage cancer detection. Low penetrance in the risk groups, and/or low concentrations of the cancer markers, have proven to be the confounding factors in this regard^20,21,22,23,24. Indeed, diagnoses for most cancers are still prompted by symptoms that only become apparent at the later stages. As a result, development of effective, non-invasive screening methods for early cancer detection remains one of the foremost challenges facing modern cancer research.

The last decade has seen considerable interest in employing ‘omics’-based approaches for early-stage cancer diagnosis. Attempts have been made to capture cancer-specific molecular alterations through interrogation of the genome, epigenome, transcriptome, proteome, metabolome, or lipidome. Of these approaches, metabolomics best reflects changes in phenotype and, therefore, offers the most promising possibilities for translation to clinical application²⁵. Metabolites represent proximal reporters of disease, and the idea that the metabolite composition of biological fluids reflects the health of an individual is now generally accepted²⁶. The application of metabolomics for cancer diagnosis is especially relevant given that cancers are known to possess unique metabolic phenotypes due to altered metabolism. Consequently, the patterns of metabolites that are produced likely encapsulate ‘signatures’ that correlate with the emergence, presence, or behavior of a cancer²⁵. Using biological fluids such as urine, saliva, or serum, metabolite biomarkers have been identified for several cancers²⁷. For example, mass spectrometric studies have identified metabolite changes in the blood of patients that are characteristic of breast cancer^28,29. Similarly, studies have revealed that metabolomics of saliva and urine may also be used to distinguish cancer patients from healthy individuals³⁰.

Despite these recent advances in metabolomics-based cancer detection, however, reliable methodologies for accurate diagnosis of early-stage disease are still lacking. In this context, the majority of studies have focused on identifying discrete metabolites, or sets of discrete metabolites, that are either up- or down-regulated in a specific cancer. What has been less explored are approaches that treat metabolome data as analog outputs, from which patterns that characterize a given disease—and its stage—can be extracted. Untargeted metabolomics by liquid chromatography coupled with mass spectrometry (LC–MS) provides for maximal coverage of metabolite species in a sample^31,32. While the resulting data is complex it is, nonetheless, very information rich³³. Such data is readily amenable to analysis using pattern recognition algorithms and, therefore, has the potential for accurately diagnosing the health state of an individual.

In the present report, we describe an integrated method for the simultaneous detection of early stages of the four most prominent women-specific cancers: breast cancer, endometrial cancer, cervical cancer, and ovarian cancer (BECO). Our method combines untargeted serum metabolomics with data analysis using a machine learning algorithm to capture the complex metabolite signatures that specifically characterize early stages of the individual cancers. The detection accuracy obtained with this method is significantly superior to that of other existing methods. Additionally, it enables simultaneous screening for all four cancers in a single analysis.

Results

Characteristics of samples employed for the study

Table 1 details the sample set employed for the present study. The number of samples for each of the target cancers is shown, along with the number of normal control samples. The majority of the samples were from women between the ages of 30 and 80 years, although a few samples from women in the 20 to 30-year and 81 to 90-year age groups were also included (Table 1). Donors were predominantly Caucasian women (87%), with a smaller number of non-White donors, who included African American, Hispanic, and Asian women (Table 1). The cumulative number of samples employed for the present study was 1369, of which 1119 were derived from the cancers of interest (i.e. breast, endometrial, cervical and ovarian cancer), while the remainder (n = 250) were the normal controls.

Table 1 Demographic, ethnicity, and BMI group of the sample set used in the study.

Data generation and analysis

Positive ion mode UPLC-MS/MS interrogation of the serum metabolome of the samples described in Table 1 resulted in the detection of > 20,000 spectral features (R_t, m/z pairs). The untargeted metabolomics approach (Fig. 1) generated a large list of metabolites, which was further divided into subsets for the normal control, endometrial cancer, breast cancer, cervical cancer, and ovarian cancer samples, containing 5895, 5971, 5982, 6300 and 6336 metabolites, respectively. The total number of unique metabolites identified in our study was 7596. The distribution of the number of unique metabolites identified in samples from the normal control and individual cancer types, as a function of age group, is shown in Fig. 2. Subsequent processing of these data through our in-house pipeline, which sequentially involved normalization, gap filling, data transformation, and feature filtering and selection (“Methods”, Fig. 3), resulted in a matrix consisting of 2764 features across 1369 samples.

Of the 1369 samples in total, 304 were Endometrial Cancer, 303 Breast Cancer, 250 Cervical Cancer, 262 Ovarian Cancer, and 250 Normal Control samples. To determine whether these groups differ in their metabolite data, a PLSDA plot was generated from the matrix described above (Fig. 4). The plot shows that cancer samples can be clearly distinguished from the normal control samples. Additionally, encouraging separation was also obtained between the individual cancer subsets (Fig. 4). To quantify how well these groups can be distinguished, an AI analysis was performed on the data, as described below, to find common patterns of metabolite variation within cancer samples that differ from those in normal control samples. Briefly, keeping in mind clinical applications of the AI model, a layered approach was used in which an AI model was first developed to differentiate the breast, endometrial, cervical, and ovarian (BECO) cancers from normal controls, and a second model was then developed to differentiate between the individual cancers.

Distinguishing women-specific cancer samples from controls

To distinguish breast, endometrial, cervical, and ovarian cancers as a group (BECO cancers) from normal controls, the data was randomly partitioned into training and test datasets in comparable proportion between the individual BECO cancers and the normal controls. This resulted in 562 BECO Cancer samples and 126 Controls in the training set, and 557 BECO Cancer samples and 124 Controls in the test set. The AI model was fitted on the training set (Supplementary Table S1, Fig. 5A) and evaluated on the test set to obtain Accuracy, Sensitivity and Specificity values. A logistic regression function was fitted to the training data to find a function separating BECO Cancer samples from Normal Control samples. Class balancing parameters were configured in the model to deal with the imbalance of classes in the training dataset. The trained algorithm computes a score for each sample according to the formula below:

$$y_{\text{score}} = x_{0} + x_{1} I_{1} + x_{2} I_{2} + x_{3} I_{3} + \cdots + x_{2823} I_{2823}$$

Here, x_0 is a constant, and I_i (1 ≤ i ≤ 2823) is the intensity of metabolite i in the respective sample. Supplementary Figure S1 gives the value of the coefficient x_i (1 ≤ i ≤ 2823) for each metabolite.
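As a minimal illustration of the score formula above, the following sketch evaluates the linear combination for one sample. The three-metabolite coefficients and intensities are made-up toy values, and the sigmoid transform is the standard companion of a logistic regression score rather than something stated in the text.

```python
import math


def model_score(intercept, coefficients, intensities):
    """Return x_0 + sum_i x_i * I_i for a single sample."""
    if len(coefficients) != len(intensities):
        raise ValueError("one coefficient is needed per metabolite intensity")
    return intercept + sum(x * i for x, i in zip(coefficients, intensities))


def to_probability(score):
    """Logistic (sigmoid) transform of a linear score."""
    return 1.0 / (1.0 + math.exp(-score))


# Toy values for three metabolites (hypothetical, not taken from the study).
x0 = -1.0
x = [0.5, -0.25, 2.0]
sample = [2.0, 4.0, 1.5]

score = model_score(x0, x, sample)
probability = to_probability(score)
```

In the study the same sum would run over all metabolite features, with the coefficients taken from the trained model.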

The evaluation of the trained model on the test set, for a single partition of the data, is shown in Fig. 5B. The scatter plot shows the model score for Normal Controls and BECO Cancer cases. The scores clearly differ between the two groups, and applying a threshold of 5 to separate them yields the confusion matrix shown in Fig. 5B. Sensitivity, Specificity and Accuracy, calculated using the formulae given in “Methods”, were 98%, 98.3%, and 98%, respectively.

Differentiating endometrial, breast, cervical, ovarian from each other

In the second step, a multiclass AI model was layered on top of the first model. It acted on the samples predicted as cancer by the first model (breast, endometrial, cervical or ovarian) and assigned a multiclass score to each sample: one score per disease class, denoting the probability of the sample belonging to that class.

Here, out of a total of 1119 BECO samples, 304 were Endometrial Cancer samples, 303 were Breast Cancer samples, 250 were Cervical Cancer samples, and 262 were Ovarian Cancer samples. The data was randomly partitioned into training and test datasets in approximately equal proportion, as shown in Fig. 6 and Supplementary Table S2. This resulted in 152 Endometrial Cancer samples, 152 Breast Cancer samples, 127 Cervical Cancer samples and 131 Ovarian Cancer samples in the training set, and in 152 Endometrial Cancer samples, 151 Breast Cancer samples, 123 Cervical Cancer samples, and 131 Ovarian Cancer samples in the test set. In addition, a set of 124 normal control samples was added to the test set. A one-versus-rest (OVR) multiclass classification model was then built using the training samples to give AI model-2. Following this, a two-layered modeling scheme was applied to the test set. First, AI model-1, which differentiates BECO from normal samples, was applied to the test set. Then, AI model-2 was applied to the resulting predicted BECO samples. This resulted in four scores for each sample, with each score defining the probability of the respective sample belonging to one of the four classes.
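The per-class random partition described above could be sketched as follows. This is only an illustration under stated assumptions: the function name, the fixed seed, and the exact half-and-half rule are hypothetical, and an exact half split does not reproduce every reported count (for example, the reported cervical split is 127/123).

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.5, seed=0):
    """Split sample indices so each class contributes ~train_fraction to training."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Class sizes as reported for the BECO samples.
labels = (["endometrial"] * 304 + ["breast"] * 303
          + ["cervical"] * 250 + ["ovarian"] * 262)
train_idx, test_idx = stratified_split(labels)
```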

As noted above, AI model-2 is a one-versus-rest (OVR) multiclass classification model built using the training samples. The trained algorithm computes four scores for each sample according to the formulae below:

$$\begin{aligned} y_{\text{score1}} &= y_{0} + y_{1} I_{1} + y_{2} I_{2} + y_{3} I_{3} + \cdots + y_{2823} I_{2823} \\ y_{\text{score2}} &= z_{0} + z_{1} I_{1} + z_{2} I_{2} + z_{3} I_{3} + \cdots + z_{2823} I_{2823} \\ y_{\text{score3}} &= a_{0} + a_{1} I_{1} + a_{2} I_{2} + a_{3} I_{3} + \cdots + a_{2823} I_{2823} \\ y_{\text{score4}} &= b_{0} + b_{1} I_{1} + b_{2} I_{2} + b_{3} I_{3} + \cdots + b_{2823} I_{2823} \end{aligned}$$

Here, y_0, z_0, a_0, and b_0 are constants, and I_i (1 ≤ i ≤ 2823) is the intensity of metabolite i in the respective sample. Supplementary Figure S2 gives the values of the coefficients y_i, z_i, a_i, and b_i (1 ≤ i ≤ 2823) for each metabolite.
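The two-layer scheme can be sketched as a small cascade: a first linear score gates cancer versus normal, and four one-versus-rest scores are then compared for the samples that pass. All coefficients below are toy values for two features, the gate threshold of 0 is arbitrary, and choosing the class with the highest score is an assumption made for illustration; the text instead applies a threshold to each class score.

```python
def linear_score(intercept, coefficients, intensities):
    """Intercept plus the weighted sum of metabolite intensities."""
    return intercept + sum(c * i for c, i in zip(coefficients, intensities))


def predict_two_layer(intensities, model1, model2, threshold=0.0):
    """Layer 1: cancer vs normal. Layer 2: best one-vs-rest class score."""
    intercept, coefficients = model1
    if linear_score(intercept, coefficients, intensities) < threshold:
        return "normal"
    scores = {
        name: linear_score(b, w, intensities)
        for name, (b, w) in model2.items()
    }
    return max(scores, key=scores.get)


# Hypothetical two-feature models: (intercept, coefficients).
model1 = (-1.0, [1.0, 1.0])
model2 = {
    "endometrial": (0.0, [1.0, -1.0]),
    "breast": (0.0, [-1.0, 1.0]),
    "cervical": (-5.0, [1.0, 1.0]),
    "ovarian": (-5.0, [0.5, 0.5]),
}

calls = [predict_two_layer(s, model1, model2)
         for s in ([0.2, 0.3], [3.0, 1.0], [1.0, 4.0])]
```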

To determine how well our multiclass model differentiates between the individual disease categories of the BECO group, as well as from normal controls, we plotted the scores obtained from the multiclass model. Figure 7A shows the Endometrial Score for the Endometrial Cancer samples and for the combined set of Breast, Cervical and Ovarian (BCO) Cancer samples. The scores clearly differ between the Endometrial and BCO Cancer samples, and applying a threshold to separate the two sets yields the confusion matrix shown in Fig. 7A. Here, the normal samples were also included in the control group to obtain the sensitivity and specificity values for Endometrial Cancer versus the rest of the groups. Sensitivity, Specificity and Accuracy, calculated using the formulae given in “Methods”, were 87%, 93%, and 91.6%, respectively.

We next plotted the Breast Cancer Score for the Breast Cancer samples against the scores from the combined set of Endometrial, Cervical and Ovarian (ECO) Cancer samples (Fig. 7B). The scores clearly differ between the Breast Cancer and ECO Cancer samples, and applying a threshold to separate the two sets yields the confusion matrix shown in Fig. 7B. Here, the normal samples were also included in the control group to obtain the sensitivity and specificity values for Breast Cancer versus the remaining groups. Sensitivity, Specificity and Accuracy, calculated using the formulae given in “Methods” (AI modeling of the data), were 93%, 95%, and 94.4%, respectively.

Figure 7C shows a plot in which the Cervical Score for the Cervical Cancer samples is compared with the scores from the combined set of Endometrial, Breast and Ovarian (EBO) Cancer samples. The scores clearly differ between the Cervical and EBO Cancer samples, and applying a threshold to separate the two sets yields the confusion matrix shown. Here, the normal control samples were also included in the control group to obtain the sensitivity and specificity values for Cervical Cancer versus the rest. Sensitivity, Specificity and Accuracy, calculated using the formulae given in “Methods”, were 87%, 90%, and 87.6%, respectively.

Figure 7D shows a plot of the Ovarian Score for the Ovarian Cancer samples against the scores from the combined set of Endometrial, Breast and Cervical (EBC) Cancer samples. The scores clearly differ between the Ovarian Cancer and EBC Cancer samples, and applying a threshold to separate the two sets yields the confusion matrix shown. Here, the normal samples were also included in the control group to obtain the sensitivity and specificity values for Ovarian Cancer versus the rest. Sensitivity, Specificity and Accuracy, calculated using the formulae given in “Methods”, were 86%, 93%, and 92%, respectively.

Feature ranking and literature validation of select features

Because the features in our matrix correspond to named metabolites from the HMDB database, the resulting model is explainable, in that mechanistic and other relevant insights related to the women-specific cancers can be extracted from it. To enable this, we performed feature ranking, in which the weights of the individual features from the model were first sorted. Features were then added one at a time until the desired sensitivity and specificity were obtained. The top 100 metabolites obtained for AI model-1 and AI model-2 are listed in Supplementary Table S3.
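The ranking-then-adding procedure might look like the sketch below. It assumes that features are ranked by the absolute value of their model weight and that an `evaluate` callback returns the performance of a model restricted to the selected features; both are assumptions, and the toy `evaluate` used here (share of total absolute weight) merely stands in for retraining and measuring sensitivity and specificity.

```python
def rank_features(weights):
    """Return feature names sorted by decreasing absolute model weight."""
    return sorted(weights, key=lambda name: abs(weights[name]), reverse=True)


def select_top_features(weights, evaluate, target):
    """Add ranked features one at a time until evaluate(selected) >= target."""
    selected = []
    for name in rank_features(weights):
        selected.append(name)
        if evaluate(selected) >= target:
            break
    return selected


# Hypothetical weights for four metabolite features.
weights = {"m1": 0.2, "m2": -1.5, "m3": 0.9, "m4": -0.05}
total = sum(abs(w) for w in weights.values())


def toy_evaluate(selected):
    """Placeholder metric: fraction of total absolute weight covered."""
    return sum(abs(weights[name]) for name in selected) / total


selected = select_top_features(weights, toy_evaluate, target=0.9)
```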

Table 2 lists twenty-five metabolites from the top 100 metabolites identified in AI model-1, which contribute to distinguishing the women-specific cancer (BECO) group from normal controls. The list comprises diverse classes of metabolites, including lipids, nucleosides/nucleotides, amino acids and modified amino acids, acylcarnitines, steroids and dipeptides. While all of these metabolites have been implicated in either tumor growth or progression (Table 2), they span a multiplicity of both anabolic and catabolic pathways (Table 3). This is consistent with emerging evidence that tumor cells have highly complex metabolic requirements, and that numerous pathways are required to complement glucose- and glutamine-dependent biomass production³⁸. Representative metabolites that comprise the signature that helps to distinguish the individual cancers (i.e. breast, endometrial, cervical, or ovarian cancer) are listed in Supplementary Tables 4–7.

Table 2 List of select metabolites involved in the signature for distinguishing the BECO cancer group from normal controls.

Table 3 Diverse biological processes are influenced by the perturbed metabolites specific for the BECO cancer group.

Discussion

With the increasing burden of cancer mortality in women^39,40, early detection to improve treatment outcomes has now become a priority. Unfortunately, however, reliable and accurate methods for early detection of most cancers are still not available. The problem is further exacerbated, particularly in middle- to low-income settings, by the relatively high costs of cancer screening, especially because current methods largely allow for diagnosing only one cancer at a time. In contrast, any method that can simultaneously detect multiple cancers at an early stage would find greater applicability, simply because fewer diagnostic procedures would need to be undertaken and the associated cost would be reduced. The utility of such a multi-cancer detection test would be further enhanced if it involved a non-invasive procedure, and if the test accuracy were high.

In the present report we describe an integrated method that can simultaneously detect Stage 0/I of all four of the women-specific cancers with high sensitivity and specificity. The cancers detected are breast, endometrial, cervical, and ovarian cancers. Our approach combined an untargeted UHPLC-MS/MS analysis of the serum metabolome with the subsequent interrogation of the data using machine learning algorithms. A key aspect of our data analysis pipeline was the generation of a matrix wherein spectral features from the mass spectrometry profiles of the samples were translated into known metabolites identified using the HMDB database. A PLSDA plot revealed that the information content in the matrix was sufficient to clearly distinguish the cancer groups from the normal control group, and also to achieve at least a reasonable degree of resolution between the individual cancer subsets. Consequently, this matrix provided the basis for developing an AI algorithm for early-stage cancer detection. For this, we employed a two-step strategy. In the first step we developed an algorithm (AI model-1) for distinguishing between the cancer samples and normal control samples. As our results show, we were indeed successful in identifying the BECO group of samples with high sensitivity and specificity. Subsequently, a second AI model (AI model-2) was developed in order to distinguish between the individual cancers of the BECO group. For this, a one-versus-rest (OVR) multiclass classification model was developed and, as shown here, this model yielded a reasonably high accuracy in identifying the tissue of origin of the cancer in samples of the BECO group. Efforts are currently underway to further improve the accuracy of AI model-2.

Thus, our studies show that combining untargeted metabolomics with machine learning approaches for data analysis provides an attractive way forward for developing highly accurate multi-cancer detection approaches. Especially noteworthy about our results is the high detection accuracy obtained for early-stage cancers, which is significantly superior to that of the other approaches explored to date^42,43,44,45. It will be of interest to determine if the scope of this approach can be expanded to include simultaneous detection of additional cancers. In addition, a more extensive sampling of patients across a wider diversity of racial and ethnic groups would also help in determining the robustness of the approach.

Methods

Study design and sample collection

All samples used in this study were purchased from three separate commercial biobanks: Dx Biosamples (San Diego, CA), Reprocell USA Inc. (Beltsville, MD), and Fidelis Research AD (Sofia, Bulgaria). From these sources, we obtained serum samples that were derived from treatment-naïve female patients with Stage 0 or Stage 1 breast, uterine, cervical, or ovarian cancer. Clinical profile information on the donors included histological stage and grade, along with TNM classification of the cancer. Further, the HPV status of donors with uterine, cervical, and ovarian cancers was also provided, along with results of the CA-125 tumor marker determination for uterine and ovarian cancer patients. Finally, the breast cancer samples were accompanied by information on the presence or absence of the Ki-67, ER, PR, and HER2 markers in the donors. To serve as controls in our assay, we also procured additional sera from normal volunteers. The total number of samples across all groups was 1369, and they were stored at −20 °C for the short term prior to use.

Sample accessioning

Samples were inventoried and immediately stored at −80 °C after receipt. Each sample received was allotted a unique identifier number. This identifier was used to track all sample handling, tasks, results, etc. The samples (and all derived aliquots) were tracked by the identifier. All samples were maintained at −80 °C until processed.

Extraction of metabolites from serum samples

Metabolite extraction from serum was performed as previously described³⁴. Briefly, all the serum samples were thawed on ice and mixed properly. 10 µl of each serum sample was transferred to a 1.5 ml microfuge tube (Genaxy, Cat. No. GEN-MT-150-C. S), and 30 µl of chilled methanol (Merck, Cat. No. 1.06018.1000) was added. The sample was vortexed briefly and then kept at −20 °C for 60 min.

The sample was then centrifuged (Sorvall Legend Micro17, Thermo Fisher Scientific, Cat. No. Legend Micro 17) at 10,000 rpm for 10 min. After centrifugation, 27 µl of supernatant was collected in a separate microfuge tube without disturbing the pellet and dried using a Speed Vacuum (Thermo Fisher Scientific, Cat. No. SPD1030-230) at low energy for 30–35 min. The dried samples were then re-suspended in 30 µl of a methanol:water (1:1) mixture for injection.

Ultrahigh performance liquid chromatography-tandem mass spectrometry (UHPLC-MS/MS)

Untargeted metabolomics was performed using a Dionex LC system (Ultimate 3000) coupled online with a QExactive Plus mass spectrometer (Thermo Scientific). Each extracted metabolite sample was injected (10 µl for positive ESI ionization) onto an Acquity UPLC HSS T3 column from Waters (1.8 micron, 2.1 × 100 mm, Part No. 186003539), which was heated to 37 °C. The flow rate was 0.3 ml/min. Mobile phase A was water + 0.1% formic acid, and mobile phase B was methanol + 0.1% formic acid. The mobile phase was kept isocratic at 5% B for 1 min, increased to 95% B in 7 min, and held at 95% B for another 2 min, after which the composition returned to 5% B in 14 min. The ESI voltage was 4 kV. The mass accuracy of the QExactive mass spectrometer was less than 5 ppm, and the instrument was calibrated on the recommended schedule prior to each batch run. The mass scan range was 66.7 to 1000 Da, and the resolution was set to 70,000. The maximum injection time for the Orbitrap was 100 ms, and the AGC target was set to 1e6.

Quality assessment and quality control

Several types of controls were analyzed in concert with the experimental samples. Blank gradient runs were included at every alternate sample run. A pooled QC sample, generated by taking a small volume of each experimental sample, served as a technical replicate throughout the data set; it also allowed instrument performance monitoring and aided chromatographic alignment. Mass accuracy of the instrument was checked every third day using the vendor-specific calibrant (Thermo Fisher Scientific, Breda, The Netherlands). Overall process variability was determined by calculating the median relative standard deviation (RSD) for all endogenous metabolites (i.e., non-instrument standards) present in 100% of the pooled matrix samples. Experimental samples were randomized across the platform run, with QC samples spaced evenly (every 50th sample) among the injections (Fig. 8).
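The process-variability metric described above can be illustrated with a short sketch. It assumes that RSD means 100 × sample standard deviation / mean, that undetected values are represented as `None`, and that the intensities shown are invented toy numbers.

```python
from statistics import mean, median, stdev


def relative_sd(values):
    """Relative standard deviation in percent: 100 * sample SD / mean."""
    return 100.0 * stdev(values) / mean(values)


def median_rsd(qc_intensities):
    """Median RSD over metabolites detected in every pooled QC injection.

    qc_intensities maps metabolite name -> intensities across QC injections,
    with None marking an injection where the metabolite was not detected.
    """
    rsds = [
        relative_sd(values)
        for values in qc_intensities.values()
        if all(v is not None for v in values)
    ]
    return median(rsds)


# Toy pooled-QC intensities (hypothetical); "c" is skipped because it is
# missing from one QC injection.
qc = {
    "a": [100.0, 110.0, 90.0],
    "b": [50.0, 50.0, 50.0],
    "c": [10.0, None, 12.0],
    "d": [200.0, 220.0, 180.0],
}
variability = median_rsd(qc)
```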

Data processing

The mass spectrometry data was first subjected to preprocessing as shown schematically in Fig. 3. The individual steps were as follows:

Incorporating mass errors in the data

Mass errors are known to be present in metabolomics data³⁵. This means that the same identified metabolite can have a slightly different measured mass in different samples, which creates problems when the intensity of the same metabolite has to be compared across samples, as required in the downstream AI-based analysis. Usually, a fixed mass window is used to align the samples; here, we instead used a parts per million (ppm) error-based approach. Briefly, we adapted the virtual lock mass (VLM) based approach³⁵, which rests on the principle that mass errors are known to increase with mass³⁵. The approach was adapted to our datasets by combining the traditional VLM-based approach with metabolite identification from the HMDB database. Specifically, the VLM boxes were defined using the masses of metabolites identified by HMDB database search across multiple samples. This resulted in an initial matrix of 6893 metabolites or features. From this, we next removed all features that corresponded to either plant products or drugs and their metabolites. The total number of features in the resultant matrix was thereby reduced to 5558.
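To make the difference between a fixed mass window and a ppm-based window concrete, the sketch below computes a ppm error and checks it against a tolerance. This is not the VLM algorithm itself; the 5 ppm tolerance and the example masses are assumptions chosen for illustration.

```python
def ppm_error(observed_mz, reference_mz):
    """Signed mass error in parts per million relative to the reference."""
    return (observed_mz - reference_mz) / reference_mz * 1e6


def within_ppm(observed_mz, reference_mz, tolerance_ppm=5.0):
    """True if the observed m/z lies within the ppm tolerance of the reference."""
    return abs(ppm_error(observed_mz, reference_mz)) <= tolerance_ppm


def within_fixed_window(observed_mz, reference_mz, window_da=0.002):
    """True if the observed m/z lies within a fixed absolute window (in Da)."""
    return abs(observed_mz - reference_mz) <= window_da


# A 0.003 Da deviation is about 3.3 ppm at m/z 900, so the ppm rule accepts it
# while a fixed 0.002 Da window rejects it. A 0.0015 Da deviation at m/z 100 is
# about 15 ppm, so the ppm rule rejects it while the fixed window accepts it.
high_mass = (within_ppm(900.003, 900.0), within_fixed_window(900.003, 900.0))
low_mass = (within_ppm(100.0015, 100.0), within_fixed_window(100.0015, 100.0))
```

Because the tolerance scales with mass, the ppm rule is stricter for light ions and more permissive for heavy ones than any single fixed window.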

Metabolite ions filtering

The metabolite ions were filtered based on their frequency of presence in the individual samples. A 20% cutoff was used, wherein those metabolites that were present in less than 20% of the samples were excluded. This resulted in a final matrix size of 2764 features, which was then taken for our subsequent analysis.
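The presence-based filter can be sketched as follows, assuming the matrix is held as a mapping from feature name to per-sample intensities with `None` for "not detected". The data layout and toy values are hypothetical; only the 20% cutoff comes from the text.

```python
def filter_by_presence(matrix, min_percent=20):
    """Keep features detected in at least min_percent of the samples.

    matrix maps feature name -> list of intensities (None = not detected).
    Integer arithmetic avoids floating-point issues at the exact cutoff.
    """
    kept = {}
    for feature, values in matrix.items():
        present = sum(v is not None for v in values)
        if present * 100 >= min_percent * len(values):
            kept[feature] = values
    return kept


# Ten toy samples: "rare" is detected in 10% of them, "borderline" in exactly
# 20%, and "common" in all of them.
matrix = {
    "rare": [1.0] + [None] * 9,
    "borderline": [1.0, 2.0] + [None] * 8,
    "common": [1.0] * 10,
}
kept = filter_by_presence(matrix)
```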

Data normalization

Because metabolomic data vary with the operating conditions of the mass spectrometer, normalization methods are needed to minimize these variations³⁶. Several normalization methods were tried, including Quantile Normalization, Variance Stabilization Normalization, Best Normalization, and Probabilistic Quotient Normalization. Quantile Normalization (QN) was selected as the method that performed best across the various conditions of the experiment. The QN method was further adapted to our datasets to enable normalization of new samples with respect to the training datasets, and testing of one sample at a time.
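One way to normalize a single new sample against the training data is to freeze a reference distribution computed from the training samples and map the new sample's ranks onto it. The sketch below assumes that interpretation; the authors' actual adaptation is not described in detail, and the function names and values are illustrative.

```python
def reference_distribution(training_samples):
    """Mean of the sorted intensities at each rank across training samples."""
    sorted_rows = [sorted(row) for row in training_samples]
    n = len(sorted_rows)
    return [sum(column) / n for column in zip(*sorted_rows)]


def quantile_normalize(sample, reference):
    """Replace each value by the reference value of the same rank.

    Ties are broken by position; sample and reference must have equal length.
    """
    order = sorted(range(len(sample)), key=lambda i: sample[i])
    normalized = [0.0] * len(sample)
    for rank, index in enumerate(order):
        normalized[index] = reference[rank]
    return normalized


# Two toy training samples with three features each (hypothetical values).
training = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
reference = reference_distribution(training)

# A new sample is normalized on its own against the frozen reference.
new_sample = quantile_normalize([10.0, 30.0, 20.0], reference)
```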

Missing value imputation

Missing values in untargeted metabolomics data are known to be problematic³⁷. A k-nearest neighbors (KNN) approach was applied to impute the missing values, making the data more homogeneous and amenable to AI-based analysis.
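A bare-bones version of KNN imputation is sketched below. The distance measure (root-mean-square difference over features observed in both samples), the choice of k = 2, and the toy data are all assumptions; the text does not specify these details.

```python
import math


def _distance(a, b):
    """Root-mean-square difference over features observed in both samples."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(samples, k=2):
    """Fill None entries with the mean of that feature over the k nearest
    samples in which the feature was observed."""
    imputed = [list(row) for row in samples]
    for i, row in enumerate(samples):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = sorted(
                (_distance(row, other), other[j])
                for m, other in enumerate(samples)
                if m != i and other[j] is not None
            )[:k]
            if donors:
                imputed[i][j] = sum(v for _, v in donors) / len(donors)
    return imputed


# Toy matrix: rows are samples, columns are features, None is a missing value.
samples = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.5, 2.5, 5.0],
    [9.0, 9.0, 50.0],
]
filled = knn_impute(samples)
```

Here the missing value is filled from the two closest samples (3.0 and 5.0), while the distant fourth sample is ignored.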

AI modeling of the data

Using the processed data described above, AI models were built to differentiate cancers from normal controls, and then to differentiate between the individual cancers. Keeping in mind clinical applications of our AI model, a layered approach was used in which an AI model was first developed to differentiate BECO cancers from normal controls, followed by a second model to differentiate between the individual cancers. Briefly, logistic regression models were fitted to the training data and evaluated on the test data to give accuracy, sensitivity and specificity values according to the formulae below:

Actual	Predicted negative	Predicted positive
Negative	True negative (TN)	False positive (FP)
Positive	False negative (FN)	True positive (TP)

$$\mathrm{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}$$

$$\mathrm{Sensitivity} = \frac{TP}{TP+FN}$$

$$\mathrm{Specificity} = \frac{TN}{TN+FP}$$
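The three formulae translate directly into code. The sketch below counts the confusion-matrix cells from lists of actual and predicted labels and then applies the definitions; the label names and the eight toy samples are hypothetical.

```python
def confusion_counts(actual, predicted, positive="cancer"):
    """Return (TP, TN, FP, FN) for paired actual and predicted labels."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity as defined above."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Eight toy samples: one cancer case is missed and one control is flagged.
actual = ["cancer"] * 4 + ["normal"] * 4
predicted = ["cancer", "cancer", "cancer", "normal",
             "normal", "normal", "normal", "cancer"]
counts = confusion_counts(actual, predicted)
result = metrics(*counts)
```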

Ethics statement

The study is in accordance with relevant guidelines and regulations.

References

Dagenais, G. R. et al. Variations in common diseases, hospital admissions, and deaths in middle-aged adults in 21 countries from five continents (PURE): A prospective cohort study. Lancet 395, 785–794. https://doi.org/10.1016/S0140-6736(19)32007-0 (2019).
Fitzmaurice, C. et al. Global, regional, and national cancer incidence, mortality, years of life lost, years lived with disability, and disability-adjusted life-years for 29 cancer groups, 1990 to 2016: A systematic analysis for the Global Burden of Disease study. JAMA Oncol. 4, 1553–1568 (2018).
Bray, F. et al. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 68, 394–424 (2018).
Torre, L. A., Islami, F., Siegel, R. L., Ward, E. M. & Jemal, A. Global cancer in women: Burden and trends. Cancer Epidemiol. Biomark. Prev. 26, 444–457 (2017).
Sung, H. et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 71, 209–249 (2021).
Ghoncheh, M., Pournamdar, Z. & Salehiniya, H. Incidence and mortality and epidemiology of breast cancer in the world. Asian Pac. J. Cancer Prev. 17, 43–46 (2016).
Arbyn, M. et al. Estimates of incidence and mortality of cervical cancer in 2018: A worldwide analysis. Lancet Glob Health. https://doi.org/10.1016/S2214-109X(19)30482-6 (2020).
Zhang, S. et al. Global regional and national burden of endometrial cancer, 1990–2017: Results from the global burden of disease study, 2017. Front Oncol 9, 1440. https://doi.org/10.3389/fonc.2019.01440 (2019).
Momenimovahed, Z., Tiznobaik, A., Taheri, S. & Salehiniya, H. Ovarian cancer in the world: Epidemiology and risk factors. Int. J. Womens Health 11, 287–299 (2019).
Siegel, R. L., Miller, K. D. & Jemal, A. Cancer statistics, 2020. CA Cancer J. Clin. 70, 7–30. https://doi.org/10.3322/caac.21590 (2020).
Howlader, N. et al. (eds) SEER Cancer Statistics Review, 1975–2014 (National Cancer Institute, 2017).
Ahlquist, D. A. Universal cancer screening: Revolutionary, rational, and realizable. NPJ Precis Oncol. https://doi.org/10.1038/s41698-018-0066-x (2018).
World Health Organization. Guide to early cancer diagnosis. https://apps.who.int/iris/bitstream/handle/10665/254500/9789241511940-eng.pdf?sequence=1&isAllowed=y (2017).
Pickhardt, P. J., Hassan, C., Halligan, S. & Marmo, R. Colorectal cancer: CT colonography and colonoscopy for detection—Systematic review and metaanalysis. Radiology 259, 393–405 (2011).
Brawer, M. K. Prostate-specific antigen. Semin. Surg. Oncol. 18, 3–9 (2000).
van den Biggelaar, F. J. H. M., Nelemans, P. J. & Flobbe, K. Performance of radiographers in mammogram interpretation: A systematic review. Breast 17, 85–90 (2008).
Partridge, E. E. et al. Cervical cancer screening. J. Natl. Compr. Canc. Netw. 8, 1358–1386 (2010).
Pinsky, P. F., Prorok, P. C. & Kramer, B. S. Prostate cancer screening—A perspective on the current state of the evidence. N. Engl. J. Med. 376, 1285–1289 (2017).
Subramanian, S., Klosterman, M., Amonkar, M. M. & Hunt, T. L. Adherence with colorectal cancer screening guidelines: A review. Prev. Med. 38, 536–550 (2004).
Goossens, N., Nakagawa, S., Sun, X. & Hoshida, Y. Cancer biomarker discovery and validation. Transl. Cancer Res. 4, 256–269 (2015).
Lennon, A. M. et al. Feasibility of blood testing combined with PET-CT to screen for cancer and guide intervention. Science https://doi.org/10.1126/science.abb9601 (2020).
Liu, M. C., Oxnard, G. R., Klein, E. A., Swanton, C. & Seiden, M. V. Sensitive and specific multi-cancer detection and localization using methylation signatures in cell-free DNA. Ann. Oncol. 31, 745–759 (2020).
Chen, X. et al. Non-invasive early detection of cancer four years before conventional diagnosis using a blood test. Nat. Commun. https://doi.org/10.1038/s41467-020-17316-z (2020).
Ren, A. H., Fiala, C. A., Diamandis, E. P. & Kulasingam, V. Pitfalls in cancer biomarker discovery and validation with emphasis on circulating tumor DNA. Cancer Epidemiol. Biomark. Prev. 29, 2568–2574 (2020).
Wang, L., Liu, X. & Yang, Q. Application of metabolomics in cancer research: As a powerful tool to screen biomarker for diagnosis, monitoring and prognosis of cancer. Biomark. J. https://doi.org/10.21767/2472-1646.100050 (2018).
Spratlin, J. L., Serkova, N. J. & Eckhardt, S. G. Clinical applications of metabolomics in oncology: A review. Clin. Cancer Res. 15, 431–440 (2009).
Hwang, V. J. & Weiss, R. H. Metabolomic profiling for early cancer detection: Current status and future prospects. Expert Opin. Drug Metab. Toxicol. 12, 1263–1265 (2016).
Chen, Z., Li, Z., Li, H. & Jiang, Y. Metabolomics: A promising diagnostic and therapeutic implement for breast cancer. Onco Targets Ther. 12, 6797–6811 (2019).
Yuan, B. et al. A plasma metabolite panel as biomarkers for early primary breast cancer detection. Int. J. Cancer 144, 2833–2842 (2019).
Zhang, A., Sun, H., Yan, G., Wang, P. & Wang, X. Metabolomics for biomarker discovery: Moving to the clinic. Biomed. Res. Int. https://doi.org/10.1155/2015/354671 (2015).
Alonso, A., Marsal, S. & Julia, A. Analytical methods in untargeted metabolomics: State of the art in 2015. Front Bioeng Biotechnol. https://doi.org/10.3389/fbioe.2015.00023 (2015).
Schrimpe-Rutledge, A. C., Codreanu, S. G., Sherrod, S. D. & McLean, J. A. Untargeted metabolomics strategies—Challenges and emerging directions. J. Am. Soc. Mass Spectrom. 27, 1897–1905 (2016).
Issaq, H. J., Xiao, Z. & Veenstra, T. D. Serum and plasma proteomics. Chem. Rev. 107, 3601–3620 (2007).
Want, E. J. et al. Solvent-dependent metabolite distribution, clustering, and protein extraction for serum profiling with mass spectrometry. Anal. Chem. 78, 743–752 (2006).
Brochu, F. et al. Mass spectra alignment using virtual lock-masses. Sci. Rep. 9(1), 8469. https://doi.org/10.1038/s41598-019-44923-8 (2019).
Kohl, S. M. et al. State-of-the art data normalization methods improve NMR-based metabolomic analysis. Metabolomics 8(1), 146–160. https://doi.org/10.1007/s11306-011-0350-z (2012).
Gromski, P. S. et al. Metabolites 4, 433–452 (2014).
Boroughs, L. K. & DeBerardinis, R. J. Metabolic pathways promoting cancer cell survival and growth. Nat. Cell Biol. 17, 351–359 (2015).
Torre, L. A., Islami, F., Siegel, R. L., Ward, E. M. & Jemal, A. Global cancer in women: Burden and trends. Cancer Epidemiol. Biomark. Prev. 26, 444–457 (2017).
Siegel, R. L., Miller, K. D. & Jemal, A. Cancer statistics, 2019. CA Cancer J. Clin. 69, 7–34 (2019).
Kakushadze, Z., Raghubanshi, R. & Yu, W. Estimating cost savings from early cancer diagnosis. Data https://doi.org/10.3390/data2030030 (2017).
Cohen, J. D. et al. Detection and localization of surgically resectable cancers with a multi-analyte blood test. Science 359, 926–930 (2018).
Lennon, A. M. et al. Feasibility of blood testing combined with PET-CT to screen for cancer and guide intervention. Science https://doi.org/10.1126/science.abb9601 (2020).
Liu, M. C. et al. Sensitive and specific multi-cancer detection and localization using methylation signatures in cell-free DNA. Ann. Oncol. 31, 745–759 (2020).
Klein, E. A. et al. Clinical validation of a targeted methylation-based multi-cancer early detection test using an independent validation set. Ann. Oncol. 32, 1167–1177 (2021).

Author information

These authors contributed equally: Ankur Gupta and Ganga Sagar.

Authors and Affiliations

PredOmix Technologies Private Limited, Tower B, SAS Tower, Medicity, Sector – 38, Gurugram, 122002, India
Ankur Gupta, Ganga Sagar, Zaved Siddiqui, Kanury V. S. Rao, Sujata Nayak, Najmuddin Saquib & Rajat Anand
PredOmix, Inc., 9853 Pacific Heights Blvd., San Diego, CA, 92121-4721, USA
Kanury V. S. Rao, Sujata Nayak & Rajat Anand

Contributions

N.S., G.S., and Z.S. led all experimental aspects of the study. A.G. and R.A. led all analytical aspects of the study. R.A., N.S., and K.V.S.R. wrote the paper. R.A., N.S., K.V.S.R., and S.N. designed and supervised all aspects of the study.

Corresponding authors

Correspondence to Najmuddin Saquib or Rajat Anand.

Ethics declarations

Competing interests

A.G., G.S., Z.S. and N.S. are full-time employees of PredOmix Technologies Private Limited. K.V.S.R., R.A. and S.N. are cofounders of, and own stock in, both PredOmix Technologies Private Limited and PredOmix, Inc. The work described in this report is the subject of an International PCT filing, Application No. PCT/US21/48337.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Supplementary Table 3.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Gupta, A., Sagar, G., Siddiqui, Z. et al. A non-invasive method for concurrent detection of early-stage women-specific cancers. Sci Rep 12, 2301 (2022). https://doi.org/10.1038/s41598-022-06274-9

Received: 11 September 2021
Accepted: 20 January 2022
Published: 10 February 2022
DOI: https://doi.org/10.1038/s41598-022-06274-9

This article is cited by

A non-invasive method for concurrent detection of multiple early-stage cancers in women
- Ankur Gupta
- Zaved Siddiqui
- Najmuddin Saquib
Scientific Reports (2023)
A comparison of different machine-learning techniques for the selection of a panel of metabolites allowing early detection of brain tumors
- Adrian Godlewski
- Marcin Czajkowski
- Michal Ciborowski
Scientific Reports (2023)
