A non-invasive method for concurrent detection of multiple early-stage cancers in women

Gupta, Ankur; Siddiqui, Zaved; Sagar, Ganga; Rao, Kanury V. S.; Saquib, Najmuddin

doi:10.1038/s41598-023-46553-7

Download PDF

Article
Open access
Published: 04 November 2023

A non-invasive method for concurrent detection of multiple early-stage cancers in women

Ankur Gupta^1,2,
Zaved Siddiqui^1,2,
Ganga Sagar²,
Kanury V. S. Rao^1,2 &
…
Najmuddin Saquib^1,2

Scientific Reports volume 13, Article number: 19083 (2023) Cite this article

1218 Accesses
13 Altmetric
Metrics details

Subjects

Abstract

Untargeted serum metabolomics was combined with machine learning-powered data analytics to develop a test for the concurrent detection of multiple cancers in women. A total of fifteen cancers were tested where the resulting metabolome data was sequentially analysed using two separate algorithms. The first algorithm successfully identified all the cancer-positive samples with an overall accuracy of > 99%. This result was particularly significant given that the samples tested were predominantly from early-stage cancers. Samples identified as cancer-positive were next analysed using a multi-class algorithm, which then enabled accurate discernment of the tissue of origin for the individual samples. Integration of serum metabolomics with appropriate data analytical tools, therefore, provides a powerful screening platform for early-stage cancers.

A single-cell atlas enables mapping of homeostatic cellular shifts in the adult human breast

Article Open access 28 March 2024

Development and validation of a new algorithm for improved cardiovascular risk prediction

Article Open access 18 April 2024

Best practices for single-cell analysis across modalities

Article 31 March 2023

Introduction

Cancer is rapidly emerging as the leading cause of premature death globally^1,2,3. While development of more effective therapies is ongoing, early-stage detection of cancer offers a more viable strategy for reducing disease-related morbidity and mortality^4,5,6,7,8,9. In addition to increasing the likelihood of treatment success, detection of cancer in its early stages also allows for improved quality of life, along with a significant reduction in the cost and complexity of treatment^10,11. Multiple studies have shown that the 5-year survival rates are markedly higher in patients diagnosed with Stage I–II cancers as opposed to those who are diagnosed at Stage III-IV^{12,13,14,15,16,17}. Unfortunately, however, screening tests are available only for a restricted set of cancers and these include breast, colorectal, cervical, lung, and prostate cancers^{18,19,20,21,22}. While tests for these cancers have indeed contributed to reducing cancer-specific mortality^23,24, their impact has remained sub-optimal because the efficacy of some of them remains questionable^25,26,27,28. Furthermore, these screening approaches are designed to individually detect only a single cancer type²⁹. Besides these five cancers, however, diagnosis for the remaining cancer types is still prompted by symptoms that appear only at the later stages of the disease.

Of the various approaches that are currently being marshalled to improve cancer diagnosis³⁰, recent developments in the field of multi-cancer early detection (MCED) have shown promise^{31,32,33,34,35}. MCED screening tests aim to capture signals from cell-free (cf)—or circulating tumour (ct)—DNA, or other circulating analytes shed by tumours into blood, that are associated with multiple cancers. Importantly, these tests also detect those cancers for which ‘standard of care’ screening modalities do not currently exist^{28,32,33,34,35}. MCED tests are now being viewed as viable strategies for enhancing the depth and scope of cancer screening programs, thereby facilitating significant reductions in the cancer death rate. Although MCED tests currently under development do provide grounds for optimism, the fact that biomarker concentrations are low—which then have to be distinguished from the background noise of normal human physiology—have hampered efforts to achieve high detection sensitivity for early-stage cancers^28,35,36. This poses a limitation because an effective screening test must have high detection sensitivity and specificity so that problems due either to under- or over-diagnosis are minimized^37,38.

In a previous study we had adopted an alternate approach wherein we interrogated the serum metabolome for any modulations in metabolite patterns that correlated with the presence or absence of cancer³⁹. Our rationale was derived from the fact that the metabolite composition of biological fluids reflects the health of an individual^40,41. Furthermore, metabolome profiling appeared to be particularly relevant for cancer detection given that metabolic reprogramming is one of the key hallmarks of cancer cells^42,43,44,45. Therefore we had reasoned that, by using appropriate data analysis tools, it should be possible to accurately extract metabolite ‘signatures’ that are characteristic of cancer. Our expectations were indeed borne out by the results obtained³⁹. In that report we had shown, by combining untargeted serum metabolomics with machine learning-based data analysis, that we could detect Stage-0/I of the four female-specific cancers (breast, endometrial, cervical, and ovarian) with an average accuracy of around 98%³⁹. Subsequently, in a follow-up study, we were also able to validate the performance of this test in a clinical setting (manuscript submitted).

The encouraging nature of these earlier results suggested to us that our approach could potentially be developed as an early-stage multi-cancer detection platform. To this end, we first sought to explore whether the scope of this method could be expanded to detect additional cancers in women, especially those in Stage-I of the disease. Results described here reveal that our methodology could indeed be readily adapted to concurrently detect the early stages of a total of 15 cancers in women with high accuracy. At a specificity of 99.3%, the detection accuracy of the individual cancers ranged from 94 to 100%, with an average sensitivity of > 99%. Furthermore, we were also able to successfully identify the ‘tissue of origin’ for the test samples at an overall accuracy of close to 92%.

Results

Details of samples included in the study

The demographic and clinical information of samples included in the study are presented in Table 1 and Supplementary Table 1. The age distribution of these samples ranged from 20 to 90 years. Nearly 95% of the samples came from individuals between the ages of 30 to 80 years, and the remaining 5% was split between individuals from the age groups of 20—30 years and 81–90 years (Supplementary Fig. 1). The majority of samples (92%) were from Caucasian with only 8% of samples coming from non-White who were either Hispanic, Asian, or African American women. The total number of cancer samples was 1926, which included samples from women with either breast, endometrial, cervical, ovarian, lung, AML, thyroid, melanoma, colorectal, kidney, NHL, pancreatic, head & neck, gastric, liver and bile duct cancers. Additionally, we also included 300 samples from healthy volunteers as the normal control subset.

Table 1 Description of the sample set employed in the study.

Full size table

Pre-processing of data prior to AI workflow

An untargeted metabolomics workflow involving positive ion mode ultra-pressure liquid chromatography coupled to mass spectrometry (UPLC-MS/MS) was employed for the individual serum samples described in Table-1. This resulted in > 20,000 spectral features (RT, m/z pairs), which was then further resolved into known metabolites by using the Human Metabolome Database (HMDB). The number of known metabolites obtained by this process for the individual groups of normal control, breast cancer, endometrial cancer, cervical cancer, ovarian cancer, lung cancer, AML, thyroid cancer, melanoma, colorectal cancer, kidney cancer, NHL, pancreatic cancer, head & neck cancer, gastric cancer and liver & bile duct cancer were 2821, 3119, 3209, 3237, 2638, 2238, 2215, 2344, 2622, 2117, 1935, 2033, 2202, 2160, 2116, and 2045, respectively. The cumulative list across all the groups was found to comprise of 8312 unique metabolites, which were then used for further analysis. The distribution of these unique metabolites across the individual sample groups is shown in Fig. 1. We next processed this data through our in-house pipeline that included normalization, gap filling, data transformation, followed by feature filtering and selection (Methods, Fig. 2) to generate a matrix consisting of 5104 features representing the 1926 cancer samples, as well as the 300 normal control samples.

To determine whether the information contained in these features could distinguish between cancer samples and normal controls we first generated a PCA plot of cancer samples and normal controls with and without the QC samples. The initial plot was indicative for class separation (supplementary Fig. 2). Following this generated a PLSDA plot using the matrix. As shown in Fig. 3, the PLSDA plot could clearly differentiate cancer samples from normal control by segregating them into two distinct clusters (R2 = 0.991, Q2 = 0.806). To further develop this into a robust and sensitive method for cancer diagnosis, we resorted to AI analysis. The aim here was to more precisely capture variations in metabolite patterns that characterized the cancer samples on the one hand, and normal control samples on the other. In addition to cancer detection, we were also interested in developing an algorithm that enabled identification of the tissue of origin (TOO) in the case of cancer-positive samples. Accordingly then, we adopted a layered approach where we first focussed on accurately distinguishing cancer samples from the normal control, followed by the development of an algorithm for identifying the TOO of the cancer-positive samples.

Cancer detection artificial intelligence (CDAI) algorithm for distinguishing cancer samples from normal controls

The first step was to develop an algorithm for the differentiation of cancer samples from normal controls. We termed this as the Cancer Detection Artificial Intelligence (CDAI) model. For this, the matrix data was randomly divided into training and test sets in comparable proportion between the individual cancers and the normal controls in order to cumulatively distinguish all 15 cancers listed in Table 1 from normal controls. A total of 150 normal control samples and 966 cancer samples were used as the training set, while the test set was comprised of 150 normal controls and 960 cancer samples (Table 2). The accuracy, sensitivity, and specificity values for the CDAI model were obtained by applying it to the training set and evaluating it on the test set (Table 2 and Fig. 2). To distinguish between cancer samples and normal control, the logistic regression function was applied to the training data.

$$ {\text{y}}\_{\text{score}} = {\text{x}}_{0} + {\text{x}}_{{1}} *{\text{I}}_{{1}} + {\text{ x}}_{{2}} *{\text{I}}_{{2}} + {\text{ x}}_{{3}} *{\text{I}}_{{3}} + \cdots \cdots \, + {\text{x}}_{{\text{n}}} *{\text{I}}_{{\text{n}}} $$

Table 2 Distribution of samples between the training and testing sets for development of the CDAI algorithm.

Full size table

Here, × 0 is a constant number, I_i (1 ≤ i ≤ n) is the intensity of metabolite i present in the respective sample. The total number of metabolites is represented by the symbol n(n ∈ [1000, 5104]). Supplementary Fig. 3 gives the value of coefficient x_i(1 ≤ i ≤ n) for each metabolite.

The model was cross validated across 1000 random train-test split which yielded an average sensitivity, specificity of 99.6 (99.5–99.8), 99.3 (98.9–99.5) at 95 CI respectively. The evaluation of the trained model as applied on a single test set for a single partition of data is shown in Fig. 4. The scatter plot in panel A shows the Model Score for normal controls and cancer cases. It is evident that these scores are clearly different between normal controls and the samples derived from all the different cancer types being tested (Fig. 4A). Application of a threshold of 0 to differentiate between cancer samples and normal controls resulted in the confusion matrix shown in Fig. 4B. From the results depicted in this matrix, the overall cancer detection sensitivity calculated was 99.7% whereas the specificity was 99.3%. The ROC-AUC curve obtained for the CDAI model results is also shown in Fig. 4C. The sensitivity of our CDAI algorithm for correctly identifying samples within each cancer type as cancer-positive is given in Table 3. It is evident from the results shown in this table that, barring one sample from the cervical cancer subset and another from the thyroid cancer subset, all other samples were correctly identified as cancer-positive. These results confirm that our pipeline of untargeted serum metabolomics coupled with data analysis using our CDAI algorithm provides for cancer detection with very high sensitivity and specificity. Importantly, given that the majority of samples across all 15 cancers were either from Stage-0 or Stage-I of the disease, the results in Table 3 also underscore the particular utility of our method for early-stage cancer detection.

Table 3 Accurate identification of cancer-positive samples by the CDAI algorithm.

Full size table

An artificial intelligence algorithm for determination of tissue of origin (TOOAI)

In the second step, our aim was to layer a multiclass AI model (tissue of origin, or, TOOAI model) on top of the CDAI model that would act on the cancer-positive samples from Table 3 to generate a multiclass score for each sample. That is, our aim was to score the relative probability with which the TOO of a given sample corresponded to each of the 15 cancer types that were being tested. Based on this grading then, it should be possible to identify the most likely TOO for that sample.

Our cumulative set of 1926 cancer samples included those from endometrial cancer (n = 304), breast cancer (n = 303), cervical cancer (n = 250), ovarian cancer (n = 262), lung cancer (n = 81), leukemia (n = 71), thyroid cancer (n = 70), melanoma (n = 86), colorectal cancer (n = 87), kidney cancer (n = 80), lymphoma (n = 50), pancreatic cancer (n = 75), liver & bile duct cancer (n = 34), gastric cancer (n = 85), head & neck cancer (n = 88). The matrix data generated for these samples was randomly partitioned into training and test datasets in equal proportion as shown in Fig. 5 and Table 4. Then, a SVM multiclass classification model was made using the training samples to generate the TOOAI algorithm. The TOOAI algorithm was applied on those samples identified as cancer-positive by the CDAI algorithm, which generated 15 scores for each sample. Here, for a given samples, each score defined the probability of that sample belonging to one of the fifteen classes, or cancer types.

Table 4 Distribution of samples between the training and testing sets for development of the TOOAI algorithm.

Full size table

The multiclass classification TOOAI model was made using the training samples. The trained algorithm estimated tissue of origin probability of each of the sample, for each of the 15 cancer types, according to the formulae below:

$$\mathrm{P}(\mathrm{Endometrial})=\frac{1}{1+{e}^{y0+y1*I1+y2+I2+\cdots \cdots .}}$$

$$\mathrm{P}(\mathrm{Breast})=\frac{1}{1+{e}^{a0+y1*I1+y2+I2+\cdots \cdots .}}$$

$$\mathrm{P}(\mathrm{Cervical})=\frac{1}{1+{e}^{a1+y1*I1+y2+I2+\cdots \cdots .}}$$

$$\mathrm{P}(\mathrm{Ovarian})=\frac{1}{1+{e}^{a2+y1*I1+y2+I2+\cdots \cdots .}}$$

$$\mathrm{P}(\mathrm{Thyroid})=\frac{1}{1+{e}^{a3+y1*I1+y2+I2+\cdots \cdots .}}$$

$$\mathrm{P}(\mathrm{N})=\frac{1}{1+{e}^{an+y1*I1+y2+I2+\cdots \cdots .}}$$

Here, a₀, a₁, a₂,…., a_n are constant number, I_i (1 ≤ i ≤ 8312) is the Normalized intensity of metabolite i present in the respective sample. N is number of cancer type classes included in the training set.

The models were first assessed on the basis of their single class accuracy, wherein the first prediction (i.e. the highest probability score) was taken as the correct identification of the cancer TOO for a given sample. This analysis yielded an average accuracy across 15 cancers of 81% 95 CI (78.9–81.6) (results not shown). To further improve the accuracy, therefore, we considered a double-class prediction model in which the correct TOO likely occurred within the top two predictions from the model, calculated on the basis of the probability functions obtained as defined above. The double class prediction accuracies were evaluated for the test dataset and the confusion matrix for the final prediction is shown in Fig. 6. Double class prediction accuracy was obtained from the model by using the following formula:

$$\mathrm{Accuracy}= \frac{\mathrm{Total \, correctly \, predicted \, sample }(\mathrm{True \, prediction }\cap \mathrm{ Prediction}(\mathrm{1,2})\in \mathrm{max}(\mathrm{P}\left(\mathrm{breast}\right),\mathrm{ P}\left(\mathrm{Uterine}\right),\dots ..,\mathrm{P}(\mathrm{N}))}{\mathrm{Total \, number \, of \, sample \, in \, Cancer \, subclass}}$$

Table 5 gives the results obtained for the double class prediction analysis. The significant improvement in prediction accuracy is evident here, which ranged from a low of 82% for gastric cancer to as high as 100% for Non-Hodgkin’s lymphoma and pancreatic cancer. Of the total of 862 cancer samples that were tested, TOO of 795 were correctly predicted resulting in an average accuracy of 92.2% (Table 5).

Table 5 Determination of the tissue of origin by the TOOAI algorithm.

Full size table

Robustness of the CDAI

We also wanted to assess whether our method was subject to the vagaries of batch specific variability that is often seen in mass spectrometry data⁴⁶. For this, we performed an experiment using a sample set that comprised of a pre-defined number of samples from each of the 15 cancers and normal controls as shown in Table 6 and Supplementary Table 3. This sample set was subsequently analysed over multiple times at intervals of 4–6 weeks, spanning a total period of 18 months. Analysis involved a UPLC-MS/MS run for the individual samples, followed by determination of the cancer-positive versus cancer-negative status with the CDAI algorithm. A total of ten such test runs were performed and the results, in terms of the CDAI accuracy, are shown in Table 6 and illustrated in Supplementary Fig. 4. Importantly, the coefficient of variation for the net sensitivity for cancer detection obtained across these ten test runs was as low as 0.003 (Supplementary Fig. 4), confirming the robustness of our overall methodology. We believe that this is a significant finding from the standpoint of further development of our approach as a possible MCED test.

Table 6 Robustness evaluation of the pipeline for cancer detection.

Full size table

Identification of features critical for cancer detection

Our matrix features were able to recognize named metabolites in the HMDB database. This renders our model results more amenable towards gaining useful insights into the metabolic adaptations that seemingly correlate with cancer development. To facilitate such future analysis, we sought to short-list those metabolites that contributed significantly to the cancer-specific signatures detected by the CDAI algorithm. For this we employed feature ranking, wherein weights of the CDAI model’s individual features—or named metabolites—involved in distinguishing between cancer and normal control samples were first sorted. Subsequently, these features were ranked using the recursive feature elimination technique, which involved elimination of one feature at a time. The top ranking metabolites that resulted from this exercise are listed in Supplementary Table 2.

Discussion

Efforts to improve the detection of cancers at an early stage are currently being spurred by the development of MCED tests based on ctDNA analysis, which can detect multiple cancers with a single blood draw⁴⁷. The attraction posed by such tests is that they facilitate detection of many additional cancers that would otherwise remain undetected until later stages, when prognosis is generally poor⁴⁸. Mathematical modelling has suggested that inclusion of MCED tests to usual care can yield a significant positive effect in terms of substantially reducing the overall cancer mortality⁴⁹. Nonetheless, despite the potential shown by ctDNA-based MCED tests, concerns have emerged that this approach may not represent a satisfactory proxy for biopsies of tumour tissues, especially for early-stage cancer detection. These concerns derive from the fact that the detection sensitivity and/or specificity obtained for early-stage cancers has generally been less than satisfactory^50,51. While the low to negligible concentration of ctDNA present in the early stages of cancer has been identified as the primary cause, ctDNA variability based on type and status of the tumour is also another likely complicating factor^52,53,54.

To circumvent the limitations inherent to ctDNA-based MCED tests, we had adopted an alternate approach that built on the widely accepted notion that metabolites serve better as proximal reporters of disease since their relative abundances are often directly related to pathogenic mechanisms^55,56,57. Accordingly, we employed untargeted metabolomics—using high-resolution mass spectrometry—to generate a dense representation of the serum metabolome. The resulting data was then deconvoluted using a set of machine learning algorithms to first distinguish between cancer-positive and cancer-negative cases, followed by a further analysis of samples from the former group to identify the likely tissue of origin of the cancer. As demonstrated in our earlier report³⁹, this approach was indeed unconstrained by the factors that limit accuracy of ctDNA-based MCED tests, at least for the cancers evaluated. An overall detection accuracy of as high as 98% was achieved for Stage-0/I of these cancers (³⁹, manuscript submitted). However, since this previous study was restricted to detection of only the four female-specific cancers, we wanted to explore whether additional cancers could also be brought within the scope of this test. In this report we examined the ability of this method to detect a total of fifteen cancers in women, including the four previously evaluated female-specific cancers.

Results described here confirm that our approach of integrating untargeted metabolomics with machine learning-powered data analytics has strong potential for development as a high-fidelity screening test for early-stages of multiple cancers. This is evident from our finding that all 15 cancers tested could be detected with a sensitivity that ranged from 94 to 100%, at a specificity of 99.3%. Importantly, these cancers also included those that are known to be notoriously difficult to detect such as cancers of the pancreas, lung, kidney, ovary, liver, and sarcoma. Besides the fact that the early stages are largely asymptomatic, detection of these cancers is further hindered by their occurrence in tissues that are not readily accessible. The absence of any overt or specific symptoms in the early stages is also a characteristic of many of the remaining cancers in our list, as a result of which they too normally tend to go undetected for long periods of time^58,59,60. Despite these inherent impediments, however, we were able to uniformly detect each of the individual cancers with high sensitivity and specificity. That is, our method facilitates concurrent detection of multiple cancers including those that are intractable to discernment by available screening modalities.

The fact that the samples employed for all cancer types were primarily derived from patients in Stage-I of the disease is another notable aspect of our study. As described in Table 1, while 31% of pancreatic cancer samples were from Stage-I, the proportion of early-stage cancers (Stage-0/I) was between 70 and 80% in the case of melanoma, colorectal, liver and bile, gastric, and head & neck cancers. For kidney cancer 86% of samples employed were derived from Stage-I, while this proportion was 96% for thyroid cancer and 98% (Stage-0/I) for lung cancer. For the remaining cancers, all samples tested were from Stage-0/I of the disease (see Table 1). Thus, given the preponderance of very early-stage cancer samples in our test set, results presented in this report go to further substantiate the unique capability of our method to accurately detect early-stages of at least the spectrum of cancers that were tested. This feature represents a significant advance given that early-stage detection has for long persisted as one of the principle challenges in the field of cancer diagnosis. Our earlier inference that a metabolomics-based approach is not confounded by limitations that plague ctDNA, circulating tumour cells, or protein biomarker-based strategies³⁹, is also reinforced by these findings.

In addition to cancer detection, the inclusion of a multiclass algorithm (TOOAI model) in the data analysis pipeline also allowed us to predict the likely tissue of origin for those samples that proved to be cancer positive. For each test sample, the TOOAI model generated a list of 15 probability scores that defined the likelihood with which the tissue of origin of that sample corresponded to each of the cancer types tested. While an assignment of cancer type simply on the basis of the highest probability score yielded an overall accuracy of about 81%, this could be further enhanced to 92% by considering a double-class prediction where the tissue of origin was circumscribed to within the two most likely cancer types. Thus, tandem analysis of the serum metabolome using two separate algorithms enabled cancer detection to be coupled with localization of the tissue of origin. Complementing cancer detection with identification of the tissue of origin should aid in directing the subsequent tests required for diagnostic confirmation of the cancer.

Although it is an accepted truism that early diagnosis of cancer can save lives this goal has, nonetheless, proven elusive to date. However, results presented in both our previous³⁸ and present study confirm that an interrogation of the serum metabolome, using a machine learning algorithm, for disease-specific metabolite signatures provides a fruitful strategy for detection of early-stage cancers with very high accuracy. Furthermore, by inclusion of a multiclass algorithm to further resolve the cancer-specific metabolite signatures, cancer detection could also be supplemented with tissue of origin identification. Thus, the approach described here clearly has potential for development as a multi-cancer screening test that is especially relevant for early-stage cancer detection and identification. We do acknowledge, however, that more rigorous clinical validation will be required before its potential can be translated into application in the field. Furthermore, the skewed distribution between the cancer and normal control samples in our sample set, as well as our adoption of a supervised approach for building the model, also demand a more rigorous assessment of the test robustness and reproducibility.

Another important question is the likely effect that comorbidities could have on the accuracy of our results. As shown in Supplementary Table 1, several of the donors for our sample set were those afflicted with other metabolism related diseases such as diabetes, heart disease, and hypertension among others (see Supplementary Table 1). The high cancer-detection accuracy that we, nonetheless, obtained suggests that at least these comorbidities do not exert a negative impact on the performance of our algorithm. This, however, does not rule out the possibility that there may be other classes of diseases (e.g. those related to inflammation, aging, etc.) that could affect the outcome. Studies are currently underway to address these diverse issues, and also evaluate the potential of our method as a multi-cancer screening test.

Methods

Sample details

A schematic of overall methodology is illustrated in Supplementary Fig. 5. A total of 1926 different cancer samples (Breast, Endometrial, Cervical, Ovarian, Lung, AML, Thyroid, Melanoma, Colorectal, Kidney, NHL, Pancreatic, Head & Neck, Gastric and Liver & bile duct) were taken to perform this study (Table 1, Supplementary Table 1). These samples were purchased from different biobanks such as Dx Biosamples (San Diego, CA), Reprocell USA Inc. (Beltsville, MD), and Fidelis Research AD (Beltsville, MD) (Sofia, Bulgaria). Samples including both cancer and healthy controls were from these three biobanks. The distribution of total number of cancer samples among these biobanks were 742 from Dx Biosciences, 807 from Reprocell USA and 377 from Fidelis Research AD. While, 150 Normal samples from Dx Biosciences, 100 from Reprocell USA and 50 from Fidelis Research AD. Samples obtained were categorically from treatment naïve women patients in various stages of the individual cancers (Table 1). Clinical information of the samples that included histological stage and grade, along with cancer’s TNM classification of the cancer.

Sample indexing

A unique identification number was used to index the samples. To correctly assign samples for extraction and relative registering of result output, this number was used. This number was used to track and recall each sample with derived aliquots. Samples were kept at −80 °C until they were processed.

Extraction of metabolites from serum samples

Extraction of metabolites from serum was done as previously described³⁹. Briefly, serum samples were thawed on ice and then mixed prior to extraction. For metabolite extraction, 10 µl of serum was aliquoted into a 1.5 ml microcentrifuge tube (Genaxy, Cat No. GEN-MT-150-C. S). To this, 30 µl of chilled methanol (Merck, Cat. No. 1.06018.1000) was added and briefly vortexed. This mixture was then kept at −20 °C for 60 min.

After the incubation, the sample was centrifuged (Sorvall Legend Micro17, Thermo Fisher Scientific, Cat. No. Ligend Micro 17) at 10,000 rpm for 10 min. Supernatant (27 µl) was then carefully aspirated into a fresh microfuge tube without disturbing the pellet. Speed vacuum (ThermoFisher Scientific, Cat. No. SPD1030-230) was employed at low energy for 30 min to dry the supernatant. This dried sample was either stored at −80 °C for later use or reconstituted immediately with 30 µl methanol: water (1:1) for LCMS injection.

Liquid chromatography and mass spectrometric (LC–MS/MS) analysis of serum metabolites

An untargeted metabolomics approach was adopted for this study³⁹. In this method, a scan range (66.7–1000 Da) was typically selected to capture the metabolite pattern in the sample. A Dionex LC system (Ultimate 3000) coupled with a QExactive (Thermo Scientific) mass spectrometer was employed for this analysis. Samples were analyzed using positive polarity in ESI ionization, after injecting 10 µl of sample onto an Acquity UPLC HSS T3 column (Waters, 1.8 micron, 2.1 × 100 mm, Part No. 186003539). The working temperature for this stationary phase was 37 °C. Mobile phase A (water + 0.1% formic acid) and mobile phase B (methanol + 0.1% formic acid) was used for the gradient where the total run time was14 minutes. The gradient was initially held constant for one minute at 5% B. Then, was increased to 95% B in 7 min and was held for another 2 min at 95% B. Finally, gradient returned to 5% B by 14 min. The eluent was connected online to the QExactive source for ionization using 4 kV of voltage. The mass spectrometer was calibrated with the vendor recommended schedule to maintain the mass accuracy of 5 ppm. Optimized resolution for sample run was 70,000 with an AGC target of 1e6.

Maintenance of mass spectrometric data quality

Mass spectrometry data variation was reduced by combining several controls with the experimental samples. Instrument performance and chromatographic alignment over the time was maintained using the QC samples. Additionally, this QC also served as a technical replicate throughout the study. A blank gradient run was incorporated after each sample injection to reduce the carryover problem associated with the stationary phase.

Pre-processing of mass spectrometry data prior to AI workflow

The mass spectrometry data produced frequently varies between batches. We sought to control this variation by incorporating a set of pre-processing steps before applying the AI workflow. While a schematic of the overall process is depicted in Figs. 2 and 5, the individual steps involved were as follows.

1.
Inclusion of mass error in the data: Despite using very rigorous procedure to avoid possible variation in data, mass errors are prevalent in metabolomics data. This error results in slightly different masses for the same metabolite in two samples. This posed a challenge to compare the intensity of the same metabolite across samples. This step of intensity comparison is essential to form patterns that are required in AI data analysis. As previously described³⁹, an approach of adaptive virtual lock mass (VLM) was used to counter such variations. In principle, this approach relies on the fact that mass errors increase with increasing mass. We adapted this approach with our dataset and combined parts per million (ppm) mass errors with the metabolite identified by HMDB database. VLM boxes were created in alignment with the masses of metabolites identified by HMDB database and searched across the batches. The outcome of this exercise was an initial matrix of 8312 metabolites or features. Further, this matrix was trimmed with the removal of both plant or plant-derived, and drug or drug-derived metabolites. This resulted in a refined matrix of 5104 metabolites or features.
2.
Data filtering: The presence of noise in a data set can increase the model complexity and time of learning, which degrades the performance of learning algorithms. Data filtering is a process of noise reduction as well as dimensionality reduction by which an initial set of raw data contains target specific attributes and is reduced to a more manageable data format.
3.
Data normalization/standardization: Normalization techniques are required to reduce the variations in the data since the metabolic data fluctuate under different mass spectrometer parameters. Different normalization methods were tried such as Quantile Normalization, Variance Stabilization Normalization, Best Normalization, Probabilistic Quotient Normalization.

Data standardization is a data processing workflow that converts the structure of different datasets into one common format of data. It deals with the transformation of datasets after the data is collected from different sources and before it is loaded into target systems. Various data standardization methods like standard normalization, L1 and L2 norm standardization were employed in the data set.

A combination of Standardization and Normalization was used for the two-tiered algorithm. We found Quantile normalization was best suited for CDAI based on the accuracy in the training set and across the validation batches. This method was further adapted to our datasets to enable the normalization of new samples with respect to training datasets and testing one sample at a time. For TOOAI the raw data was first transformed using log base 10, and then subjected to Quantile normalization followed by standard scaler standardization.
4.
Missing value imputation: It is well established that missing values in untargeted metabolomics data can be troublesome. In large metabolite panels, measurement values are frequently missing and, if neglected or sub-optimally imputed, can cause biased study results. Various supervised and unsupervised multiple imputation techniques like Iterative Imputer, missforest, simple impute, KNN impute were employed and the effects of sample size, percentage missing, and correlation structure on the accuracy of the imputation methods were evaluated. Finally, KNN imputation (n_neighbours = 5) was chosen out to be the most appropriate for our dataset. For CDAI we imputed the whole dataset uniformly. However, we followed selective imputation for TOOAI algorithm, where we selectively imputed 15 cancer classes in the training set, but the test set was kept non imputed.
5.
Feature reduction: Dimensionality reduction is the process of reducing the number of random variables under consideration, by obtaining a set of principal variables. This is a critical step in high dimensional data as it takes care of curse of dimensionality, multi-collinearity, noise, computational cost, and visualization. Feature Extraction can be unsupervised (PCA) or supervised (LDA, PLS-DA etc.). Various Feature reduction techniques were evaluated based on data variance capture and class separation namely PLSDA R2 maximization, RFE, PCA, Non-negative Matrix Factorization, LDA. These were evaluated on the basis of their effect on overall accuracy in CDAI classification and, finally, PLSDA was used for feature reduction in CDAI as well as for the TOOAI.
6.
Machine learning model development: After completing the above pipeline, the data was then fed into the AI machinery. AI models were made to differentiate cancers from normal and then between the individual cancers.

Keeping in mind the potential clinical applications of our data analysis pipeline, a tiered approach was used here in which an AI model was first developed for cancer signal detection (the CDAI Model). Following this, the TOOAI Model was developed to classify the tissue of origin for the cancer positive sample. The total 2226 samples taken for this study included endometrial cancer (n = 304), breast cancer (n = 303), cervical cancer (n = 250), ovarian cancer (n = 262), lung cancer (n = 81), leukemia (n = 71), thyroid cancer (n = 70), melanoma (n = 86), colorectal cancer (n = 87), kidney cancer (n = 80), lymphoma (n = 50), pancreatic cancer (n = 75), liver & bile duct cancer (n = 34), gastric cancer (n = 85), head & neck cancer (n = 88), and 300 normal control samples. The matrix produced from the data generated from these samples was then utilized for further analysis as described in “Results”.

Development of the algorithm for the CDAI model

Of the total of 2226 samples, 1926 samples were from the 15 cancer classes and 300 were normal controls. Normal controls were samples from volunteers who had no cancer. The data was randomly partitioned into training and test datasets in equal proportion. This resulted in 966 Cancer samples and 150 Normal Controls in training set, and 960 Cancer samples and 150 Normal Controls in test set (Table 2). A complete schematic of the steps for cancer detection is shown in Fig. 2 and the model was evaluated using parameters log loss, Accuracy, Sensitivity, Specificity.

Parametric machine learning models were applied on the training data to obtain a score function depending on the intensity values of the features. The Class balancing parameters were configured in the model to deal with the imbalance of cancer and the control samples in the training dataset. The final trained model generated a score of each sample by using the following formulae:

$$ {\text{y}}\_{\text{score}} = {\text{x}}_{0} + {\text{x}}_{{1}} *{\text{I}}_{{1}} + {\text{ x}}_{{2}} *{\text{I}}_{{2}} + {\text{ x}}_{{3}} *{\text{I}}_{{3}} + \cdots \cdots \, + {\text{x}}_{{\text{n}}} *{\text{I}}_{{\text{n}}} $$

Here, × 0 is a constant number, I_i (1 ≤ i ≤ n) is the intensity of metabolite i present in the respective sample. The total number of metabolites is represented by the symbol n(n ∈ [1000,5104]). Supplementary Fig. 3 gives the value of coefficient x_i(1 ≤ i ≤ n) for each metabolite.

The model was cross validated using 1000 random train test split and the average sensitivity, specificity, and accuracy at 95 CI was obtained. The y score plot of the trained model as applied on test set for a single partition of data containing 15 cancer classes and normal control is shown in Fig. 4. The scatter plot shows the Model Score for Controls and Cancer cases. The ROC-AUC probability curve showed a high degree of separability between the cancer and the normal controls. The model scores are clearly seen to be different between Controls and Cancer samples where on applying a threshold of y-score of zero to differentiate between two types of results in a confusion matrix as shown. Sensitivity, Specificity, and Accuracy can be calculated from the below formulae:

$$\mathrm{Accuracy}: \frac{TP+TN}{TP+TN+FP+FN}$$

$$\mathrm{Sensitivity}: \frac{TP}{TP+FN}$$

$$\mathrm{Specificity}: \frac{TN}{TN+FP}$$

	Predicted
		Negative	Positive
Actual	Negative	True negative (TN)	False positive (FP)
Actual	Positive	False negative (FN)	True positive (TP)

Development of the algorithm for the TOOAI model

In brief, the TOOAI model is a multiclass algorithm that evaluates the probability score for each cancer positive sample, which defines the tissue from which the cancer positive signal has originated. For developing this algorithm the dataset containing the cancer samples was first processed according to the steps explained in the earlier section. Here, out of total 1926 Cancer samples, samples were Endometrial Cancer, Breast Cancer, Cervical Cancer, Ovarian Cancer, Lung Cancer, Kidney Cancer, Thyroid cancer, Acute myeloid lymphoma, non-Hodgkin’s lymphoma, Pancreatic cancer, Colorectal cancer, Liver cancer, Gastric cancer, Melanoma cancer, head & neck cancer (Table 1). The data was randomly partitioned into training and test datasets in equal proportion and complete distribution of training and testing distribution in this layer is shown in Table 4.

The Machine learning environment was set for python 3.10.4. Various algorithms for example Support Vector Machine (SVM), Logistic one versus rest (LOVR), Stochastic gradient descent (SGD) algorithms etc. were evaluated in order to ascertain the best possible model for cancer type identification. The optimal set of hyperparameters for these models were obtained using exhaustive training testing by python Grid search CV package.

This Predict probability output of these models resulted in `15 probability scores for each sample, with each score defining probability of the respective sample belonging to one of the 15 cancer tissue types. The models were assessed based on their single class prediction accuracy and the best model was chosen for further evaluation. Out of these SVM gave the best test accuracy which was further cross validated using 100 random train-test split of the data. The trained algorithm finds tissue of origin probability for each of the sample according to the formulae below:

$$\mathrm{P}(\mathrm{Endometrial})=\frac{1}{1+{e}^{y0+y1*I1+y2+I2+\cdots \cdots .}}$$

$$\mathrm{P}(\mathrm{Breast})=\frac{1}{1+{e}^{a0+y1*I1+y2+I2+\cdots \cdots .}}$$

$$\mathrm{P}(\mathrm{Cervical})=\frac{1}{1+{e}^{a1+y1*I1+y2+I2+\cdots \cdots .}}$$

$$\mathrm{P}(\mathrm{Ovarian})=\frac{1}{1+{e}^{a2+y1*I1+y2+I2+\cdots \cdots .}}$$

$$\mathrm{P}(\mathrm{Thyroid})=\frac{1}{1+{e}^{a3+y1*I1+y2+I2+\cdots \cdots .}}$$

$$\mathrm{P}(\mathrm{N})=\frac{1}{1+{e}^{an+y1*I1+y2+I2+\cdots \cdots .}}$$

Here, a₀, a₁, a₂,…., a_n are constant number, I_i (1 ≤ i ≤ 5104) is the Normalized intensity of metabolite i present in the respective sample. N is the number of cancer type classes included in the training set.

Using the scores for each class obtained we defined a double class prediction accuracy of the model, here the double class prediction accuracy will mean an occurrence of correct prediction in the top two predictions from the model using the above defined probability function.

The double class prediction accuracies were evaluated for the single test dataset as an example and the confusion matrix for the final prediction are shown in Fig. 6. Table 5 shows double class prediction accuracy for the same. The prediction accuracy for the double class prediction from the model were evaluated using the following formulae:

$$\mathrm{Accuracy}= \frac{\mathrm{Total \, correctly \, predicted \, sample }(\mathrm{True \, prediction }\cap \mathrm{ Prediction}(\mathrm{1,2})\in \mathrm{max}(\mathrm{P}\left(\mathrm{breast}\right),\mathrm{ P}\left(\mathrm{Uterine}\right),\dots ..,\mathrm{P}(\mathrm{N}))}{\mathrm{Total \, number \, of \, sample \, in \, Cancer \, subclass}}$$

Data availability

The datasets used and/or analysed during the current study available from the corresponding author on reasonable request.

References

Bray, F., Laversanne, M., Weiderpass, E. & Soerjomataram, I. The ever-increasing importance of cancer as a leading cause of premature death worldwide. Cancer 127, 3029–3030 (2021).
Article PubMed Google Scholar
Fitzmaurice, C. et al. Global, regional, and national cancer incidence, mortality, years of life lost, years lived with disability, and disability-adjusted life-years for 29 cancer groups, 1990 to 2016: A systematic analysis for the Global Burden of Disease study. JAMA Oncol. 4, 1553–1568 (2018).
Article PubMed Google Scholar
Bray, F. et al. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 68, 394–424 (2018).
Article PubMed Google Scholar
World Health Organization. Guide to Early Cancer Diagnosis. https://apps.who.int/iris/bitst ream/handle/10665/254500/9789241511940-eng.pdf?sequence=1&isAllowed=y (2017).
Ofman, J.J. & Raza, A. Taking Early Cancer Detection to the Next Level. (Scientific American, 2020).
Ofman, J. J. & Raza, A. The cancer detection rate—A public health approach to early detection. Cancer Lett. 46, 48–50 (2020).
Google Scholar
Clarke, C. A. et al. Projected reductions in absolute cancer-related deaths from diagnosing cancers before metastasis, 2006-2015. Cancer Epidemiol. Biomark. Prev. 29, 895–902 (2020).
Article Google Scholar
Hawkes, N. Cancer survival data emphasise importance of early diagnosis. Br. Med. J. 364, l408 (2019).
Article Google Scholar
McPhail, S., Johnson, S., Greenberg, D., Peake, M. & Rous, B. Stage at diagnosis and early mortality from cancer in England. Br. J. Cancer 112(Suppl 1), S108–S115. https://doi.org/10.1038/bjc.2015.49 (2015).
Article PubMed PubMed Central Google Scholar
Etzioni, R. et al. The case for early detection. Nat. Rev. Cancer 3, 243–252 (2003).
Article CAS PubMed Google Scholar
Cancer Research UK. Why is Early Diagnosis Important? https://www.cancerresearchuk.org/about-cancer/cancer-symptoms/why-is-early-diagnosis-important. Accessed 22 Mar 2021 (2021).
Zappa, C. & Mousa, S. A. Non-small cell lung cancer: Current treatment and future advances. Transl. Lung Cancer Res. 5(3), 288–300 (2016).
Article CAS PubMed PubMed Central Google Scholar
The American Cancer Society. “Key Statistics About Stomach Cancer” (2020)
American Society of Clinical Oncology. “Pancreatic Cancer: Statistics” (2020)
Yang, P. Epidemiology of Lung Cancer Prognosis: Quantity and Quality of Life, Methods of Molecular Biology (2020)
NIH. “Cancer Stat Facts: Stomach Cancer” (2020)
Cancer Survival in England: Adult, Stage at Diagnosis and Childhood—Patients Followed up to 2018. (Statistical Bulletin, Office for National Statistics, 2019).
Curry, S. J. et al. Screening for cervical cancer: US preventive services task force recommendation statement. J. Am. Med. Assoc. 320, 674–686 (2018).
Article Google Scholar
Bibbins-Domingo, K. et al. Screening for colorectal cancer: US preventive services task force recommendation statement. J. Am. Med. Assoc. 315, 2564–2575 (2016).
Article CAS Google Scholar
Grossman, D. C. et al. Screening for prostate cancer: US preventive services task force recommendation statement. J. Am. Med. Assoc. 319, 1901–1913 (2018).
Article Google Scholar
Siu, A.L.; On behalf of the U.S. Preventive Services Task Force. Screening for breast cancer: U.S. preventive services task force recommendation statement. Ann. Intern. Med. 164, 279–296 (2016).
Moyer, V. A. et al. Screening for lung cancer: U.S. preventive services task force recommendation statement. Ann. Intern. Med. 160, 330–338 (2014).
PubMed Google Scholar
Smith, R. A. et al. Cancer screening in the United States, 2019: a review of current American Cancer Society guidelines and current issues in cancer screening. CA Cancer J. Clin. 69, 184–210 (2019).
Article PubMed Google Scholar
Daskalakis, C. et al. Predictors of overall and test-specific colorectal cancer screening adherence. Prev. Med. 133, 106022 (2020).
Article PubMed PubMed Central Google Scholar
Black, W. C. Overdiagnosis: An underrecognized cause of confusion and harm in cancer screening. J. Natl. Cancer Inst. 92, 1280–1282 (2000).
Article CAS PubMed Google Scholar
Srivastava, S. et al. Cancer overdiagnosis: A biological challenge and clinical dilemma. Nat. Rev. Cancer 19, 349–358 (2019).
Article CAS PubMed PubMed Central Google Scholar
Goossens, N., Nakagawa, S., Sun, X. & Hoshida, Y. Cancer biomarker discovery and validation. Transl. Cancer Res. 4, 256–269 (2015).
CAS PubMed Google Scholar
Lennon, A. M. et al. Feasibility of blood testing combined with PET-CT to screen for cancer and guide intervention. Science https://doi.org/10.1126/science.abb9601 (2020).
Article PubMed PubMed Central Google Scholar
Hackshaw, A. et al. Estimating the population health impact of a multi-cancer early detection genomic blood test to complement existing screening in the US and UK. Br. J. Cancer https://doi.org/10.1038/s41416-021-01498-4 (2021).
Article PubMed PubMed Central Google Scholar
Atlihan-Gungdogdu, E. et al. Recent developments in cancer therapy and diagnosis. Pharm. Invest. 1, 1. https://doi.org/10.1007/s40005-020-00473-0 (2020).
Article Google Scholar
Liu, M. C., Oxnard, G. R., Klein, E. A., Swanton, C. & Seiden, M. V. Sensitive and specific multi-cancer detection and localization using methylation signatures in cell-free DNA. Ann. Oncol. 31, 745–759 (2020).
Article CAS PubMed Google Scholar
Li, B. et al. Abstract A06: Multiplatform analysis of early-stage cancer signatures in blood. Clin. Cancer Res. 26, A06 (2020).
Article Google Scholar
Shen, S. Y. et al. Sensitive tumour detection and classification using plasma cell-free DNA methylomes. Nature 563, 579–583 (2018).
Article ADS CAS PubMed Google Scholar
Chen, X. et al. Non-invasive early detection of cancer four years before conventional diagnosis using a blood test. Nat. Commun. 11, 3475 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Cohen, J. D. et al. Detection and localization of surgically resectable cancers with a multi-analyte blood test. Science. 359, 926–930 (2018).
Article ADS CAS PubMed PubMed Central Google Scholar
Klein, E. A. et al. Clinical validation of a targeted methylation-based multi-cancer early detection test using an independent validation set. Ann. Oncol. 32, 1167–1177 (2021).
Article CAS PubMed Google Scholar
Ren, A. H., Fiala, C. A., Diamandis, E. P. & Kulasingam, V. Pitfalls in cancer biomarker discovery and validation with emphasis on circulating tumor DNA. Cancer Epidemiol. Biomark. Prev. 29, 2568–2574 (2020).
Article CAS Google Scholar
Shieh, Y. et al. Population-based screening for cancer: hope and hype. Nat. Rev. Clin. Oncol. 13(9), 550–565 (2016).
Article CAS PubMed PubMed Central Google Scholar
Gupta, A. et al. A non-invasive method for concurrent detection of early-stage women-specific cancers. Sci. Rep. 12, 2301. https://doi.org/10.1038/s41598-022-06274-9 (2022).
Article ADS CAS PubMed PubMed Central Google Scholar
Beger, R. D. A review of applications of metabolomics in cancer. Metabolites 3, 552 (2013).
Article CAS PubMed PubMed Central Google Scholar
Wang, et al. Application of metabolomics in cancer research: As a powerful tool to screen biomarker for diagnosis, monitoring and prognosis of cancer. Biomark. J. https://doi.org/10.21767/2472-1646.100050 (2018).
Article Google Scholar
Schmidt, D. R. et al. Metabolomics in cancer research and emerging applications in clinical oncology. CA Cancer J. Clin. 71, 333–358 (2021).
Article PubMed PubMed Central Google Scholar
Zhou, Z., Ibekawa, E., & Chornenkyy, Y. Metabolic alterations in cancer cells and the emerging role of oncometabolites as drivers of neoplastic change. Antioxidants 7(1), 16. https://doi.org/10.3390/antiox7010016 (2018).
Gyamfi, J., Kim, J. & Choi, J. Cancer as a metabolic disorder. Int. J. Mol. Sci. 23(3), 1155. https://doi.org/10.3390/ijms23031155 (2022).
Article CAS PubMed PubMed Central Google Scholar
Beger, R. D. et al. Metabolomics enables precision medicine: “A white paper, community perspective”. Metabolomics https://doi.org/10.1007/s11306-016-1094-6 (2016).
Article PubMed PubMed Central Google Scholar
Phua, S. X., Lim, K. P. & Goh, W. W. Perspectives for better batch effect correction in mass-spectrometry-based proteomics. Comput. Struct. Biotechnol. J. 12(20), 4369–4375. https://doi.org/10.1016/j.csbj.2022.08.022.PMID:36051874;PMCID:PMC9411064 (2022).
Article Google Scholar
Liu, M. C. Transforming the landscape of early cancer detection using blood tests—Commentary on current methodologies and future prospects. Br. J. Cancer 124, 1475–1477 (2021).
Article PubMed PubMed Central Google Scholar
Ahlquist, D. A. Universal cancer screening: Revolutionary, rational, and realizable. npj Precis. Oncol. 2, 23 (2018).
Article PubMed PubMed Central Google Scholar
Hubbell, E., Clarke, C. A., Aravanis, A. M. & Berg, C. D. Modeled reductions in late-stage cancer with a multi-cancer early detection test. Cancer Epidemiol. Biomark. Prev. 30, 460–468 (2021).
Article CAS Google Scholar
Fiala, C. & Diamandis, E. P. Circulating tumor DNA (ctDNA) is not a good proxy for liquid biopsies of tumor tissues for early detection. Clin. Chem. Lab. Med. 58, 1651–1653 (2020).
Article CAS PubMed Google Scholar
Keller, L., Belloum, Y., Wikman, H. & Pantel, K. Clinical relevance of blood-based ctDNA analysis: Mutation detection and beyond. Br. J. Cancer 124, 345–358 (2021).
Article PubMed Google Scholar
Yasai, H. Challenges in circulating tumor DNA analysis for cancer diagnosis. J. Nanomed. Res. 7, 76–80 (2018).
Google Scholar
Fiala, C. & Diamandis, E. P. Utility of circulating tumor DNA in cancer diagnostics with emphasis on early detection. BMC Med. 16, 166. https://doi.org/10.1186/s12916-018-1157-9 (2018).
Article CAS PubMed PubMed Central Google Scholar
Molparia, B., Nichani, E. & Torkamani, A. Assessment of circulating copy number variant detection for cancer screening. PLoS ONE 12, e0180647. https://doi.org/10.1371/journal.pone.0180647 (2017).
Article CAS PubMed PubMed Central Google Scholar
Shao, Y. & Le, W. Recent advances and perspectives of metabolomics-based investigations in Parkinson’s disease. Neurodegeneration https://doi.org/10.1186/s13024-018-0304-2 (2019).
Article Google Scholar
Berger, R. D. et al. Metabolomics enables precision medicine: “A white paper, community perspective”. Metabolomics https://doi.org/10.1007/s11306-016-1094-6 (2016).
Article Google Scholar
Puchades-Carrasco, L. & Pineda-Lucena, A. Metabolomics applications in precision medicine: An oncological perspective. Curr. Top. Med. Chem. 24, 2740–2752 (2017).
Google Scholar
Pashayan, N. & Pharoah, P. D. P. The challenge of early detection in cancer. Science 368, 589–590 (2020).
Article ADS CAS PubMed Google Scholar
Whitaker, K. Early diagnosis: The importance of cancer symptoms. Lancet 21, 6–8 (2020).
Article Google Scholar
Holtedahl, K. Challenges in early diagnosis of cancer: The fast track. Scand. J. Primary Health Care 38, 251–252 (2020).
Article Google Scholar

Download references

Author information

Authors and Affiliations

PredOmix Health Sciences Private Limited, 10 Anson Road, #22-02 International Plaza, Singapore, 079903, Singapore
Ankur Gupta, Zaved Siddiqui, Kanury V. S. Rao & Najmuddin Saquib
PredOmix Technologies Private Limited, Tower B, SAS Tower, Medicity, Sector-38, Gurugram, 122002, India
Ankur Gupta, Zaved Siddiqui, Ganga Sagar, Kanury V. S. Rao & Najmuddin Saquib

Authors

Ankur Gupta
View author publications
You can also search for this author in PubMed Google Scholar
Zaved Siddiqui
View author publications
You can also search for this author in PubMed Google Scholar
Ganga Sagar
View author publications
You can also search for this author in PubMed Google Scholar
Kanury V. S. Rao
View author publications
You can also search for this author in PubMed Google Scholar
Najmuddin Saquib
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

N.S., G.S., and Z.S. led all experimental aspects of the study. A.G. led all the analytical aspects of the study. N.S., A.G., and K.V.S.R. wrote the paper. N.S., and K.V.S.R. designed and supervised all aspects of the study.

Corresponding author

Correspondence to Najmuddin Saquib.

Ethics declarations

Competing interests

A.G, G.S., Z.S. N.S. and K.V.S.R. are fulltime employees of PredOmix Technologies Private Limited. K.V.S.R. is a cofounder and owns stock in both PredOmix Technologies Private Limited and PredOmix Health Sciences Pte. Ltd. A.G., Z.S., and NS. own stock in PredOmix Health Sciences Pte. Ltd. The work described in this report is included in an Indian Provisional Patent Application filing. Application No. 202311002270.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Figures.

Supplementary Table S1.

Supplementary Table S2.

Supplementary Table S3.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Gupta, A., Siddiqui, Z., Sagar, G. et al. A non-invasive method for concurrent detection of multiple early-stage cancers in women. Sci Rep 13, 19083 (2023). https://doi.org/10.1038/s41598-023-46553-7

Download citation

Received: 03 May 2023
Accepted: 02 November 2023
Published: 04 November 2023
DOI: https://doi.org/10.1038/s41598-023-46553-7

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Subjects

Abstract

Similar content being viewed by others

A single-cell atlas enables mapping of homeostatic cellular shifts in the adult human breast

Development and validation of a new algorithm for improved cardiovascular risk prediction

Best practices for single-cell analysis across modalities

Introduction

Results

Details of samples included in the study

Pre-processing of data prior to AI workflow

Cancer detection artificial intelligence (CDAI) algorithm for distinguishing cancer samples from normal controls

An artificial intelligence algorithm for determination of tissue of origin (TOOAI)

Robustness of the CDAI

Identification of features critical for cancer detection

Discussion

Methods

Sample details

Sample indexing

Extraction of metabolites from serum samples

Liquid chromatography and mass spectrometric (LC–MS/MS) analysis of serum metabolites

Maintenance of mass spectrometric data quality

Pre-processing of mass spectrometry data prior to AI workflow

Development of the algorithm for the CDAI model

Development of the algorithm for the TOOAI model

Data availability

References

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Publisher's note

Supplementary Information

Supplementary Figures.

Supplementary Table S1.

Supplementary Table S2.

Supplementary Table S3.

Rights and permissions

About this article

Cite this article

Share this article

Comments

Search

Quick links