Introduction

Since the successful completion of the Human Genome Project, which mapped the whole human genome, a massive number of genomic markers have been identified and applied to the medical sciences1. Many of them have been developed into routine clinical tests. However, a significant limitation of genomic or transcriptomic profiling is that genomic and transcriptomic data, which provide only indirect measurements of cellular states, may not accurately reflect the corresponding protein changes. These data fail to reveal changes in posttranslational modifications (PTMs), including phosphorylation and protein degradation. Therefore, genomic data alone cannot provide a full, comprehensive picture of disease mechanisms2. The Human Proteome Project (https://hupo.org/) has now been launched to characterize the entire human proteome with advanced proteomic techniques, the next major challenge3. Guidelines on interpreting proteomic data have also been published and recently updated4.

Proteomics, the combination of proteome experimentation and data analysis, examines protein composition, structure, expression, modification status, and the interactions and connections between proteins at a global level5. It offers information complementary to genomics and transcriptomics. It is also essential for generating a map of the complex, interconnected pathways, networks, and molecular systems that directly control major life activities such as cell proliferation, differentiation, senescence, and apoptosis. With the substantial improvement of experimental technology over the past decade6, proteomic methods have evolved from conventional approaches, such as immunohistochemistry (IHC) staining, western blot, and enzyme-linked immunosorbent assay (ELISA), to high-throughput methods such as tissue microarray (TMA), protein pathway array and mass spectrometry7. These high-throughput proteomic techniques not only decrease analysis time but also increase the accuracy and depth of proteome coverage. With the advent of bioinformatics and modern multi-analyte “omics” technologies (Supplementary Fig. 1), proteomics holds great promise for uncovering the molecular mechanisms that underlie diseases and for discovering novel biomarkers8 that can be used as specific diagnostic assays, prognostic predictors, and therapeutic targets to further personalized medicine9,10.

In this review, we will discuss the advances in high-throughput proteomic techniques, statistics and algorithms, progress in applying proteomics to disease diagnostics, current challenges and future perspectives.

High-throughput proteomic techniques

With the rapid development of high-throughput technology6, several new technologies have been widely adopted in proteomics and metabolomics in recent years. Regardless of the specific technique, these global proteomic approaches (Fig. 1) can be divided into three phases: discovery, network analysis and clinical proteomics. Discovery is the initial phase, in which unknown proteins are identified and characterized by amino acid sequence and structure11. In the network-analysis phase, the global signaling networks are built and the relationships among known proteins are investigated to explore and verify potential biomarkers. Finally, in the clinical proteomics phase12, clinical assays are developed to turn the biomarker or panel into a product that fits the clinical workflow. The commonly used high-throughput proteomic techniques include mass spectrometry, protein pathway array, next-generation tissue microarrays and Luminex, and they are discussed in detail below.

Fig. 1: The process of proteomics “from bench to bedside”.

Mass spectrometry (MS)-based methods, single-molecule proteomics (SMP) and single-cell proteomics (SCP) have been widely used to identify and quantify new proteins in the initial discovery stage. Protein pathway array (PPA) is a high-throughput technique to explore the regulation of protein-protein interactions, pathway-pathway interactions, and biological functions, and to place newly discovered proteins within cell signaling networks. Luminex, Meso-scale Discovery (MSD), Simoa and Olink are effective high-throughput methods for clinical validation after the proteomic markers are verified using tissue microarray (TMA).

Mass spectrometry

Mass spectrometry (MS) has developed into one of the most essential and popular tools to identify proteins and their isoforms and to quantify posttranslational modifications, either via the modified fragments directly or via the specific proteolytic activity responsible for their formation13,14,15. The most significant strength of MS-based quantitative proteomics is its ability to discover and detect an intact protein or a subset of composite or surrogate peptides, which traditional immunoassays find extremely challenging or impossible. MS can be combined with multiple separation and pre-fractionation techniques to identify the target protein/peptide and to improve identification accuracy and yield16. For example, two-dimensional polyacrylamide gel electrophoresis (2D-PAGE) separates proteins by electrical charge and molecular weight, whereas liquid chromatography (LC) separates them by polarity, electrical charge, and molecular weight. In 2D-PAGE, protein mixtures are separated by electrical charge (isoelectric point, pI) in the first dimension and further separated by molecular weight in the second dimension. Protein samples from different sources, labeled with different cyanine dyes such as Cy2, Cy3, and Cy5 as reporter fluorophores, can be processed on the same 2D-PAGE gel to purify the target protein and enhance detection accuracy17. After the gel is digitized with a fluorescence scanner and analyzed as an image18, the spots of interest are excised and enzymatically digested into peptides for MS, typically matrix-assisted laser desorption/ionization-time of flight (MALDI-TOF) MS, where each digest yields a peptide mixture that can be analyzed in a bottom-up experiment. Although 2D-PAGE has traditionally been a standard procedure in proteomics research, gel-based techniques tend to be labor-intensive and time-consuming, and are therefore not suitable for high-throughput proteomics. By contrast, LC or high-performance liquid chromatography (HPLC) allows continuous separation of thousands of proteins from complex mixtures and can be coupled to MS as LC-MS for increased throughput19,20,21. Among LC formats, reversed-phase liquid chromatography (RPLC) is the most commonly used separation platform. It is characterized by the distribution of compounds between a water-containing mobile phase and a relatively nonselective stationary phase, and other chromatography formats can be added prior to the RPLC separation to improve the dynamic range of measurement22.

According to the processing strategy, MS-based methods can be divided into top-down, bottom-up, and shotgun approaches. In top-down proteomics, a full-length protein is sent directly for MS analysis, fragmented inside the instrument, and the masses of the fragments are recorded23. By contrast, in bottom-up proteomics, proteins are enzymatically or chemically digested into peptides that serve as the input to the MS instrument. Shotgun proteomics is a particular case of bottom-up proteomics in which all proteins in a complex mixture, such as serum, urine, or cell lysate, are cut into peptides and analyzed by multidimensional HPLC-MS, aiming to generate a global profile of the protein mixture analogous to genome “shotgun” sequencing24. Separation of peptides prior to MS is not strictly required in the bottom-up strategy. The MS data are then matched against a protein sequence database to identify the target proteins and their associated modifications using database search engines25, a process that can be divided into peptide scoring, protein scoring, and finally protein inference26.
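
To illustrate the bottom-up matching idea in a simplified way, the sketch below performs an in-silico tryptic digest of a hypothetical two-protein "database" and matches a measured peptide mass against the theoretical peptide masses within a tolerance. The sequences, tolerance and rounded monoisotopic residue masses are illustrative assumptions; real search engines additionally score fragment (MS/MS) spectra rather than intact peptide masses alone.

```python
import re

# Approximate monoisotopic residue masses (Da); one water is added per peptide.
RESIDUE_MASS = {
    "G": 57.021, "A": 71.037, "S": 87.032, "P": 97.053, "V": 99.068,
    "T": 101.048, "C": 103.009, "L": 113.084, "I": 113.084, "N": 114.043,
    "D": 115.027, "Q": 128.059, "K": 128.095, "E": 129.043, "M": 131.040,
    "H": 137.059, "F": 147.068, "R": 156.101, "Y": 163.063, "W": 186.079,
}
WATER = 18.011

def tryptic_digest(sequence):
    """Cleave after K or R, but not before P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]

def peptide_mass(peptide):
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

# Hypothetical "database" of two short protein sequences.
database = {"protA": "MKWVTFISLLFLFSSAYSR", "protB": "GLSDGEWQQVLNVWGK"}

# Theoretical peptide masses from the in-silico digest.
theoretical = [(prot, pep, peptide_mass(pep))
               for prot, seq in database.items()
               for pep in tryptic_digest(seq)]

def match(measured_mass, tol=0.5):
    """Return theoretical peptides whose mass is within the tolerance."""
    return [(prot, pep, m) for prot, pep, m in theoretical
            if abs(m - measured_mass) <= tol]

print(match(peptide_mass("GLSDGEWQQVLNVWGK")))  # should hit protB
```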

Protein pathway array

Human diseases, especially cancers, often reflect complicated biological processes attributable to alterations in the protein-based signaling network pathways that control cell behaviors such as apoptosis, invasion, and metastasis27. Uncovering the underlying changes in multidimensional protein signaling networks not only aids in understanding the molecular mechanisms of pathogenesis but also identifies characteristic signaling network signatures that are unique to the type or stage of disease28,29,30,31,32. Measuring many proteins simultaneously is of great importance to the study of protein-protein interactions (PPIs) in signaling networks, which is a major challenge for conventional immunoassays such as western blot. Therefore, high-throughput proteomic tools are increasingly used for biomarker discovery in basic, translational and clinical research.

Protein pathway array (PPA)27, a gel-based high-throughput platform, employs antibody mixtures to detect antigens in a protein sample that can be extracted from biopsy or tissue. In this approach, microdissection of tumor tissue can be applied to maximize the proportion of proteins derived from tumor tissue rather than the surrounding benign tissue27. The immunofluorescence signals of the antibody-antigen reactions are then converted to numeric protein expression values using Quantity One (https://www.bio-rad.com/en-us/category/image-lab-software-suite?ID=5291f579-0715-48f4-b3de-766b92222582) from Bio-Rad. Biomarkers and proteomic networks can be explored and trained after data normalization and appropriate statistical modeling. PPA has been applied to many diseases, such as essential thrombocythemia33 and papillary thyroid carcinoma34. Its ability to generate high-throughput protein profiles in a robust, quantitative manner provides an advantage over traditional methods.

Next generation tissue microarrays

Immunohistochemistry staining, one of the traditional and reliable research methods, uses an enzyme-linked chromogenic substrate for detection and requires microscopic examination35. It remains a time-consuming and subjective process and produces only a qualitative or semiquantitative assessment of protein expression because of nonspecific staining and background noise. As technology advanced and high-throughput demand increased over the last decade, the TMA gradually became widely used in both research and clinical fields36,37. A TMA contains many small representative tissue cores from formalin-fixed paraffin-embedded (FFPE) or frozen blocks of hundreds of different cases assembled in an array on a single histologic slide, and therefore allows large-scale antibody-based molecular analysis of multiple samples at the same time37. It is thus a practical and valuable tool to confirm and verify new biomarkers generated by PPA or MS proteomic methods. It is often used in an independent cohort and identifies the location of the target proteins in the cell membrane, cytoplasm, or nucleus. As digital pathology with multiple smart microscopes has developed rapidly in recent years, a new approach to TMA, next-generation tissue microarrays (ngTMAs), was recently created35. It allows annotations to be placed directly on digital slides for higher accuracy. Two major advantages of ngTMAs are time efficiency and high throughput without major compromise on quality38. Owing to their improved sensitivity and rapid, large-scale detection capabilities, ngTMAs have become a powerful tool to improve the quality of TMAs used in clinical and translational research39,40,41 but could be more widely used.

Multiplex bead- or aptamer-based assays

Proteomics plays a critical role in clinical practice, although there are gaps and limitations in translating proteomics from basic molecular research to clinical use. Multiplex bead- or aptamer-based assays have been developed42,43,44,45 but vary in sensitivity and specificity. Therefore, caution and in-house validation studies are required before an assay is applied to clinical samples.

The Luminex bead-based array system has been increasingly used in protein profiling applications in recent years46,47,48,49,50. It makes the detection of proteomic biomarker panels reliable and fast, and able to cope with the dynamic demands of a variety of clinical practices51. Luminex uses flexible, fluorescently labeled beads that are spectrally distinguishable and each coated with a different capture antibody or probe to identify antigens or mutations in samples. It can detect up to 500 analytes (FLEXMAP 3D Platform: https://www.luminexcorp.com/flexmap-3d/) in a single sample using a 96-well or 384-well plate. For proteomic applications, megaplex microspheres are tagged to allow fluorescent detection and can be used to develop multiplex immunoassays by coupling multiple target antibodies. After microsphere activation and conjugation reactions, a panel of bead-antibody complexes is mixed and incubated with samples to capture the protein analytes. The sandwich bead-antibody-antigen complexes are then passed through and counted by a flow cytometer according to the distinct fluorescence of each bead. The high-throughput Luminex system therefore has great potential for fast multiplexed analysis of panels of genetic, proteomic, and metabolic biomarkers associated with disease diagnosis, prognosis, and therapeutics.

Another widely used platform is the Meso-scale Discovery (MSD) assay (https://www.mesoscale.com/), which may be multiplex, single-plex or ultrasensitive. It has been used mostly for cytokine detection, for example in mice with type 1 diabetes or after radiation treatment, and in astrocyte-neuronal network models52,53,54,55. It has also been compared with other platforms and detection assays. For human cytokine profiling, the MSD assay is more sensitive than the Luminex assay but less specific45, while a recent study showed low or no significant correlations for most cytokines (except interleukin 6) among the Luminex xMAP®, MSD V-Plex® and Quantikine assays43. A study of 38 epileptic children showed that a freeze-thaw cycle resulted in consistent measurements for 46% (6 of 13) of the analytes using the Luminex high-sensitivity assay, 11% (1 of 9) using the Luminex standard-sensitivity assay, and none of the analytes using the MSD assay42. Therefore, the Luminex high-sensitivity assay appears to have better precision than the other two assays for epilepsy research. For detecting plasma alpha-synuclein in Parkinson’s disease patients, the MSD assay had a smaller effect size than the Quanterix assay but correlated well with the BioLegend assay56.

One of the widely used bead-based multiplex assays is Simoa® (Single Molecule Array, owned by Quanterix)57. It covers 6 disease areas, is customizable, and as of May 2022 included 109 oncology, 26 neurology, 19 immunology, 13 cardiology and 45 infectious disease assays. The platform can measure 6 to 10 biomarkers in a single test and can detect proteins at concentrations as low as 1 fg/mL. As a highlight of its performance, Simoa® had the highest sensitivity and precision in a comparison of platforms in post-traumatic stress disorder and Parkinson’s disease43, as well as the lowest variation and highest effect size in a three-platform comparison in Parkinson’s disease56.

Antibodies are the primary detection tool in bead-based assays but face challenges in high-throughput platforms. To meet this challenge, alternative protein-binding reagents such as slow off-rate aptamers58 have been produced and commercialized as the SOMAscan® assay. The aptamer-based SOMAscan® assay can assess the expression of 1,000 to 9,000 antigens in a single test and has an impressive dynamic range (8 orders of magnitude), great sensitivity (lower detection limit, 40 fM) and high precision (median coefficient of variation ~5%)59,60. In a study of embryonic stem cells, SOMAscan showed higher reproducibility, higher sensitivity and a larger dynamic range than nano LC-MS/MS and RNA sequencing, but detected fewer features61. Overall, the SOMAscan results were comparable to those of nano LC-MS/MS and RNA sequencing61. In a study of patients with end-stage renal disease, SOMAscan correlated very well with ELISA for two of the three targeted proteins, but not the third62. However, compared with the antibody-based Olink platform, the SOMAscan® assay showed a wide range of correlations in assessing protein expression in two cohorts of chronic obstructive pulmonary disease and should therefore be used with caution63. Moreover, despite its greater coverage and overall good correlation, SOMAscan did not reveal larger odds ratios for the proteins linked to acute kidney injury than those obtained with the immunoassays, including MSD (electro-chemiluminescence platform), Access (paramagnetic-chemiluminescence platform), Unicel (chemiluminescence platform) and Biochip (multiplexed ELISA platform)60.

Proximity extension assay (Olink)

Proximity extension assay, one of the proximity-dependent ligation assays, is based on oligonucleotide-linked antibody pairs whose oligonucleotides have a slight affinity for each other64,65,66. When these oligonucleotide-linked antibodies are brought into proximity, the two unique oligonucleotides linked to the antibodies are extended by a DNA polymerase and later amplified exponentially65. Quantitative real-time PCR is often used to amplify and quantify the oligonucleotides in the sample. Thus, the oligonucleotides serve as a unique surrogate marker of the specific antigens that the antibodies recognize. As described in its original reports64,67, the specific assay steps include: (1) oligonucleotide-linked antibody pairs are added to the sample; (2) the probe pairs bind to the antigen, bringing the probe oligonucleotides into close proximity; (3) the oligonucleotides form pair-wise hybrids of matching probe pairs; (4) the matching probe pairs are amplified using universal primers, a process termed pre-amplification because no specific primers are used; (5) uracil-DNA glycosylase digestion is performed and unbound universal primers are removed; (6) the pre-amplified probe pairs (DNA templates) are quantified using specific primers and quantitative real-time PCR. Multiplex and 96-multiplex detection methods have been developed and also show very high sensitivity and specificity64,67. The Olink assay has been applied to several clinical fields with great success, including coronavirus disease 2019, traumatic brain injury and renal diseases68,69,70,71. It can simultaneously quantify over 3,000 proteins in a minuscule amount of sample (e.g., a few microliters).

Nanopore based single-molecule proteomics

Nanopore technology has been increasingly used for DNA and RNA sequencing and has been explored for proteomics72,73,74. An early application to proteomics was the sequencing of mycobacterial peptides75. Peptide-oligonucleotide conjugates measured by nanopore-induced phase-shift sequencing appeared able to sequence short peptides76. Later, the addition of a helicase was found to be effective in reducing the reading error rate to 30 rereads per million77. It has also been proposed to combine nanopores with other techniques, such as fluorescence labeling and protein fragmentation, for better readouts76. The major challenges of nanopore-based single-molecule proteomics are low efficiency and a lack of sufficient sensitivity for detecting PTMs76.

It is noteworthy that these platform comparisons may not be representative of the whole menu of a given technology, and thus may be applicable only to the aforementioned disease-specific areas. For example, the performance of Simoa® on other diseases may not be as good as its performance on Parkinson’s disease. Thus, caution and in-lab comparisons may be warranted.

Statistics and algorithms

Traditional statistical methods, such as Student’s t-test and one-way analysis of variance (ANOVA), have various biases and can be time-consuming when handling big data78. Therefore, new high-throughput approaches or machine learning-based algorithms (Fig. 2) are needed to process the big data generated from multi-omics79. Machine learning can generally be divided into supervised and unsupervised learning approaches80. Supervised learning applies a “labeled” training set to train a model and predict a qualitative or quantitative output, as in classification and regression. By contrast, unsupervised learning has no labeled output and lets the algorithm determine and identify natural patterns with shared similarities in an unknown dataset, as in cluster analyses80,81. Artificial intelligence and digital pathology are evolving rapidly and will play an even more important role in research, pathology and medicine81,82,83, while some traditional statistical tools, such as normalization and batch-effect removal, remain important.

Fig. 2: The flow chart of data analysis.

Normalization is the most important step after acquiring the raw data. Data can then be analyzed according to the specific study design and the available clinical information, based either on the normalized data or on the results of other analyses. For example, clustering analysis can be performed on the normalized data or on the proteins showing significant changes after SAM.

Normalization

The most common and necessary step in the big-data pre-processing phase is normalization, which centers and rescales the whole numerical data matrix to improve numeric stability, overall performance and model fitting78,84. All machine learning-based statistical models, such as distance-based cluster analysis, regression, and principal component analysis, are susceptible to unscaled data distributions. The most commonly used normalization formula is the Z-score, also called the standard score85. Z-scoring first centers the raw data by subtracting the mean (average) of a group of gene or protein expression values, which reduces the influence of an extreme outlier on a dataset with a small number of samples, and then divides each variable by its standard deviation (SD) to scale it. Furthermore, common housekeeping genes and proteins, such as GAPDH (glyceraldehyde-3-phosphate dehydrogenase) and beta-actin, whose expression is considered constant across samples, can also be used to normalize array data. For ratio data, a log transformation is preferred over Z-scoring.
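
As a minimal sketch of these normalization steps, assuming a protein-by-sample expression matrix held in a pandas DataFrame (the simulated values and names below are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical expression matrix: rows = proteins, columns = samples.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.lognormal(mean=2.0, sigma=0.5, size=(100, 6)),
                    index=[f"protein_{i}" for i in range(100)],
                    columns=[f"sample_{j}" for j in range(6)])

# Z-score each protein across samples: subtract the mean, divide by the SD.
z = data.sub(data.mean(axis=1), axis=0).div(data.std(axis=1), axis=0)

# For ratio-type data, a log transform is often preferred to Z-scoring
# (here, ratios relative to an arbitrarily chosen reference sample).
log2_ratios = np.log2(data.div(data["sample_0"], axis=0))

print(z.iloc[:3, :3])
```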

Significance analysis of microarrays

Significance Analysis of Microarrays (SAM) (https://statweb.stanford.edu/~tibs/SAM/) is a supervised learning program for large-scale gene or protein expression data mining developed by the Stanford University Statistics and Biochemistry Labs. SAM, a Microsoft Excel add-in package, is a widely used high-throughput permutation-based approach to identify differentially expressed proteins between sets of samples in abundance proteomics data; it uses a modified t-statistic (with an associated q-value) that measures the strength of the relationship between protein abundance and disease outcome86. Unlike the regular t-test for small sample sizes, the SAM algorithm is well suited to big data, minimizing the number of false positives and negatives by permuting the columns of the protein abundance matrix and automatically imputing missing data via a nearest-neighbor algorithm. Furthermore, one of SAM’s most valuable features is that it estimates the False Discovery Rate using data permutations, i.e., the proportion of proteins likely to have been identified as significant by chance (Fig. 3).
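
SAM itself is distributed as an Excel add-in, but its core idea, a moderated t-like statistic with a small fudge constant and an FDR estimated by permuting group labels, can be sketched as follows. This is a simplified illustration rather than the SAM implementation; the data, threshold and s0 value are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical abundances: 200 proteins x 12 samples (6 per group).
X = rng.normal(size=(200, 12))
X[:10, 6:] += 1.5                      # make 10 proteins truly different
groups = np.array([0] * 6 + [1] * 6)

def d_stat(X, groups, s0=0.1):
    """Moderated t-like statistic with a fudge factor s0 (SAM-style idea)."""
    a, b = X[:, groups == 0], X[:, groups == 1]
    diff = b.mean(axis=1) - a.mean(axis=1)
    pooled_se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1] +
                        b.var(axis=1, ddof=1) / b.shape[1])
    return diff / (pooled_se + s0)

observed = d_stat(X, groups)
threshold = 2.0                        # illustrative cutoff on |d|
n_called = np.sum(np.abs(observed) >= threshold)

# Estimate the FDR: how many proteins exceed the cutoff under permuted labels?
n_perm, false_calls = 200, []
for _ in range(n_perm):
    perm = rng.permutation(groups)
    false_calls.append(np.sum(np.abs(d_stat(X, perm)) >= threshold))
fdr = np.median(false_calls) / max(n_called, 1)
print(f"called {n_called} proteins, estimated FDR ~ {fdr:.2f}")
```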

Fig. 3: Plot sheet generated by the significance analysis of microarrays: data are presented as a scatter plot of the expected (x-axis) vs observed (y-axis) relative differences, and the solid line indicates where the observed difference equals the expected difference.

Red indicates upregulation and green indicates downregulation. Data points whose observed relative difference deviates from the expected value by more than the threshold are considered significantly different.

Clustering and discriminant analyses

Hierarchical clustering analysis (HCA) is used to cluster big data by forming a dendrogram based on a mathematical model87,88. Several optimized mathematical formulations are available to measure the distance between data points, including the Manhattan (L1) distance89, Euclidean (L2) distance90, Pearson correlation91 and others92. The Euclidean distance is the most commonly used but is vulnerable to outliers, especially in non-normally distributed data, and may be inferior to the Pearson correlation for analyzing proteomic data88. The Manhattan distance requires strict normalization, whereas the Pearson correlation is a scale-invariant similarity measure88,91. It must be noted that the choice of distance measure affects the performance of HCA88,92 and should therefore be made with caution. Moreover, different linkage criteria can be used to measure the distance between clusters, such as average, minimum and maximum distance. Average linkage, for example, uses the average of all data points in one cluster to map to the closest of the other clusters93. Both the distance measure and the linkage formula determine which samples and clusters are grouped together. Based on these two metrics, the model is optimized to keep the data points within a cluster as close as possible in the numerical matrix while keeping the data points in different clusters as far apart as possible. Clustering results are also affected by the input data and the selected variables, such as feature distributions and biomarkers; for example, the results will differ substantially if samples or biomarkers are added, deleted, or replaced. Therefore, the essential variables (biomarkers), sample selection criteria and study goals should be clearly defined before an HCA to ensure a robust and reproducible analysis. In addition, HCA can be one-way or two-way. In two-way HCA, the data are clustered along both the X-axis (samples) and the Y-axis (biomarkers) at the same time (Fig. 4a), whereas in one-way HCA only one axis is clustered, according to the study design.
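
A minimal sketch of HCA with an explicit choice of distance metric and linkage, assuming a hypothetical sample-by-marker matrix (SciPy is used here for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
# Hypothetical matrix: 30 samples x 8 protein markers (two simulated groups).
X = np.vstack([rng.normal(0, 1, (15, 8)), rng.normal(2, 1, (15, 8))])

# Distance metric choices discussed above: "euclidean", "cityblock" (Manhattan)
# or "correlation" (1 - Pearson correlation).
distances = pdist(X, metric="correlation")

# Linkage criteria: "average", "single" (minimum) or "complete" (maximum).
Z = linkage(distances, method="average")

# Cut the dendrogram into a chosen number of clusters; two-way HCA would
# repeat the same procedure on the transposed matrix to cluster the markers.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```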

Fig. 4: Examples of hierarchical clustering analysis (HCA) and grid analysis of time-series expression (GATE).

a The heatmap of a two-way hierarchical clustering analysis performed with the MultiExperiment Viewer (MeV) (http://mev.tm4.org/). The color in each square represents a numerical value, with the color bar at the top. All samples (x-axis) were clustered into three groups, while all protein markers were clustered into four groups. b Protein markers were clustered across multiple data points using GATE. The multiple data points can be divided by time, dosage, or stage of disease.

Additionally, a specialized clustering analysis, Grid Analysis of Time-series Expression (GATE) (Fig. 4b), analyzes and visualizes high-dimensional biomolecular data as time series94. GATE, an integrated computational software platform, uses a correlation-based clustering algorithm to arrange time-series or continuous-time data points on a two-dimensional hexagonal array. It dynamically colors individual hexagons according to the expression level of genes or proteins to create animated movies of systems-level molecular regulatory dynamics. Furthermore, GATE allows interactive interrogation of the movies against a wide variety of knowledge datasets, such as protein interaction hubs, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, Kinase Enrichment Analysis (KEA), WikiPathways and others, to infer potential regulatory control mechanisms from patterns of correlation. These dynamic protein-protein interaction and clustering analyses make it possible to investigate the continuous changes of cell lines or animals at different time points and treatment dosages, snapshots of disease progression, and different stages of cancer.
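
GATE is a dedicated platform, but the underlying idea of grouping molecules by the similarity of their temporal profiles can be illustrated with a simple correlation-based clustering of simulated time-series data (the time points and profiles below are hypothetical, not GATE output):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
timepoints = np.array([0, 2, 6, 12, 24, 48])   # hypothetical hours
n_proteins = 50

# Hypothetical profiles: half rising, half falling over time, plus noise.
rising = np.outer(np.ones(n_proteins // 2), np.linspace(0, 1, len(timepoints)))
falling = 1 - rising
profiles = np.vstack([rising, falling]) + rng.normal(0, 0.1, (n_proteins, 6))

# Use 1 - Pearson correlation as the distance between temporal profiles, so
# proteins with similar dynamics (regardless of scale) cluster together.
dist = pdist(profiles, metric="correlation")
clusters = fcluster(linkage(dist, method="average"), t=2, criterion="maxclust")

for c in np.unique(clusters):
    mean_profile = profiles[clusters == c].mean(axis=0)
    print(f"cluster {c}: mean profile {np.round(mean_profile, 2)}")
```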

In contrast to clustering analysis, which groups samples without using class labels, (predictive) discriminant analysis classifies unknown samples based on what the algorithm has learned and built from a training set95. For example, support vector machines (SVMs), which are not data-type dependent, can be applied to separate numerical or categorical data linearly and to identify potential biomarkers as classifiers96. All samples are first divided into two groups, the training and validation sets. SVMs are then trained on the training set to obtain the most optimized model, tested on the validation set, and the prediction rate is obtained by comparing the predicted values with the true values. The results are affected by both the input data (samples) and the selected variables (also known as features or factors). Both the training and validation sets must include all types of patient samples to cover the relevant clinical situations, such as stages, grades, histologic classifications and complications, in order to minimize false negatives and false positives in clinical practice. For example, the SVM algorithm may not recognize any “new” cases that were not included in the training set, even if the distinction is simple common sense for researchers or physicians. The training set also requires strict rule-in and rule-out criteria while keeping the samples as numerous and diverse as possible to achieve the most accurate classification. In addition, the validation set can be the same as, or a subset of, the training dataset for internal validation in a retrospective evaluation, which is an option for a small sample size or population. However, an external validation cohort is recommended for prospective evaluation and increases the reproducibility, generalizability and scientific rigor of the study. Several issues with discriminant analysis must be noted and avoided, such as the distinction between predictive and descriptive discriminant analyses and between linear and quadratic models97,98. In summary, the main aim of discriminant analysis (machine learning-based or not) is to devise a computationally efficient statistical model that classifies multiple groups of subjects and identifies potential classifiers with a high prediction rate.
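
A minimal sketch of such a discriminant analysis with a linear SVM, using a hypothetical cohort and a held-out validation split (an external cohort would replace the split for prospective evaluation):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(4)
# Hypothetical cohort: 120 patients x 20 protein markers, binary outcome.
X = rng.normal(size=(120, 20))
y = (X[:, :3].sum(axis=1) + rng.normal(0, 0.5, 120) > 0).astype(int)

# Split into training and validation sets, stratified by outcome.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Scale features and fit a linear-kernel SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
clf.fit(X_train, y_train)

# Prediction rate on the held-out validation set.
print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
```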

Kaplan-Meier (K-M) curve and survival analysis

The Kaplan-Meier (K-M) curve is a time-to-event statistical method used to investigate the relationship between an endpoint event and a period of time99. It can be used to evaluate survival time, disease recurrence, clinical trials, animal studies, and more. For survival analysis, data are classified into two types, complete and censored, according to the endpoint event. Death and disease recurrence are the most commonly used endpoint events. Complete data are defined by occurrence of the event during the study period. By contrast, censored data include subjects who were lost to follow-up or experienced a non-qualifying event before the end of the study. The time from a defined starting point (time zero) to the occurrence of the event is measured as the input data. The higher the proportion of censored data in a study, the less accurate the results generally are. The K-M estimate is the simplest way of computing survival over time, and a steep survival curve indicates a low survival rate or a shorter survival period. If the survival curves of the groups cross while inferential analyses show statistical differences, there may be confounding factors or effect modification in the cohort, which can be examined by stratified and multivariate analyses.
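
A minimal sketch of a K-M analysis, assuming the lifelines Python package and a simulated cohort with censoring (all variable names are hypothetical):

```python
import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(5)
# Hypothetical cohort: survival time in months, event = 1 (death), 0 = censored.
n = 80
group = rng.integers(0, 2, n)                  # 0 = marker-low, 1 = marker-high
time = rng.exponential(scale=np.where(group == 0, 30.0, 15.0))
event = (rng.random(n) > 0.2).astype(int)      # ~20% censored

df = pd.DataFrame({"time": time, "event": event, "group": group})

kmf = KaplanMeierFitter()
for g, label in [(0, "marker-low"), (1, "marker-high")]:
    mask = df["group"] == g
    kmf.fit(df.loc[mask, "time"], event_observed=df.loc[mask, "event"],
            label=label)
    print(label, "median survival:", kmf.median_survival_time_)
    # kmf.plot_survival_function()  # draws the K-M curve when plotting is wanted
```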

Two survival curves can be compared statistically with the widely used log-rank test and related tests, such as the Breslow (generalized Wilcoxon) and Tarone-Ware tests, which apply different weight functions100. However, as univariable analyses, they do not allow testing the effect of other disease-related variables. By contrast, the Cox proportional hazards regression model, often used as a multivariable analysis, can test the effect of other variables while identifying independent predictors of disease100. For example, a biomarker can be analyzed together with other risk factors such as age, gender, smoking history, and stage to determine whether it independently affects prognosis. Therefore, an in-depth and comprehensive survival study of PPI or microarray data first performs a log-rank test to identify biomarkers with statistical significance and then analyzes them together with other risk factors using the Cox regression model. The results of this two-step analysis can be classified into three categories: (1) biomarkers are statistically significant in both the log-rank test and the Cox regression model, meaning that they affect prognosis as independent factors; (2) biomarkers are statistically significant only in the log-rank test but not in the Cox regression model, meaning that they are correlated with risk factors, as effect modification, in influencing the disease of interest and may be subject to confounding factors (which the Cox regression model may reveal); (3) if biomarkers are statistically significant only in the Cox regression model but not in the log-rank test, bias or study errors such as confounding bias need to be considered. Moreover, the number of cases with complete data should be at least five to ten times the number of variables included as multiple secondary endpoints in the Cox regression model to avoid type I error.
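
The two-step workflow described above (univariable log-rank test followed by a multivariable Cox model) can be sketched as follows, again with lifelines and hypothetical covariates:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(6)
n = 150
marker = rng.integers(0, 2, n)                 # hypothetical biomarker status
age = rng.normal(60, 10, n)
stage = rng.integers(1, 4, n)
time = rng.exponential(scale=np.where(marker == 1, 20.0, 40.0))
event = (rng.random(n) > 0.25).astype(int)     # ~25% censored

df = pd.DataFrame({"time": time, "event": event, "marker": marker,
                   "age": age, "stage": stage})

# Step 1: univariable log-rank test comparing marker-positive vs marker-negative.
lr = logrank_test(df.loc[df.marker == 1, "time"], df.loc[df.marker == 0, "time"],
                  event_observed_A=df.loc[df.marker == 1, "event"],
                  event_observed_B=df.loc[df.marker == 0, "event"])
print("log-rank p-value:", lr.p_value)

# Step 2: multivariable Cox model testing whether the marker is independent
# of other risk factors (age, stage).
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
print(cph.summary[["coef", "exp(coef)", "p"]])
```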

In addition, many popular regression models are used to analyze proteomics or microarray-based big data, and their functions resemble those of the Cox regression model to varying degrees. For example, multivariable logistic regression, a supervised classification algorithm, models the relationship between a set of continuous, categorical, or dichotomous independent variables and a dichotomous outcome as the dependent variable, without a time variable55,80,101,102. The Cox regression model incorporates a time variable, whereas logistic regression cannot handle censored data accrued over a period of time. Logistic regression, whose underlying concept is quite similar to that of linear regression, assumes a linear association between the features and the dependent variable (also known as the outcome or label) on the log-odds scale. However, unlike linear discriminant analysis, it does not require the variables to be normally distributed. Furthermore, the dependent variable is quantitative in multiple linear regression, rather than a binary outcome as in logistic regression.
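
A minimal sketch of a multivariable logistic regression on a simulated cohort with continuous and dichotomous predictors (variable names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 200
df = pd.DataFrame({
    "marker_level": rng.normal(0, 1, n),       # continuous biomarker
    "age": rng.normal(60, 10, n),
    "smoker": rng.integers(0, 2, n),           # dichotomous covariate
})
# Hypothetical binary outcome influenced by the marker and smoking status.
logit = 1.2 * df["marker_level"] + 0.8 * df["smoker"] - 0.02 * (df["age"] - 60)
df["relapse"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(df[["marker_level", "age", "smoker"]], df["relapse"])

# Coefficients are on the log-odds scale (for the standardized predictors).
print(dict(zip(["marker_level", "age", "smoker"],
               model.named_steps["logisticregression"].coef_[0].round(2))))
```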

Principal component analysis

The main objective of principal component analysis (PCA) (Fig. 5a) is to reduce the dimensionality of big data by creating a set of new variables, called principal components, that represent the majority of the information in the original dataset103. These principal components, which are uncorrelated with each other, reduce the complexity and noise of the original dataset while minimizing the loss of information. Technically, the number of principal components can equal the number of variables in the original dataset. However, the contribution of each principal component, i.e., the proportion of the original variability it represents, decreases sequentially: the first principal component accounts for most of the variability of the original dataset, the second accounts for as much of the remaining variability as possible, and so on until the last component. Therefore, only the first few principal components are the most representative, and this progressively decreasing variability can be visualized as a scree plot (Fig. 5b). This statistical approach to obtaining lower-dimensional representations of a dataset through principal components is useful for the classification and compression of large datasets.
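
A minimal PCA sketch on a hypothetical patient-by-protein matrix, showing the explained-variance ratios that underlie a scree plot:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
# Hypothetical dataset: 60 patients x 8 proteins, with correlated features.
latent = rng.normal(size=(60, 2))
X = latent @ rng.normal(size=(2, 8)) + rng.normal(0, 0.3, size=(60, 8))

# Standardize, then project onto the principal components.
X_std = StandardScaler().fit_transform(X)
pca = PCA()
scores = pca.fit_transform(X_std)

# Proportion of variance captured by each component (basis of the scree plot);
# the first components account for most of the variability.
print(np.round(pca.explained_variance_ratio_, 3))
print("first two components capture:",
      round(pca.explained_variance_ratio_[:2].sum(), 3))
```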

Fig. 5: Principal component analysis (PCA).

a The PCA mapping was performed using the Partek Genomics Suite (Partek, St. Louis, MO) (https://www.partek.com/partek-genomics-suite/). Patients with different survival status (red represents dead and blue represents alive) were separated by eight proteins. The first principal component is plotted on the X-axis and captures 34.9% of the variance. The second principal component is plotted on the Y-axis and captures 15.6% of the variance. b The scree plot represents the contribution of each principal component in the PCA; each principal component’s contribution decreases sequentially.

Ingenuity pathway analysis, gene-set enrichment analysis and circos

Many analytical methods are combined with online databases to analyze proteomics and microarray data, and these are more suitable for discovering clinical significance than for in-depth statistical analysis. Three commonly used computational tools are described below and may be useful for some studies.

Ingenuity Pathway Analysis (IPA) is a web-based software application for causal analysis using expression datasets104. It is now owned by Qiagen, with >109,000 expression datasets and 8.5 million findings (https://digitalinsights.qiagen.com/products-overview/discovery-insights-portfolio/analysis-and-visualization/qiagen-ipa/). It can generate hypothetical molecular interactions to explain cellular processes based on knowledge databases such as the Biomolecular Interaction Network Database (BIND), the Biological General Repository for Interaction Datasets (BioGRID), the Cognia database, the Database of Interacting Proteins (DIP), the IntAct database, the Molecular INTeraction database (MINT), the Munich Information Center for Protein Sequences (MIPS) database and QIAGEN’s Ingenuity Knowledge Base. Therefore, IPA can simultaneously visualize and analyze genomic, proteomic, and metabolomic data across databases to derive signaling networks and canonical pathways from integrated multi-omics formats. The two-dimensional signaling network offers a landscape view of multi-omics data (Fig. 6a) in which all upregulated and downregulated genes or proteins are visualized and connected based on the latest database, and can be labeled by either function or pathway. The ranking of canonical pathway activity combines the experimentally measured changes of each molecule, such as fold changes from PPA or microarray, with the database-derived importance of each molecule in each canonical pathway, calculated with Fisher’s exact test and expressed as the negative log of the p value (Fig. 6b).

Fig. 6: Examples of ingenuity pathway analysis (IPA) and gene-set enrichment analysis (GSEA).

a The signaling networks generated by the database-based Ingenuity Pathway Analysis (IPA). The up- and downregulated proteins are represented by molecules in red and green, respectively. The pathways are labeled outside of the network. b The top canonical pathways most significant to the dataset were identified by IPA. The score assigned to each pathway is presented as –log (p value) using Fisher’s exact test. c The enrichment plot generated by the database-based gene set enrichment analysis (GSEA). The bar in the middle of the figure is colored red to blue from left to right, indicating risk factors and protective factors, respectively. The enriched gene set is IVANOVA_HEMATOPOIESIS_EARLY_PROGENITOR (https://www.gsea-msigdb.org/gsea/msigdb/cards/IVANOVA_HEMATOPOIESIS_EARLY_PROGENITOR), which is a protective factor in this figure.

Gene set enrichment analysis (GSEA) (https://www.gsea-msigdb.org/gsea/index.jsp) is another computational method that provides pathway enrichment tools to help interpret datasets105. This approach focuses on cumulative changes in the expression of multiple genes as a gene set, which shares a similar biological function, chromosomal location, or regulation, rather than on individual genes, to identify pathways106. Similar web-based enrichment tools include Enrichr for analyzing human and mouse data107 and modEnrichr for analyzing fish, fly, worm and yeast data108. The most significant advantage of GSEA is that it can detect pathways in which several genes change by small amounts but in a coordinated way (Fig. 6c). The results reflect many of the complexities of co-regulation and modular expression through the enrichment score (ES), which corresponds to a weighted Kolmogorov–Smirnov-like statistic.
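
The weighted Kolmogorov–Smirnov-like running sum behind the ES can be sketched as below; this is a simplified illustration (no permutation-based normalization or multiple gene sets), and the ranked list and gene set are hypothetical:

```python
import numpy as np

def enrichment_score(ranked_genes, ranking_scores, gene_set, p=1.0):
    """Weighted KS-like running sum over a ranked list (GSEA-style ES)."""
    in_set = np.array([g in gene_set for g in ranked_genes])
    weights = np.abs(ranking_scores) ** p
    hit = np.where(in_set, weights, 0.0)
    hit = hit / hit.sum()                       # increments for genes in the set
    miss = np.where(~in_set, 1.0 / (~in_set).sum(), 0.0)  # decrements otherwise
    running = np.cumsum(hit - miss)
    return running[np.argmax(np.abs(running))]  # extreme deviation from zero

# Hypothetical ranked list (e.g., by correlation with phenotype) and gene set.
rng = np.random.default_rng(9)
genes = [f"gene_{i}" for i in range(1000)]
scores = np.sort(rng.normal(size=1000))[::-1]   # descending ranking metric
gene_set = {f"gene_{i}" for i in range(0, 50)}  # enriched near the top

print("ES =", round(enrichment_score(genes, scores, gene_set), 3))
```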

Additionally, Circos (http://circos.ca/) is a software package for visualizing omics-based data and information in a circular layout109. It has an online version (http://circos.ca/circos_online), which, however, was nonfunctional as of January 2022. A Circos plot can be created to explore the relationships and contributions between canonical pathways and clinicopathological characteristics or risk factors (Fig. 7). Each signaling pathway and clinicopathological category is assigned a unique color, and the arcs depict the correlations between the clinicopathological categories and signaling pathways. The plot not only ranks the activity of each canonical pathway in the disease but also illustrates the activation status of the signaling network in each clinicopathological category. The larger the circumference of an arc, the more active the canonical pathway or the more significant the influence of that clinicopathological category on the signaling network. The area of each colored ribbon delineates the proportion of the signaling pathway that contributes to a particular clinicopathological category.

Fig. 7: Circos plot: Among all eight clinicopathological categories, gender occupies the largest proportion of the distribution, suggesting that it is the clinical factor with the most impact on the signaling network.

Among the 20 canonical pathways altered in the disease, the HER-2 and p53 pathways are affected the most, suggesting that they play essential roles in the pathogenesis of the disease.

Single cell proteomics

Single-cell proteomics is an emerging technique focused on individual cells. In the near future, it will both compete with and complement single-cell transcriptomics for understanding single-cell biology. Single-cell proteomics recently became a reality when advanced technology showed that peptides from a single cell could be efficiently delivered to MS instruments110,111. These single-cell MS methodologies can be broadly divided into label-free and multiplexed methods, the latter of which allow proteomic analyses of multiple cells at the same time. SCoPE2 and scp are R packages for analyzing multiplexed single-cell proteomic data112,113, while SCeptre is their counterpart implemented in Python114. Some general proteomic pipelines may also be used to process single-cell proteomic data, including computational quality-control tools115 and a single pipeline (MSnbase) for data processing and visualization116,117.

In conclusion, proteomic technology and research have advanced tremendously during the last decade. The increasing capability of high-throughput proteomic methods has generated real-time and in-depth datasets. Effective data-mining technologies have also significantly helped the pursuit of novel and useful biomarkers, which are essential for early detection and treatment of disease. With the breakthrough of computing power and the rise of artificial intelligence, the role of proteomics has expanded further. Highly advanced statistical and computational models enable proteomics to be integrated into multi-omics. Under this new trend, proteomics data analysis will be revolutionized into a bigger blueprint together with a large amount of clinical and health-related data. It is an exciting time as proteomics develops into an essential new discipline and becomes integrated with other disciplines. Although proteomics faces emerging challenges during this process, it will move toward more in-depth single-cell biology and individualized precision medicine, boosting both basic research and clinical practice to another level.