Abstract
A main bottleneck in proteomics is the downstream biological analysis of highly multivariate quantitative protein abundance data generated using mass-spectrometry-based analysis. We developed the Perseus software platform (http://www.perseus-framework.org) to support biological and biomedical researchers in interpreting protein quantification, interaction and post-translational modification data. Perseus contains a comprehensive portfolio of statistical tools for high-dimensional omics data analysis covering normalization, pattern recognition, time-series analysis, cross-omics comparisons and multiple-hypothesis testing. A machine learning module supports the classification and validation of patient groups for diagnosis and prognosis, and it also detects predictive protein signatures. Central to Perseus is a user-friendly, interactive workflow environment that provides complete documentation of computational methods used in a publication. All activities in Perseus are realized as plugins, and users can extend the software by programming their own, which can be shared through a plugin store. We anticipate that Perseus's arsenal of algorithms and its intuitive usability will empower interdisciplinary analysis of complex large data sets.
References
Altelaar, A.F., Munoz, J. & Heck, A.J. Next-generation proteomics: towards an integrative view of proteome dynamics. Nat. Rev. Genet. 14, 35–48 (2013).
Cox, J. & Mann, M. Quantitative, high-resolution proteomics for data-driven systems biology. Annu. Rev. Biochem. 80, 273–299 (2011).
Eng, J.K., McCormack, A.L. & Yates, J.R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976–989 (1994). This publication describes the earliest approach to correlating tandem mass spectra of peptides to theoretical fragment-ion series calculated from in silico digests of known protein sequences with the aim of identifying peptides and proteins.
Perkins, D.N., Pappin, D.J., Creasy, D.M. & Cottrell, J.S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551–3567 (1999).
Geer, L.Y. et al. Open mass spectrometry search algorithm. J. Proteome Res. 3, 958–964 (2004).
Craig, R. & Beavis, R.C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 20, 1466–1467 (2004).
Bern, M., Cai, Y. & Goldberg, D. Lookup peaks: a hybrid of de novo sequencing and database search for protein identification by tandem mass spectrometry. Anal. Chem. 79, 1393–1400 (2007).
Craig, R., Cortens, J.P. & Beavis, R.C. Open source system for analyzing, validating, and storing protein identification data. J. Proteome Res. 3, 1234–1242 (2004).
Nesvizhskii, A.I., Vitek, O. & Aebersold, R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat. Methods 4, 787–797 (2007).
Deutsch, E.W. et al. Trans-Proteomic Pipeline, a standardized data processing pipeline for large-scale reproducible proteomics informatics. Proteomics Clin. Appl. 9, 745–754 (2015).
Cox, J. & Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 26, 1367–1372 (2008). Perseus has been developed in conjunction with MaxQuant, which comprises a complete quantitative workflow for the analysis of shotgun proteomics data, including support for a large variety of experimental techniques.
Vizcaíno, J.A. et al. The PRoteomics IDEntifications (PRIDE) database and associated tools: status in 2013. Nucleic Acids Res. 41, D1063–D1069 (2013).
Vizcaíno, J.A. et al. ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nat. Biotechnol. 32, 223–226 (2014).
de Godoy, L.M. et al. Comprehensive mass-spectrometry-based proteome quantification of haploid versus diploid yeast. Nature 455, 1251–1254 (2008).
Hebert, A.S. et al. The one hour yeast proteome. Mol. Cell. Proteomics 13, 339–347 (2014). In this paper the authors demonstrate that the yeast proteome can be analyzed within a 1-h measurement time, recovering nearly all expressed cellular proteins.
Nagaraj, N. et al. Deep proteome and transcriptome mapping of a human cancer cell line. Mol. Syst. Biol. 7, 548 (2011).
Beck, M. et al. The quantitative proteome of a human cell line. Mol. Syst. Biol. 7, 549 (2011).
Munoz, J. et al. The quantitative proteomes of human-induced pluripotent stem cells and embryonic stem cells. Mol. Syst. Biol. 7, 550 (2011).
Mann, M., Kulak, N.A., Nagaraj, N. & Cox, J. The coming age of complete, accurate, and ubiquitous proteomes. Mol. Cell 49, 583–590 (2013).
Wiśniewski, J.R., Hein, M.Y., Cox, J. & Mann, M. A 'proteomic ruler' for protein copy number and concentration estimation without spike-in standards. Mol. Cell. Proteomics 13, 3497–3506 (2014).
Cox, J. et al. Accurate proteome-wide label-free quantification by delayed normalization and maximal peptide ratio extraction, termed MaxLFQ. Mol. Cell. Proteomics 13, 2513–2526 (2014). Here the MaxLFQ algorithm for relative label-free protein quantification is described. It enabled many researchers to conduct large proteomics studies with complex experimental designs without the need for labeling their samples.
Geiger, T., Cox, J., Ostasiewicz, P., Wisniewski, J.R. & Mann, M. Super-SILAC mix for quantitative proteomics of human tumor tissue. Nat. Methods 7, 383–385 (2010).
Tusher, V.G., Tibshirani, R. & Chu, G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA 98, 5116–5121 (2001). A pioneering method is described for the robust detection of significantly changing biomolecules in large omics data sets. It uses repeated permutations of the data to determine FDRs.
Alter, O., Brown, P.O. & Botstein, D. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci. USA 97, 10101–10106 (2000).
Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102, 15545–15550 (2005). GSEA is the forerunner of many methods for analyzing molecular profiling data to determine which sets of genes or proteins are correlated with a phenotypic class distinction.
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. 57, 289–300 (1995). In this seminal paper a simple yet powerful procedure is shown to control the FDR for multiple testing of many independent hypotheses.
Huber, W. et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods 12, 115–121 (2015).
Beausoleil, S.A., Villén, J., Gerber, S.A., Rush, J. & Gygi, S.P. A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat. Biotechnol. 24, 1285–1292 (2006).
Cox, J. et al. Andromeda: a peptide search engine integrated into the MaxQuant environment. J. Proteome Res. 10, 1794–1805 (2011).
Olsen, J.V. et al. Quantitative phosphoproteomics reveals widespread full phosphorylation site occupancy during mitosis. Sci. Signal. 3, ra3 (2010).
Sharma, K. et al. Ultradeep human phosphoproteome reveals a distinct regulatory nature of Tyr and Ser/Thr-based signaling. Cell Rep. 8, 1583–1594 (2014).
Gene Ontology Consortium. Gene Ontology Consortium: going forward. Nucleic Acids Res. 43, D1049–D1056 (2015).
Hornbeck, P.V. et al. PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Res. 43, D512–D520 (2015).
Tyanova, S., Cox, J., Olsen, J., Mann, M. & Frishman, D. Phosphorylation variation during the cell cycle scales with structural propensities of proteins. PLoS Comput. Biol. 9, e1002842 (2013).
Hein, M.Y. et al. A human interactome in three quantitative dimensions organized by stoichiometries and abundances. Cell 163, 712–723 (2015).
Huttlin, E.L. et al. The BioPlex network: a systematic exploration of the human interactome. Cell 162, 425–440 (2015).
Hubner, N.C. et al. Quantitative proteomics combined with BAC TransgeneOmics reveals in vivo protein interactions. J. Cell Biol. 189, 739–754 (2010).
Selbach, M. & Mann, M. Protein interaction screening by quantitative immunoprecipitation combined with knockdown (QUICK). Nat. Methods 3, 981–983 (2006).
Keilhauer, E.C., Hein, M.Y. & Mann, M. Accurate protein complex retrieval by affinity enrichment mass spectrometry (AE-MS) rather than affinity purification mass spectrometry (AP-MS). Mol. Cell. Proteomics 14, 120–135 (2015).
Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).
Räschle, M. et al. DNA repair. Proteomics reveals dynamic assembly of repair complexes during bypass of DNA cross-links. Science 348, 1253671 (2015).
Spellman, P.T. et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 9, 3273–3297 (1998).
Gauthier, N.P. et al. Cyclebase.org—a comprehensive multi-organism online database of cell-cycle experiments. Nucleic Acids Res. 36, D854–D859 (2008).
Eser, P. et al. Periodic mRNA synthesis and degradation co-operate during cell cycle gene expression. Mol. Syst. Biol. 10, 717 (2014).
Partch, C.L., Green, C.B. & Takahashi, J.S. Molecular architecture of the mammalian circadian clock. Trends Cell Biol. 24, 90–99 (2014).
Robles, M.S., Cox, J. & Mann, M. In vivo quantitative proteomics reveals a key contribution of post-transcriptional mechanisms to the circadian regulation of liver metabolism. PLoS Genet. 10, e1004047 (2014).
Wang, Z., Gerstein, M. & Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10, 57–63 (2009).
Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Ingolia, N.T., Ghaemmaghami, S., Newman, J.R. & Weissman, J.S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324, 218–223 (2009).
Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5, 621–628 (2008).
Schwanhäusser, B. et al. Global quantification of mammalian gene expression control. Nature 473, 337–342 (2011). In this publication a large-scale quantitative analysis of transcription and translation rates is performed, introducing the iBAQ technique for estimating protein abundances from mass-spectrometry data.
Aviner, R., Shenoy, A., Elroy-Stein, O. & Geiger, T. Uncovering hidden layers of cell cycle regulation through integrative multi-omic analysis. PLoS Genet. 11, e1005554 (2015).
Yates, A. et al. Ensembl 2016. Nucleic Acids Res. 44, D710–D716 (2016).
Cox, J. & Mann, M. 1D and 2D annotation enrichment: a statistical method integrating quantitative proteomics with complementary high-throughput data. BMC Bioinformatics 13, S12 (2012).
Deeb, S.J. et al. Machine learning-based classification of diffuse large B-cell lymphoma patients by their protein expression profiles. Mol. Cell. Proteomics 14, 2947–2960 (2015).
Iglesias-Gato, D. et al. The proteome of primary prostate cancer. Eur. Urol. 69, 942–952 (2016).
Tyanova, S. et al. Proteomic maps of breast cancer subtypes. Nat. Commun. 7, 10259 (2016).
Vapnik, V.N. The Nature of Statistical Learning Theory (Springer, 1995).
Chang, C.-C. & Lin, C.-J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 1–27 (2011).
Hastie, T., Tibshirani, R. & Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, 2001).
Zhang, B. & Horvath, S. A general framework for weighted gene co-expression network analysis. Stat. Appl. Genet. Mol. Biol. 4, Article17 (2005).
Ideker, T. & Krogan, N.J. Differential network biology. Mol. Syst. Biol. 8, 565 (2012).
Creixell, P. et al. Pathway and network analysis of cancer genomes. Nat. Methods 12, 615–621 (2015).
Hoops, S. et al. COPASI–a COmplex PAthway SImulator. Bioinformatics 22, 3067–3074 (2006).
Angermann, B.R. et al. Computational modeling of cellular signaling processes embedded into dynamic spatial contexts. Nat. Methods 9, 283–289 (2012).
Cowan, A.E., Moraru, I.I., Schaff, J.C., Slepchenko, B.M. & Loew, L.M. Spatial modeling of cell signaling networks. Methods Cell Biol. 110, 195–221 (2012).
Croft, D. et al. The Reactome pathway knowledgebase. Nucleic Acids Res. 42, D472–D477 (2014).
Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457–D462 (2016).
Tyanova, S. et al. Visualization of LC-MS/MS proteomics data in MaxQuant. Proteomics 15, 1453–1456 (2015).
Ihaka, R. & Gentleman, R. R: a language for data analysis and graphics. J. Comput. Graph. Stat. 5, 299–314 (1996).
Troyanskaya, O. et al. Missing value estimation methods for DNA microarrays. Bioinformatics 17, 520–525 (2001).
Liew, A.W., Law, N.F. & Yan, H. Missing value imputation for gene expression data: computational techniques to recover missing data from available information. Brief. Bioinform. 12, 498–513 (2011).
UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 43, D204–D212 (2015).
Hosp, F. et al. A double-barrel liquid chromatography-tandem mass spectrometry (LC-MS/MS) system to quantify 96 interactomes per day. Mol. Cell. Proteomics 14, 2030–2041 (2015).
Stingele, S. et al. Global analysis of genome, transcriptome and proteome reveals the response to aneuploidy in human cells. Mol. Syst. Biol. 8, 608 (2012).
Acknowledgements
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no. 686547 (J.C.) and from the FP7 grant agreement GA ERC-2012-SyG_318987–ToPAG (J.C.).
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Schematic representation of the Workflow window in Perseus
All data matrices uploaded in the running Perseus session and all processing steps are displayed in order of execution. The workflow lets users keep track of every step in the analysis and navigate through data matrices and visualization components by clicking on the respective node in the diagram. Nodes can be annotated with a description and additional information for clarity. If a data matrix node is selected, the number of samples and data points is displayed in the rightmost panel of Perseus. If an analysis node is selected, all parameters used in that step can be reviewed. Each data matrix and each visualization window can be exported in publication-ready formats. The workflow scheme can be saved as a PDF file and used as documentation of all steps of the analysis.
Supplementary Figure 2 Plug-in architecture of Perseus
Perseus is built around a data matrix type, together with various functions for accessing and transforming the matrix. The base code implementing these operations is open source and can be downloaded from GitHub (github.com/JurgenCox/perseus-plugins). The rest of the functionality is organized in two main interfaces, ‘Processing’ and ‘Analysis’, and the resulting modules are added to the software core as plug-ins. Developers wishing to extend the software can build upon the main source code and contribute new plug-ins to our online plug-in store.
Supplementary Figure 3 Missing value imputation
Perseus offers several imputation techniques, including a method that draws random values from a distribution meant to simulate expression below the detection limit. The width and the down-shift of the distribution can be set to closely represent the missing population. When missing values occur randomly, a distribution similar to that of the measured data is normally used for imputation. In contrast, a frequent assumption in proteomics experiments is that low-expression proteins give rise to missing values; a Gaussian distribution whose median is shifted from the median of the measured data towards low expression should therefore impute such values accurately. The mode parameter defines which measured data distribution is used to calculate the random distribution. When the samples do not differ greatly in their overall distribution, use of the complete data set is recommended. The measured distribution is shown in blue and the imputed values in orange. (a) No down-shift and a distribution width of 0.5 do not simulate low-abundance missing values. (b) A down-shift of 1.8 and a distribution width of 0.5 simulate the assumption that low-abundance proteins give rise to missing values. (c) A down-shift of 3.6 and a width of 0.5 result in an undesirable bimodal distribution.
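The down-shifted Gaussian imputation described in this caption can be sketched roughly as follows. This is an illustrative sketch rather than the Perseus implementation: the function name is hypothetical, the distribution is centred on the mean of the measured values (close to the median for roughly Gaussian log-intensities), and the width and down-shift are assumed to be expressed in units of the standard deviation of the measured values.

```python
import math
import random
import statistics


def impute_from_shifted_normal(values, width=0.5, down_shift=1.8, seed=0):
    """Replace NaN entries with draws from a narrowed, down-shifted Gaussian.

    Assumption (not stated in the caption): width and down_shift are given
    in units of the standard deviation of the measured values.
    Needs at least two measured (non-NaN) values.
    Returns the filled list and a mask marking which entries were imputed.
    """
    measured = [v for v in values if not math.isnan(v)]
    mu = statistics.mean(measured)
    sigma = statistics.stdev(measured)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    imputed_mask = [math.isnan(v) for v in values]
    filled = [
        rng.gauss(mu - down_shift * sigma, width * sigma) if missing else v
        for v, missing in zip(values, imputed_mask)
    ]
    return filled, imputed_mask


# Hypothetical log2 intensities for one sample; NaN marks a missing value.
sample = [25.1, 27.4, float("nan"), 26.3, 28.0, float("nan"), 24.8]
filled, mask = impute_from_shifted_normal(sample)
```

Calling the function once per sample or once on the values of the complete data set corresponds to the two choices that the mode parameter distinguishes. Setting `down_shift=0` reproduces the situation in panel (a), where imputed values fall in the middle of the measured distribution.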
Supplementary Figure 4 Density-enhanced scatterplots between proteome, transcriptome and translatome levels produced by the upload plug-in
Short-read NGS data, as produced for instance by the Illumina platform, can be imported for further analysis in the Perseus workflow. In the example, we calculate RPKM values for each gene (Ingolia N. T. et al., Science, 2009) and compare these with iBAQ values calculated by MaxQuant from proteomics data derived from yeast (Kulak N. A. et al., Nature Methods, 2014).
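For readers unfamiliar with RPKM (reads per kilobase of transcript per million mapped reads), the calculation can be sketched as below. The gene names, read counts and lengths are made up for illustration, and taking the total number of mapped reads as the sum over the listed genes is a simplifying assumption; it is not a description of how the upload plug-in works internally.

```python
def rpkm(read_counts, gene_lengths_bp):
    """Compute RPKM per gene.

    RPKM = reads * 1e9 / (gene length in bp * total mapped reads),
    i.e. reads per kilobase of gene per million mapped reads.
    Assumption: the total mapped reads equal the sum of all counts given.
    """
    total_reads = sum(read_counts.values())
    return {
        gene: count * 1e9 / (gene_lengths_bp[gene] * total_reads)
        for gene, count in read_counts.items()
    }


# Hypothetical example: two genes with the same read density.
counts = {"GENE_A": 500, "GENE_B": 1500}
lengths = {"GENE_A": 1000, "GENE_B": 3000}
values = rpkm(counts, lengths)
```

Because both hypothetical genes have the same number of reads per base, they receive the same RPKM even though their raw counts differ threefold, which is the point of the length normalization.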
Supplementary Figure 5 Augmented data matrix
In addition to the main data matrix, Perseus can make use of background information complementary to the expression columns. (a) Filtering for a minimum number of valid values is often one of the first processing steps in data analysis. As some statistical methods (e.g. PCA) require all values to be present, data imputation may be necessary. Upon imputation, a second matrix is created in the background that records which values were measured and which were imputed; it can later be used to highlight or remove the imputed values. (b) In a more advanced filtering option, a ‘Quality matrix’ is first created, which contains additional information about each expression value in the main matrix and is used for filtering. For example, the number of peptides used for protein quantification can be used to filter out proteins that were identified with fewer than two peptides.
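The two filtering steps in this caption can be illustrated with a small sketch. The data layout (a dictionary mapping protein names to lists of values) and both function names are hypothetical choices for this example; treating a low-quality entry as a missing value is likewise an assumption about one reasonable way to use a quality matrix, not a description of Perseus internals.

```python
import math

NAN = float("nan")


def filter_min_valid(matrix, min_valid):
    """Keep only proteins with at least min_valid non-missing values."""
    return {
        protein: values
        for protein, values in matrix.items()
        if sum(not math.isnan(v) for v in values) >= min_valid
    }


def mask_by_quality(matrix, quality, min_quality):
    """Set expression values to NaN where the parallel quality value
    (e.g. number of peptides used for quantification) is too low."""
    return {
        protein: [
            v if q >= min_quality else NAN
            for v, q in zip(values, quality[protein])
        ]
        for protein, values in matrix.items()
    }


# Hypothetical log2 intensities and peptide counts for three samples.
expression = {"P1": [20.0, 22.0, 21.0], "P2": [18.0, 18.5, 19.0]}
peptides = {"P1": [3, 1, 2], "P2": [1, 1, 1]}

masked = mask_by_quality(expression, peptides, min_quality=2)
kept = filter_min_valid(masked, min_valid=2)
```

In this example only P1 survives: its second value is masked because it rests on a single peptide, while P2 loses all three values and is then removed by the valid-value filter.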
Supplementary information
Supplementary Text and Figures
Supplementary figures 1–5 and Supplementary Table 1 (PDF 1033 kb)
Cite this article
Tyanova, S., Temu, T., Sinitcyn, P. et al. The Perseus computational platform for comprehensive analysis of (prote)omics data. Nat Methods 13, 731–740 (2016). https://doi.org/10.1038/nmeth.3901
DOI: https://doi.org/10.1038/nmeth.3901