Plants are essential for life and are extremely diverse organisms with unique molecular capabilities (ref. 1). Here we present a quantitative atlas of the transcriptomes, proteomes and phosphoproteomes of 30 tissues of the model plant Arabidopsis thaliana. Our analysis provides initial answers to how many genes exist as proteins (more than 18,000), where they are expressed, in which approximate quantities (a dynamic range of more than six orders of magnitude) and to what extent they are phosphorylated (over 43,000 sites). We present examples of how the data may be used, such as to discover proteins that are translated from short open reading frames, to uncover sequence motifs that are involved in the regulation of protein production, and to identify tissue-specific protein complexes or phosphorylation-mediated signalling events. Interactive access to this resource for the plant community is provided by the ProteomicsDB and ATHENA databases, which include powerful bioinformatics tools to explore and characterize Arabidopsis proteins, their modifications and interactions.
The data supporting the findings of this study are available within the paper, the Supplementary Information and the public repositories. Source Data for Figs. 1–5 and Extended Data Figs. 1–9 are included with the paper. Transcriptome sequencing and quantification data are available at ArrayExpress (www.ebi.ac.uk/arrayexpress) under the identifier E-MTAB-7978. The raw mass spectrometric data and MaxQuant result files have been deposited with the ProteomeXchange Consortium via PRIDE (ref. 122), with the dataset identifier PXD013868.
Krämer, U. Planting molecular functions in an ecological context with Arabidopsis thaliana. eLife 4, (2015).
Peng, J. et al. ‘Green revolution’ genes encode mutant gibberellin response modulators. Nature 400, 256–261 (1999).
The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815 (2000).
Kawakatsu, T. et al. Epigenomic diversity in a global collection of Arabidopsis thaliana accessions. Cell 166, 492–505 (2016).
Cheng, C. Y. et al. Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. Plant J. 89, 789–804 (2017).
The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45 (D1), D158–D169 (2017).
Baerenfaller, K. et al. Genome-scale proteomics reveals Arabidopsis thaliana gene models and proteome dynamics. Science 320, 938–941 (2008).
van Wijk, K. J., Friso, G., Walther, D. & Schulze, W. X. Meta-analysis of Arabidopsis thaliana phospho-proteomics data reveals compartmentalization of phosphorylation motifs. Plant Cell 26, 2367–2389 (2014).
Durek, P. et al. PhosPhAt: the Arabidopsis thaliana phosphorylation site database. An update. Nucleic Acids Res. 38, D828–D834 (2010).
Sharma, K. et al. Ultradeep human phosphoproteome reveals a distinct regulatory nature of Tyr and Ser/Thr-based signaling. Cell Rep. 8, 1583–1594 (2014).
Schmidt, T. et al. ProteomicsDB. Nucleic Acids Res. 46 (D1), D1271–D1281 (2018).
Gessulat, S. et al. Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning. Nat. Methods 16, 509–518 (2019).
Bienvenut, W. V. et al. Comparative large scale characterization of plant versus mammal proteins reveals similar and idiosyncratic N-α-acetylation features. Mol. Cell. Proteomics 11, mcp.M111.015131 (2012).
Hazarika, R. R. et al. ARA-PEPs: a repository of putative sORF-encoded peptides in Arabidopsis thaliana. BMC Bioinformatics 18, 37 (2017).
Wilhelm, M. et al. Mass-spectrometry-based draft of the human proteome. Nature 509, 582–587 (2014).
Zheng, Y. et al. iTAK: a program for genome-wide prediction and classification of plant transcription factors, transcriptional regulators, and protein kinases. Mol. Plant 9, 1667–1670 (2016).
Yang, M. et al. A comprehensive analysis of protein phosphatases in rice and Arabidopsis. Plant Syst. Evol. 289, 111–126 (2010).
Litt, A. & Kramer, E. M. The ABC model and the diversification of floral organ identity. Semin. Cell Dev. Biol. 21, 129–137 (2010).
Bar-On, Y. M. & Milo, R. The global mass and average rate of rubisco. Proc. Natl Acad. Sci. USA 116, 4738–4743 (2019).
Gupta, R. et al. Time to dig deep into the plant proteome: a hunt for low-abundance proteins. Front. Plant Sci. 6, 22 (2015).
Galván-Ampudia, C. S. & Offringa, R. Plant evolution: AGC kinases tell the auxin tale. Trends Plant Sci. 12, 541–547 (2007).
Zhang, Y., He, J. & McCormick, S. Two Arabidopsis AGC kinases are critical for the polarized growth of pollen tubes. Plant J. 58, 474–484 (2009).
Eraslan, B. et al. Quantification and discovery of sequence determinants of protein-per-mRNA amount in 29 human tissues. Mol. Syst. Biol. 15, e8513 (2019).
Liu, Y., Beyer, A. & Aebersold, R. On the dependency of cellular protein levels on mRNA abundance. Cell 165, 535–550 (2016).
Hanson, G. & Coller, J. Codon optimality, bias and usage in translation and mRNA decay. Nat. Rev. Mol. Cell Biol. 19, 20–30 (2018).
Schwanhäusser, B. et al. Global quantification of mammalian gene expression control. Nature 473, 337–342 (2011).
Santner, A. & Estelle, M. The ubiquitin-proteasome system regulates plant hormone signaling. Plant J. 61, 1029–1040 (2010).
Luo, J., Zhou, J. J. & Zhang, J. Z. Aux/IAA gene family in plants: molecular structure, regulation, and function. Int. J. Mol. Sci. 19, E259 (2018).
Bai, B. et al. Seed stored mRNAs that are specifically associated to monosome are translationally regulated during germination. Plant Physiol. 182, 378–392 (2019).
Szklarczyk, D. et al. The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res. 45 (D1), D362–D368 (2017).
Wang, Y., Tan, X. & Paterson, A. H. Different patterns of gene structure divergence following gene duplication in Arabidopsis. BMC Genomics 14, 652 (2013).
Lloyd, J. & Meinke, D. A comprehensive dataset of genes with a loss-of-function mutant phenotype in Arabidopsis. Plant Physiol. 158, 1115–1129 (2012).
Brandão, M. M., Dantas, L. L. & Silva-Filho, M. C. AtPIN: Arabidopsis thaliana protein interaction network. BMC Bioinformatics 10, 454 (2009).
Kristensen, A. R., Gsponer, J. & Foster, L. J. A high-throughput approach for measuring temporal changes in the interactome. Nat. Methods 9, 907–909 (2012).
Schwartz, D. & Gygi, S. P. An iterative statistical approach to the identification of protein phosphorylation motifs from large-scale data sets. Nat. Biotechnol. 23, 1391–1398 (2005).
Villén, J., Beausoleil, S. A., Gerber, S. A. & Gygi, S. P. Large-scale phosphorylation analysis of mouse liver. Proc. Natl Acad. Sci. USA 104, 1488–1493 (2007).
Battaglia, M., Olvera-Carrillo, Y., Garciarrubio, A., Campos, F. & Covarrubias, A. A. The enigmatic LEA proteins and other hydrophilins. Plant Physiol. 148, 6–24 (2008).
Bah, A. et al. Folding of an intrinsically disordered protein by phosphorylation as a regulatory switch. Nature 519, 106–109 (2015).
Mitra, S. K. et al. An autophosphorylation site database for leucine-rich repeat receptor-like kinases in Arabidopsis thaliana. Plant J. 82, 1042–1060 (2015).
Landry, C. R., Levy, E. D. & Michnick, S. W. Weak functional constraints on phosphoproteomes. Trends Genet. 25, 193–197 (2009).
Hauser, F., Li, Z., Waadt, R. & Schroeder, J. I. SnapShot: abscisic acid signaling. Cell 171, 1708–1708 (2017).
Vaddepalli, P. et al. The C2-domain protein QUIRKY and the receptor-like kinase STRUBBELIG localize to plasmodesmata and mediate tissue morphogenesis in Arabidopsis thaliana. Development 141, 4139–4148 (2014).
Fulton, L. et al. DETORQUEO, QUIRKY, and ZERZAUST represent novel components involved in organ development mediated by the receptor-like kinase STRUBBELIG in Arabidopsis thaliana. PLoS Genet. 5, e1000355 (2009).
Smyth, D. R., Bowman, J. L. & Meyerowitz, E. M. Early flower development in Arabidopsis. Plant Cell 2, 755–767 (1990).
Johnson-Brousseau, S. A. & McCormick, S. A compendium of methods useful for characterizing Arabidopsis pollen mutants and gametophytically-expressed genes. Plant J. 39, 761–775 (2004).
Sprunck, S. et al. Egg cell-secreted EC1 triggers sperm cell activation during double fertilization. Science 338, 1093–1097 (2012).
Karimi, M., Inzé, D. & Depicker, A. GATEWAY vectors for Agrobacterium-mediated plant transformation. Trends Plant Sci. 7, 193–195 (2002).
Clough, S. J. & Bent, A. F. Floral dip: a simplified method for Agrobacterium-mediated transformation of Arabidopsis thaliana. Plant J. 16, 735–743 (1998).
Schmid, M. et al. A gene expression map of Arabidopsis thaliana development. Nat. Genet. 37, 501–506 (2005).
Boyes, D. C. et al. Growth stage-based phenotypic analysis of Arabidopsis: a model for high throughput functional genomics in plants. Plant Cell 13, 1499–1510 (2001).
Bowman, J. L. Arabidopsis: an Atlas of Morphology and Development (Springer-Verlag, 1994).
Bradford, M. M. A rapid and sensitive method for the quantitation of microgram quantities of protein utilizing the principle of protein-dye binding. Anal. Biochem. 72, 248–254 (1976).
Ruprecht, B. et al. Optimized enrichment of phosphoproteomes by Fe-IMAC column chromatography. Methods Mol. Biol. 1550, 47–60 (2017).
Marx, H. et al. A large synthetic peptide and phosphopeptide reference library for mass spectrometry-based proteomics. Nat. Biotechnol. 31, 557–564 (2013).
Ruprecht, B., Zecha, J., Zolg, D. P. & Kuster, B. High pH reversed-phase micro-columns for simple, sensitive, and efficient fractionation of proteome and (TMT labeled) phosphoproteome digests. Methods Mol. Biol. 1550, 83–98 (2017).
Smith, P. K. et al. Measurement of protein using bicinchoninic acid. Anal. Biochem. 150, 76–85 (1985).
Zolg, D. P. et al. PROCAL: a set of 40 peptide standards for retention time indexing, column performance monitoring, and collision energy calibration. Proteomics 17, (2017).
Hahne, H. et al. DMSO enhances electrospray response, boosting sensitivity of proteomic experiments. Nat. Methods 10, 989–991 (2013).
Bian, Y. et al. Robust, reproducible and quantitative analysis of thousands of proteomes by micro-flow LC-MS/MS. Nat. Commun. 11, 157 (2020).
Tyanova, S., Temu, T. & Cox, J. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nat. Protocols 11, 2301–2319 (2016).
Hanada, K. et al. sORF finder: a program package to identify small open reading frames with high coding potential. Bioinformatics 26, 399–400 (2010).
Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat. Biotechnol. 29, 644–652 (2011).
Li, W., Jaroszewski, L. & Godzik, A. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics 17, 282–283 (2001).
Perkins, D. N., Pappin, D. J., Creasy, D. M. & Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551–3567 (1999).
Franken, H. et al. Thermal proteome profiling for unbiased identification of direct and indirect drug targets using multiplexed quantitative mass spectrometry. Nat. Protocols 10, 1567–1593 (2015).
Toprak, U. H. et al. Conserved peptide fragmentation as a benchmarking tool for mass spectrometers and a discriminating feature for targeted proteomics. Mol. Cell. Proteomics 13, 2056–2071 (2014).
Oñate-Sánchez, L. & Vicente-Carbajosa, J. DNA-free RNA isolation protocols for Arabidopsis thaliana, including seeds and siliques. BMC Res. Notes 1, 93 (2008).
Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).
Silva, J. C., Gorenstein, M. V., Li, G. Z., Vissers, J. P. & Geromanos, S. J. Absolute quantification of proteins by LCMSE: a virtue of parallel MS acquisition. Mol. Cell. Proteomics 5, 144–156 (2006).
The Gene Ontology Consortium. Expansion of the Gene Ontology knowledgebase and resources. Nucleic Acids Res. 45 (D1), D331–D338 (2017).
Cox, J. & Mann, M. 1D and 2D annotation enrichment: a statistical method integrating quantitative proteomics with complementary high-throughput data. BMC Bioinformatics 13 (Suppl. 16), S12 (2012).
Tyanova, S. et al. The Perseus computational platform for comprehensive analysis of (prote)omics data. Nat. Methods 13, 731–740 (2016).
Olsen, J. V. et al. Global, in vivo, and site-specific phosphorylation dynamics in signaling networks. Cell 127, 635–648 (2006).
Uhlén, M. et al. Transcriptomics resources of human tissues and organs. Mol. Syst. Biol. 12, 862 (2016).
Rijpkema, A. S., Vandenbussche, M., Koes, R., Heijmans, K. & Gerats, T. Variations on a theme: changes in the floral ABCs in angiosperms. Semin. Cell Dev. Biol. 21, 100–107 (2010).
Heazlewood, J. L., Verboom, R. E., Tonti-Filippini, J., Small, I. & Millar, A. H. SUBA: the Arabidopsis Subcellular Database. Nucleic Acids Res. 35, D213–D218 (2007).
Löytynoja, A. Phylogeny-aware alignment with PRANK. Methods Mol. Biol. 1079, 155–170 (2014).
Castresana, J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 17, 540–552 (2000).
Yang, Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591 (2007).
van der Graaf, A. et al. Rate, spectrum, and evolutionary dynamics of spontaneous epimutations. Proc. Natl Acad. Sci. USA 112, 6676–6681 (2015).
Gebert, D., Jehn, J. & Rosenkranz, D. Widespread selection for extremely high and low levels of secondary structure in coding sequences across all domains of life. Open Biol. 9, 190020 (2019).
Camiolo, S., Melito, S. & Porceddu, A. New insights into the interplay between codon bias determinants in plants. DNA Res. 22, 461–470 (2015).
Drummond, D. A., Bloom, J. D., Adami, C., Wilke, C. O. & Arnold, F. H. Why highly expressed proteins evolve slowly. Proc. Natl Acad. Sci. USA 102, 14338–14343 (2005).
Das, S. & Bansal, M. Variation of gene expression in plants is influenced by gene architecture and structural properties of promoters. PLoS ONE 14, e0212678 (2019).
Celaj, A. et al. Quantitative analysis of protein interaction network dynamics in yeast. Mol. Syst. Biol. 13, 934 (2017).
Niederhuth, C. E. et al. Widespread natural variation of DNA methylation within angiosperms. Genome Biol. 17, 194 (2016).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Zhou, X. & Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 44, 821–824 (2012).
Nakazawa, N. fmsb: functions for medical statistics book with some demographic data. R package v.0.6.3; https://CRAN.R-project.org/package=fmsb (2018).
Zhang, Z. Variable selection with stepwise and best subset approaches. Ann. Transl. Med. 4, 136 (2016).
Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58, 267–288 (1996).
R Core Team. R: A language and environment for statistical computing. https://www.R-project.org/ (R Foundation for Statistical Computing, 2014).
Knecht, W. Pilot Willingness to Take Off Into Marginal Weather, Part II: Antecedent Overfitting With Forward Stepwise Logistic Regression. Final Report DOT/FAA/AM-05/15 (Federal Aviation Administration, 2005).
Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010).
Groemping, U. Relative importance for linear regression in R: the package relaimpo. J. Stat. Softw. 17, 1–27 (2007).
Heusel, M. et al. Complex-centric proteome profiling by SEC-SWATH-MS. Mol. Syst. Biol. 15, e8438 (2019).
McBride, Z., Chen, D., Reick, C., Xie, J. & Szymanski, D. B. Global analysis of membrane-associated protein oligomerization using protein correlation profiling. Mol. Cell. Proteomics 16, 1972–1989 (2017).
Ruepp, A. et al. CORUM: the comprehensive resource of mammalian protein complexes–2009. Nucleic Acids Res. 38, D497–D501 (2010).
Zhang, B. & Horvath, S. A general framework for weighted gene co-expression network analysis. Stat. Appl. Genet. Mol. Biol. 4, Article17 (2005).
Langfelder, P., Zhang, B. & Horvath, S. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics 24, 719–720 (2008).
Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y. & Morishima, K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45 (D1), D353–D361 (2017).
Fabregat, A. et al. The Reactome pathway Knowledgebase. Nucleic Acids Res. 44 (D1), D481–D487 (2016).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57, 289–300 (1995).
Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).
List, M. et al. KeyPathwayMinerWeb: online multi-omics network enrichment. Nucleic Acids Res. 44 (W1), W98–W104 (2016).
Letunic, I. & Bork, P. 20 years of the SMART protein domain annotation resource. Nucleic Acids Res. 46 (D1), D493–D496 (2018).
Wagih, O. ggseqlogo: a versatile R package for drawing sequence logos. Bioinformatics 33, 3645–3647 (2017).
Goel, R., Harsha, H. C., Pandey, A. & Prasad, T. S. Human Protein Reference Database and Human Proteinpedia as resources for phosphoproteome analysis. Mol. Biosyst. 8, 453–463 (2012).
Zourelidou, M. et al. The polarly localized D6 PROTEIN KINASE is required for efficient auxin transport in Arabidopsis thaliana. Development 136, 627–636 (2009).
Mayer, U., Büttner, G. & Jürgens, G. Apical-basal pattern formation in the Arabidopsis embryo: studies on the role of the gnom gene. Development 117, 149–162 (1993).
Moes, D., Himmelbach, A., Korte, A., Haberer, G. & Grill, E. Nuclear localization of the mutant protein phosphatase abi1 is required for insensitivity towards ABA responses in Arabidopsis. Plant J. 54, 806–819 (2008).
Tischer, S. V. et al. Combinatorial interaction network of abscisic acid receptors and coreceptors from Arabidopsis thaliana. Proc. Natl Acad. Sci. USA 114, 10280–10285 (2017).
Waterhouse, A. et al. SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res. 46 (W1), W296–W303 (2018).
Nishimura, N. et al. Structural mechanism of abscisic acid binding and signaling by dimeric PYR1. Science 326, 1373–1379 (2009).
Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).
Pettersen, E. F. et al. UCSF Chimera—a visualization system for exploratory research and analysis. J. Comput. Chem. 25, 1605–1612 (2004).
Box, M. S., Coustham, V., Dean, C. & Mylne, J. S. Protocol: A simple phenol-based method for 96-well extraction of high quality RNA from Arabidopsis. Plant Methods 7, 7 (2011).
Enugutti, B. et al. Regulation of planar growth by the Arabidopsis AGC protein kinase UNICORN. Proc. Natl Acad. Sci. USA 109, 15060–15065 (2012).
Koncz, C. & Schell, J. The promoter of TL-DNA gene 5 controls the tissue-specific expression of chimaeric genes carried by a novel type of Agrobacterium binary vector. Mol. Gen. Genet. 204, 383–396 (1986).
Schindelin, J. et al. Fiji: an open-source platform for biological-image analysis. Nat. Methods 9, 676–682 (2012).
Vizcaíno, J. A. et al. 2016 update of the PRIDE database and its related tools. Nucleic Acids Res. 44 (D1), D447–D456 (2016).
Kwok, S. F. et al. Arabidopsis homologs of a c-Jun coactivator are present both in monomeric form and in the COP9 complex, and their abundance is differentially affected by the pleiotropic cop/det/fus mutations. Plant Cell 10, 1779–1790 (1998).
We thank the NGS@tum core facility for RNA sequencing, R. Tofanelli for help with imaging the ovules, R. J. Schmitz for providing data access for the feature analysis and M. Reinecke, F. Bayer and S. Galinec for mass spectrometry measurements. This work was in part funded by the German Science Foundation (DFG, SFB924), a research fellowship to H.S. by the Japan Society for the Promotion of Sciences, and a research fellowship to X.C. by the Chinese Research Council.
M.W. and B.K. are founders and shareholders of OmicScouts GmbH and msAId GmbH. They have no operational role in the companies. M.F. and D.P.Z. are founders and shareholders of msAId GmbH. T.M. and M.B. are employees and/or shareholders of Cellzome GmbH. The remaining authors declare no competing interests.
Peer review information Nature thanks José Dinneny, Paul Haynes and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Extended data figures and tables
a, Pairwise global Pearson’s expression correlation analysis of all 30 tissues (n = 1 measurement per tissue) on the transcriptome level (bottom triangle) and proteome level (top triangle) using all identified gene loci. Proteins correlate more strongly between tissues than transcripts. Turquoise squares mark examples of morphologically highly similar tissues. Tissues are coloured as in Fig. 1. b, Scatter plots showing highly reproducible abundance measurements for transcript (top) and protein (bottom) in the morphologically similar tissues that were marked in a; namely, node (ND) versus internode (IND), leaf distal (LFD) versus leaf proximal (LFP) and root (RT) versus root upper zone (RTUZ). r denotes the Pearson’s correlation coefficient; n denotes the number of transcripts or proteins. c, Percentage of genes encoded by a specific chromosome that were identified at the transcriptome, proteome or phosphoproteome level. d, Percentage of Swiss-Prot and TrEMBL protein database entries as well as protein evidence categories from UniProt that were identified at the transcriptome, proteome or phosphoproteome level. Evidence level: (1) protein evidence; (2) transcript evidence; (3) homology; (4) predicted; and (5) uncertain. e, Comparison of protein identifications between an earlier Arabidopsis proteome study (ref. 7) based on 12 tissues, this study (30 tissues) and the number of protein-coding genes in Araport11. f, iBAQ intensity distribution of proteins identified in this study. Proteins also identified in a previous study (ref. 7) are projected into the same plot. g, Left, proportion of identified P-sites on S, T or Y residues with highly confident localization of the phosphorylation site within the identified peptide sequence (termed class I P-sites if the localization score is greater than 0.75). Right, distribution of proteins for which phosphorylated S, T or Y residues were identified.
h, Left, Venn diagram comparing phosphoprotein datasets from a previous publication (ref. 8), PhosPhAt 4.0 and this study. Right, Venn diagram comparing P-site localization confidence between class I sites identified in this study and the low and high confidence datasets reported in a previous publication (ref. 8).
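The tissue-by-tissue comparison in panels a and b can be illustrated with a minimal sketch. It assumes that abundances are log10-transformed before correlation and that each pair of tissues is compared on the genes quantified in both; neither detail is stated in the legend, and the function names and toy values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pairwise_correlation(abundance):
    """abundance: dict tissue -> dict gene -> abundance (> 0).

    Returns dict (tissue_a, tissue_b) -> (r, n), where n is the number of
    genes quantified in both tissues. Values are log10-transformed first
    (an assumption, not taken from the legend).
    """
    tissues = sorted(abundance)
    result = {}
    for i, a in enumerate(tissues):
        for b in tissues[i + 1:]:
            shared = sorted(set(abundance[a]) & set(abundance[b]))
            xa = [math.log10(abundance[a][g]) for g in shared]
            xb = [math.log10(abundance[b][g]) for g in shared]
            result[(a, b)] = (pearson(xa, xb), len(shared))
    return result

# Toy values for illustration only; not data from the atlas.
toy = {
    "ND": {"g1": 10.0, "g2": 100.0, "g3": 1000.0, "g4": 50.0},
    "IND": {"g1": 12.0, "g2": 90.0, "g3": 1100.0},
    "RT": {"g1": 1000.0, "g2": 100.0, "g3": 10.0},
}
for pair, (r, n) in pairwise_correlation(toy).items():
    print(pair, round(r, 3), n)
```

Reporting n alongside r mirrors panel b, where both values are given for each scatter plot.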
a, Number of identified N-terminal (NT) or C-terminal (CT) peptides of proteins in either unmodified or phosphorylated form. b, Frequency of amino acids following the initiator methionine in N-terminal peptides with (−X) or without (M−X) cleavage of the initiator methionine. X denotes the amino acid after the start codon. c, Frequency of protein N-terminal acetylation for amino acids in b. Because trypsin was used for protein digestion, the frequencies for Arg and Lys residues could not be determined (n.d.). d, Distribution of peptide-based sequence coverage of proteins in individual tissues and for the combined dataset (tissue abbreviations as in Fig. 1). Boxes contain 50% of the data and show the median as a black line. The top and bottom quartile ranges are shown as whiskers. The number of proteins is indicated for each tissue. e, Pie charts showing the percentage of proteins identified by <3, 3–10 or >10 peptides either allowing shared (razor) peptides or restricting to unique peptides only. f, Left, number of protein isoforms detected at the transcript and protein level compared with the number of all annotated isoforms in Araport11. Right, number of multiple isoforms of the same gene distinguished at the peptide level. g, Validation of protein isoform and sORF identification by comparing the tandem mass spectra from the tissue atlas to those of synthetic peptide reference standards. The normalized spectral contrast angle (SA) was used as a similarity metric (Methods). Candidate isoforms and sORFs were considered valid if the spectral contrast angle of the spectra was >0.7. These data are reported in Supplementary Data 3. h, Amino acid sequence and mirror plots of tandem mass spectra for two peptides of the sORF BIP138_4. The spectra pointing upwards were collected from tissue digests; those pointing downwards were collected from synthetic peptides. 
The normalized spectral contrast angle and Pearson’s correlation coefficient (r) were used as similarity metrics (Methods) and indicate that both high-scoring spectra (n = 1 acquired spectrum each) are nearly identical, thus validating the identification of this sORF as an expressed protein. i, Dynamic range of transcript abundance (grey) and proportion of transcripts that were also identified at the protein level projected into this plot (blue). OM, orders of magnitude. Note that for lower abundance transcripts, fewer proteins were detected. j, Dynamic range of protein abundance and proportion of proteins with phosphorylation evidence. Protein abundance spans six orders of magnitude, whereas transcript abundance only spans four (i). In addition, note that phosphorylation was detected across the entire protein abundance range. k, Percentage of all annotated kinases (K), phosphatases (P), transcription factors (TF) and transcription regulators (TR) detected at the transcript, protein or phosphoprotein levels. Numbers below the x axis denote the number of genes for these protein classes in the A. thaliana genome.
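The similarity metric in panels g and h can be sketched as follows. The formula is the commonly used definition of the normalized spectral contrast angle, SA = 1 − 2·arccos(cos θ)/π, where cos θ is the cosine similarity of the two intensity vectors; the exact procedure is given in the Methods, which are not reproduced here. The sketch assumes that fragment ions have already been matched, so both lists hold intensities for the same ions in the same order. The function name is hypothetical.

```python
import math

def normalized_spectral_contrast_angle(a, b):
    """Similarity of two spectra given as intensity lists over the same
    ordered set of matched fragment ions.

    Returns 1.0 for spectra with identical relative intensities and 0.0
    for spectra that share no intensity on any ion.
    """
    if len(a) != len(b) or not a:
        raise ValueError("spectra must be non-empty and of equal length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("spectra must contain at least one non-zero intensity")
    cosine = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding error
    return 1.0 - 2.0 * math.acos(cosine) / math.pi

# Toy intensities for illustration only.
tissue_spectrum = [120.0, 40.0, 300.0, 15.0]
synthetic_spectrum = [110.0, 45.0, 310.0, 10.0]
sa = normalized_spectral_contrast_angle(tissue_spectrum, synthetic_spectrum)
print(round(sa, 3), sa > 0.7)  # 0.7 is the validity threshold stated in panel g
```

Because the angle depends only on relative intensities, scaling one spectrum by a constant factor leaves the score unchanged.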
a, Distribution of expression specificity categories for protein and transcript identifications. See Methods for the definition of these categories. In brief, there are very few transcripts and proteins that are only expressed in a single tissue. The quantities of the shared transcripts or proteins can differ vastly between tissues (b). b, Left, protein identifications shared between flower (FL) and flower organs showing an almost complete qualitative overlap of proteins. Sepal (SP), petal (PT), stamen (ST), carpel (CP). Right, clustering of z-scored protein intensities showing distinct quantitative expression differences between flower organs. c, Expression analysis of flower organ identity markers at the protein and transcript level. PISTILLATA (PI, green), APETALA3 (AP3, red), APETALA1 (AP1, orange), AGAMOUS (AG, blue). The expression of these markers is in line with the model of flower organ identity (AP1 expression marking sepal; AP1, AP3 and PI marking petal; AG, AP3 and PI marking stamen; and AG marking carpel). d, Total number of transcripts plotted against the total number of proteins detected in each individual tissue (n = 30 tissues) showing that the more genes are expressed as mRNAs, the more proteins can be detected in a tissue (Pearson’s correlation r = 0.79). Tissues are coloured according to tissue groups as in Fig. 1. e, Cumulative abundance plots of intensity-ranked identifications of transcripts and proteins for five representative tissues. The five most abundant transcripts and proteins are listed in descending order for each tissue. These are generally not the same. In addition, note that the characteristics of the plots are not the same for all tissues. In flower, the protein line rises more quickly than the transcript line. The opposite is true for pollen and a more even characteristic is observed in seed. f, Distribution of shared and unique identifications among the 100 most abundant transcripts and proteins in each tissue.
Relatively few proteins and transcripts are found together on the list of the 100 most abundant transcripts and proteins. This demonstrates that the quantitative differences in transcript and protein expression are more important in defining a tissue than the qualitative expression of transcripts or proteins. g, List of 11 proteins that were found as the most abundant protein (in at least one tissue) and their proportion of the total iBAQ intensity in each tissue. Individual proteins can represent up to 9% of the total protein in a given tissue. h, Principal component analysis (PCA) of the core tissue proteomes and transcriptomes (that is, the proteins and transcripts that were identified in every tissue) using z-scored abundances. Only about 30% of all proteins and 20% of all mRNAs were detected in each of the 30 tissues despite the fact that all tissues were deeply profiled at both protein and transcript level. This shows that strong qualitative and quantitative expression differences exist between tissues. The PCA separates tissues into photosynthetically active versus inactive tissues (component 1) and separates pollen from all other tissues (component 2), indicating that the molecular composition of pollen is particularly different from all other tissues. i, Proportion of the total summed protein intensity for genes with specific subcellular compartment annotation (from SUBA (ref. 77); Methods) in the different tissue groups. The comparison of photosynthetically active and inactive tissues shows that most of the protein content in photosynthetically active tissues is contained in the plastids, whereas most protein is found in the cytosol for photosynthetically inactive tissues. Proteins with only one single subcellular compartment annotation were selected for the plot and the proportions of their iBAQ intensities were averaged for each tissue group.
Nucleus (n = 1,393), endoplasmic reticulum (n = 58), Golgi (n = 68), peroxisome (n = 67), plastid (n = 525), mitochondrion (n = 317), vacuole (n = 71), cytosol (n = 385), cytoskeleton (n = 1), plasma membrane (n = 268), extracellular (n = 351).
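The cumulative abundance curves of panel e and the top-100 comparison of panel f can be approximated with a short sketch. It assumes one abundance value per gene and tissue; the function names and toy values are hypothetical and the published analysis may differ in detail.

```python
def cumulative_abundance(intensities):
    """Cumulative fraction of the total intensity, ranked from the most
    to the least abundant entry (the y values of a curve as in panel e)."""
    ranked = sorted(intensities, reverse=True)
    total = sum(ranked)
    if total <= 0:
        raise ValueError("total intensity must be positive")
    fractions, running = [], 0.0
    for value in ranked:
        running += value
        fractions.append(running / total)
    return fractions

def top_n_overlap(transcripts, proteins, n=100):
    """Genes that are among the n most abundant entries at both the
    transcript and the protein level (dicts gene -> abundance)."""
    def top(values):
        return set(sorted(values, key=values.get, reverse=True)[:n])
    return top(transcripts) & top(proteins)

# Toy values for illustration only.
curve = cumulative_abundance([1.0, 3.0, 6.0])
shared = top_n_overlap({"a": 5, "b": 3, "c": 1}, {"a": 1, "b": 9, "c": 8}, n=2)
print(curve, shared)
```

The first element of the curve is the share of the single most abundant entry, which corresponds to the proportions listed in panel g.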
a, Pearson’s correlation (r) of transcriptome and proteome expression (core datasets; n = 5,043) for each tissue. b, Pearson’s correlation between measured and predicted protein abundance levels in all tissues. Predicted protein abundance levels were obtained from the best fitting feature selection model for each tissue (Methods). The number of genes used for the correlation analysis is indicated for each tissue. c, Violin plots showing the spread in relative contribution of selected features to the prediction of gene-level protein abundance across tissues (n = 30 tissues) using our model. Violin shapes show the kernel density estimation of the data distribution and the median as a white dot. Thick black bars denote the interquartile range. d, Specific nucleotide sequence motifs in 5′ UTRs of mRNAs contribute to the prediction of protein levels in a subset of tissues. Clustering tissues based on the presence or absence of detected 5′ UTR motifs shows that several features are repeatedly selected for inclusion in the model while others appear to be more tissue-specific. e, On the basis of the observation that the dN/dS ratio between orthologues of A. thaliana and A. lyrata contributed to the prediction of protein levels (c), we analysed this feature in more detail. Left, distribution of the dN/dS ratio for orthologous genes in A. thaliana and A. lyrata. The distribution is plotted for the example of ‘leaf distal’ (n = 6,447 genes). To compare evolutionarily conserved genes (defined by low dN/dS ratios) and genes that evolve neutrally or are under positive selection (high dN/dS ratios), we selected the bottom 5% and top 5% of the dN/dS ratio distribution, respectively. Right, evolutionarily conserved genes (low dN/dS ratio) show 10–20 times higher protein abundance than genes under evolutionary pressure. Boxes contain 50% of the data and show the median as a black line. Whiskers denote 1.5 times the interquartile range. Outliers were omitted from the plot for clarity.
f, Time-course analysis of median protein abundance changes after treatment with CHX (translation block) or MG132 (proteasome block) versus time-matched DMSO control samples (Methods). Boxes contain 50% of the data and show medians as black lines. Whiskers denote 1.5 times the interquartile range. Outliers were omitted from the plot for clarity but were included in the statistical tests below. All proteins in the experiment (n = 8,920, grey), proteins that have a high PTR in seed (n = 425, red) or a low PTR in seed (n = 254, blue) (defined as in Fig. 3d) are shown. Differences between time points were tested for significance within each subset (all; high PTR; low PTR) using one-way ANOVA and the post hoc Tukey HSD test. ***P < 0.001 (all_CHX8–CHX16: P < 1 × 10−7; all_CHX8–CHX24: P < 1 × 10−7; all_CHX16–CHX24: P = 0.0002; highPTR_CHX8–CHX24: P = 0.0003; lowPTR_CHX8–CHX16: P = 0.0000004; lowPTR_CHX8–CHX24: P < 1 × 10−7; lowPTR_CHX16–CHX24: P < 1 × 10−7). g, Representative images of seeds after 4 days of incubation with CHX, MG132 or DMSO control medium (n = 1). Germination was completely inhibited by CHX and partially inhibited by MG132, showing that the drug treatments were effective.
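The tail comparison described in e can be illustrated with a minimal sketch. It assumes two hypothetical dictionaries keyed by gene identifier (dN/dS ratio and protein abundance in one tissue); the function name and the use of the median as summary statistic are illustrative choices, not the procedure documented in Methods.

```python
from statistics import median

def tail_abundance_ratio(dnds, abundance, frac=0.05):
    """Median abundance of the lowest-dN/dS genes divided by that of
    the highest-dN/dS genes.

    dnds, abundance: dicts keyed by gene ID (hypothetical inputs).
    frac: size of each tail as a fraction of all genes (5% in the legend).
    """
    # Keep only genes with both values, ordered from low to high dN/dS.
    genes = sorted((g for g in dnds if g in abundance), key=lambda g: dnds[g])
    k = max(1, int(len(genes) * frac))  # number of genes in each tail
    conserved = [abundance[g] for g in genes[:k]]   # bottom tail of dN/dS
    divergent = [abundance[g] for g in genes[-k:]]  # top tail of dN/dS
    return median(conserved) / median(divergent)
```

A returned value in the range of 10 to 20 would correspond to the difference reported in the right-hand panel of e.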
a, Median PTRs across tissues plotted against the inter-tissue variation of these PTRs (expressed as MAD; proteins and transcripts had to be detected in at least 10 matching tissues to be included in the analysis). Arrows denote examples of genes with high PTRs (rbcL and petA) and low PTRs (IAA8 and IAA13). Bar plot shows the MAD range segmented into five quantiles, each containing the same number of genes (coloured bars and dashed lines). Most genes have reasonably stable PTRs across tissues. b, As in a (dataset n = 14,069) but for transcript (left) and protein (right) measurements. Protein levels vary more across tissues than mRNA levels do (80% of all transcripts show a MAD of <1; 80% of all proteins show a MAD of 1.2). Variation across tissues is also greater for low-abundance proteins. This may in part be due to technical limitations, as low-abundance proteins can generally be quantified less accurately. c, As in a but for the ratio of phosphorylation site versus protein abundance. P-sites and proteins had to be detected in at least 10 matching tissues to be included in the analysis (n = 13,793). d, As in b (dataset n = 13,793) but for P-site abundance. P-site abundance shows greater variation across tissues than protein abundance (60% of all P-sites show MAD <1 compared with 80% of all proteins; see b). Again, this may in part be due to technical limitations, as P-site quantification is performed at the peptide level and does not benefit from aggregating multiple peptide quantifications into one value as protein quantification does.
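The two quantities plotted in a (median PTR and its MAD across tissues) can be computed along the following lines. This is a sketch under assumptions: the inputs are hypothetical per-gene dictionaries mapping tissue name to abundance, PTRs are taken on a log10 scale, and MAD is the unscaled median absolute deviation; the exact transformation used in the study is described in Methods.

```python
import math
from statistics import median

def ptr_summary(protein, transcript, min_tissues=10):
    """Return (median, MAD) of log10 protein-to-RNA ratios for one gene,
    or None if fewer than min_tissues tissues have both measurements.

    protein, transcript: dicts mapping tissue name -> abundance
    (hypothetical inputs; zero or missing values count as not detected).
    """
    shared = [t for t in protein
              if t in transcript and protein[t] > 0 and transcript[t] > 0]
    if len(shared) < min_tissues:
        return None
    ptrs = [math.log10(protein[t] / transcript[t]) for t in shared]
    med = median(ptrs)
    # Median absolute deviation: spread of the PTR around its median.
    mad = median([abs(x - med) for x in ptrs])
    return med, mad
```

Genes with a small MAD have a stable PTR across tissues; the same function applies to c if P-site intensity and protein abundance are passed in place of protein and transcript values.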
Extended Data Fig. 6 Inferring redundant gene function and physical interactions from co-expression analysis.
a, Scatter plot of Pearson’s correlation coefficients (r) as a measure for co-expression across tissues for all pairs of proteins (x axis) and all pairs of transcripts (y axis) (core dataset only, n = 5,043) along with their marginal histograms. Colours denote the log10-normalized STRING scores of individual gene pairs as a measure of known or predicted direct (physical) or indirect (functional) associations. Gene pairs that are strongly co-expressed at the transcript level, the protein level or both are more strongly related (physically or functionally) than pairs that are not. b, Co-expression analysis of duplicated genes (pairs had to be detected in at least 10 matching tissues to be included in the analysis). The density plots show the distribution of Pearson’s correlation coefficients (r) of co-expressed transcripts (grey) or proteins (blue) for genes that arose by whole-genome duplications (WGD), local duplications or transposon-mediated duplications. Randomly selected gene pairs are shown as a control. Medians are given and displayed as dotted lines. There is substantial co-expression of duplicated genes, indicating that these genes probably have redundant functions. c, Left, protein-level Pearson’s correlation coefficient (r) values (from b) for all duplicate gene pairs (WGD, local, transposed) plotted against the protein abundance ratio of each pair (average across 30 tissues) (Methods). Blue arrows denote examples of a high and a low ratio of protein production for the duplicated genes. Right, example of tissue-resolved protein intensity proportions (top-3) (Methods) for the duplicate pair MAC5A and MAC5B. Irrespective of the tissue, MAC5A is always much more highly expressed than MAC5B. Tissues are coloured as in Fig. 1. d, Top, ranked protein abundance ratio for selected duplicate pairs (mean ± s.d.; n = 30) and annotated for phenotypic effects (bottom) in the loss-of-function mutant for either duplicate 1 or duplicate 2 (+).
Minus symbols denote absence of a phenotypic effect. Asymmetric protein production within duplicate pairs can be associated with the occurrence of a phenotype in the loss-of-function mutant of the more highly expressed duplicate protein, indicating a dominant functional role of the more highly expressed protein. Blue arrows highlight MAC5A–MAC5B and PHB3–PHB4 as examples. e, Inference of physical protein–protein interactions from co-expression data. Distribution of pairwise Pearson’s correlation coefficients (r) of co-expressed proteins across (at least 10) tissues that are subunits of selected protein complexes. r > 0.5 (shaded in grey) was chosen as a cut-off for the selection of proteins for subsequent analysis to make sure that proteins present in well-characterized protein complexes are retained. CONSTITUTIVE PHOTOMORPHOGENESIS9 SIGNALOSOME (CSN), CELLULOSE SYNTHASE (CESA). f, Recovery of annotated protein–protein interactions by co-expression analysis. Distribution of Pearson’s correlation coefficients (r) of pairs of transcripts (grey) or proteins (blue) that are annotated to interact physically in the AtPIN database33 (pairs had to be detected in at least 10 matching tissues to be included in the analysis). Subsets of the AtPIN database are shown, namely interactions detected by the yeast two-hybrid (Y2H) method, by affinity purification–mass spectrometry (AP–MS) or both. Values of r > 0.5 are shaded in blue (protein). Dotted lines denote median values. Co-expression only recovers a minority of annotated physical interactions, and interactions supported by more than one line of experimental evidence also tend to show stronger co-expression.
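The co-expression measure used throughout this figure (Pearson’s r across matching tissues, with a minimum of 10 tissues and a cut-off of r > 0.5) can be sketched as follows. The input layout (one dictionary per gene, mapping tissue name to abundance) and the function name are assumptions made for illustration; any scaling or log transformation applied in the study is not reproduced here.

```python
import math

def coexpression(a, b, min_tissues=10):
    """Pearson's r between two abundance profiles over shared tissues.

    a, b: dicts mapping tissue name -> abundance (hypothetical inputs).
    Returns None if fewer than min_tissues tissues are shared or if
    either profile is constant (r is undefined in that case).
    """
    shared = [t for t in a if t in b]
    if len(shared) < min_tissues:
        return None
    x = [a[t] for t in shared]
    y = [b[t] for t in shared]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)
```

A pair would then be retained for the analysis in e when `coexpression(a, b)` is not `None` and exceeds 0.5.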
Extended Data Fig. 7 Inferring protein complexes and subunit stoichiometry from proteome correlation profiling using SEC–MS.
a, Molecular mass (MW) of monomeric proteins (determined from sequence) plotted against the mass determined from the apex of the elution profile for proteins identified by SEC–MS fractions of flower tissue (sFL). Inset shows the molecular mass calibration of the SEC column using a protein calibration standard (mass between 44 and 690 kDa). The distribution of proteins annotated in Araport11 is shown at the top. Many proteins show a much higher apparent molecular mass than would be expected from their sequences (data points above the x = y line). This suggests that these proteins engage in physical protein interactions that are sufficiently stable during SEC separation. b, SEC traces of proteins from five well-characterized protein complexes for flower, leaf and root tissue. Although the resolution of SEC separations is not very high, the complex subunits show very strong co-elution behaviour and the SEC separations of the five complexes are reproducible between tissues. CoA carboxylase n = 4 proteins; CDC48 n = 3 proteins; RubisCO n = 4 proteins; prefoldin n = 6 proteins; SCS n = 3 proteins. c, Intensity-normalized SEC elution profile of proteins for flower tissue. Proteins are ordered based on the SEC fraction in which their intensity peaks and the data are displayed as a heat map (n = 2,485 protein traces). Co-eluting proteins were grouped into ‘trace modules’ (Methods). Proteins in trace modules may represent members of protein complexes and thus serve as candidates for further experimental validation. d, To quantify how well protein complexes can be detected using co-expression analysis from data in the tissue atlas (TA) or by SEC–MS, a summary statistic termed ‘complex index’ was calculated (Methods). The complex index is 1 when all subunits of a complex are identified in the same module and no other proteins are contained in the module. 
Bar plots show examples for complex indices obtained from the different datasets and are divided into large (>4 subunits) and small (≤4 subunits) protein complexes (according to UniProt). Co-expression alone generates many candidates of interactors, but combining co-expression and SEC–MS analysis is an efficient way to prioritize candidates for follow-up experiments. e, Subunit heterogeneity within the coatomer complex. The coatomer complex consists of seven subunits, five of which (α, β, β′, ε and ζ) can be provided by twelve paralogues of these five genes. Plots show the protein proportions of these paralogues in all 30 tissues (data from tissue atlas). The coatomer complex has a similar composition in most tissues. A notable exception is seed tissues, in which production of subunit ζ-1 dominates over the two other paralogous proteins, suggesting that the coatomer complex in seed tissue also preferentially contains the ζ-1 subunit. Tissues are coloured as in Fig. 1. f, Absolute SEC intensity traces of individual complex subunits for determining subunit stoichiometry. Examples from left to right: the chaperonin complex (flower, 8 proteins, ratio of all subunits: 1:1), the 26S proteasome core and lid (flower, 14+17 proteins, ratio of all subunits: 1:1), the COP9 signalosome (flower, CSN; 8 proteins, ratio of all subunits: 1:1) and the CESA1–CESA3–CESA6 complex (root, 3 proteins, ratio of all subunits: 1:1). CSN3 and CSN5 were detected both as part of the CSN complex and in monomeric form. g, Top, total intensity of protein complex subunits across all tissues for the complexes shown in f (subunit intensities from the tissue atlas). Middle, relative proportion (mean ± s.d.; n = 30 tissues) of subunits across tissues (Methods). For the CESA complex, ratios were calculated for the subunit combinations CESA1–CESA3–CESA6 and CESA4–CESA7–CESA8. 
The stoichiometries determined from the tissue expression data are generally well-aligned with the expected 1:1 ratio of subunits in these complexes. As noted in f, a substantial amount of CSN5 was detected as a monomer in the SEC analysis, and the tissue expression atlas also shows higher relative expression of this protein compared with all other complex partners. This suggests that the protein is produced in excess over what is required for the COP9 complex (as observed previously123), and may therefore indicate an additional function within the cell.
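The relative subunit proportions in g (mean ± s.d. across tissues) can be approximated with the sketch below. It assumes that a subunit’s proportion in a tissue is its intensity divided by the summed intensity of all subunits of the complex in that tissue; the precise calculation is given in Methods and may differ. Input names are hypothetical.

```python
from statistics import mean, stdev

def subunit_proportions(intensities):
    """Mean and s.d. of each subunit's share of total complex intensity.

    intensities: dict mapping subunit name -> list of intensities, one per
    tissue, in the same tissue order for every subunit (hypothetical input).
    At least two tissues with non-zero total intensity are required.
    """
    subunits = list(intensities)
    n_tissues = len(intensities[subunits[0]])
    props = {s: [] for s in subunits}
    for i in range(n_tissues):
        total = sum(intensities[s][i] for s in subunits)
        if total == 0:
            continue  # complex not detected in this tissue
        for s in subunits:
            props[s].append(intensities[s][i] / total)
    return {s: (mean(p), stdev(p)) for s, p in props.items()}
```

For a complex with k subunits in a 1:1 ratio, each mean proportion is expected to lie close to 1/k; a subunit produced in excess, as described for CSN5, would show a larger share.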
a, Percentage of annotated kinase and phosphatase family members detected at the protein or phosphoprotein level. Numbers in parentheses denote the number of genes in each family in the Arabidopsis genome. b, Tissue-resolved combined intensity (that is, protein abundance) of families of kinases (left) and phosphatases (right). Tissues are coloured as in Fig. 1. Several tissues (notably pollen) stand out in terms of the expression of kinases and phosphatases, which indicates that these tissues are particularly active in phosphorylation-mediated dynamic signalling. c, Top, pie chart of specificity categories for kinases and phosphatases (see Methods for definition). Bottom, distribution of tissue-enhanced kinases and phosphatases across the 30 tissues. Several tissues (such as pollen) stand out in terms of the expression of certain kinases and phosphatases, which indicates tissue-specific signalling. d, Pie charts showing the proportion of proline-directed, acidic, basic and other motif categories for phosphorylated Ser (pS), Thr (pT) and Tyr (pY) residues. Only class I P-sites (localization score > 0.75; Methods) were considered in this analysis. e, Example motif logo plots for motifs such as proline-directed, acidic and basic. P-site motifs were identified using the motif-X algorithm (see Supplementary Table 2 for all 266 motifs). n denotes the number of phosphorylation sites that contain the respective motif; ‘fc’ denotes the fold change (that is, enrichment) of the motif in phosphorylated versus unmodified peptides (Methods). f, Enrichment of proline-directed (yellow), acidic (red), basic (blue) and other (grey) sequence motifs (circles) in the serine P-site dataset versus the same motifs detected in the background dataset of unmodified peptides (Methods). Motifs are shown for two, three and four fixed amino acid positions. The P-site in each motif example is underlined. ‘X’ denotes any amino acid.
g, Number of identified P-sites for a given protein plotted against the sequence lengths of the same protein. LEA proteins are shown. h, Schematic of the LEA protein sequences (black bars). Pink denotes phosphorylated and blue denotes unphosphorylated STY residues. Almost all STY residues in LEA proteins can be phosphorylated. i, Schematic of the sequences and domain topology of the receptor-like kinases SRF4, FER and CERK1. P-sites often preferentially occur in specific domains, notably the juxtamembrane domain. Protein sequence regions covered by identified peptides are marked in blue, and P-sites are marked in pink.
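The fold change (‘fc’) reported for each motif in e and f can be illustrated as the ratio of motif frequencies in the phosphorylated and background sets. The sketch below is a simplified assumption of that calculation, not the motif-X implementation: sequence windows and the pattern have equal length, and ‘X’ in the pattern matches any amino acid. All names and example sequences are hypothetical.

```python
def matches(window, pattern):
    """True if window fits pattern; 'X' in the pattern matches any residue."""
    return len(window) == len(pattern) and all(
        p == "X" or p == w for w, p in zip(window, pattern)
    )

def fold_change(foreground, background, pattern):
    """Frequency of the motif among phosphorylated windows divided by its
    frequency among background (unmodified) windows."""
    fg = sum(matches(w, pattern) for w in foreground) / len(foreground)
    bg = sum(matches(w, pattern) for w in background) / len(background)
    return fg / bg if bg > 0 else float("inf")

# Toy example: a proline-directed pattern with Ser followed by Pro.
phospho = ["ASPK", "GSPR", "TSAK", "LSPE"]
unmodified = ["ASPK", "ASAK", "GSGR", "LSLE"]
print(fold_change(phospho, unmodified, "XSPX"))
```

In this toy example the pattern occurs in three of four phosphorylated windows and one of four background windows.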
a, P-site localization within the structure of RCAR10. The RCAR10 structure (blue) was modelled using the RCAR11 protein crystal structure (cornflower blue) as a template115. ABA-binding loops are shown in turquoise, P-sites in pink, and ABA ligand in yellow. b, RCAR10 expression across tissues at the protein (blue, iBAQ), transcript (grey, TPM) and P-site (pink, intensity) level. c, Tissue-resolved total protein intensity and relative proportions of the members of the PP2C co-receptor family. Seed tissues stand out in terms of overall expression as well as the dominance of AHG1 in these tissues. d, Measurement of the ABA response after expression of RCAR10 or phosphomimetic mutant variants in combination with different PP2C co-receptors in protoplasts (Methods). Columns display the average ABA response (mean ± s.d., n = 3) and grey dots indicate individual measurements. Co-expression of the phosphatases HAI1–HAI3 leads to similar responses in both phosphomimetic mutants, whereas other co-expressed phosphatases show diverse responses. e, QKY expression across tissues at the protein (blue, iBAQ), transcript (grey, TPM) and P-site (pink, intensity) level. f, Members of the MCTP family clustered by sequence similarity (left) and schematic of their domain structures along with detected P-sites (right). MCTP11a, MCTP12 and MCTP13 were not detected (n.d.) in this study. MCTP15 (also known as QKY) is in bold. g, Number of independent transgenic plant lines (qky-9 mutant background) transformed with wild-type QKY, phosphomutant (S262A, SA; blue) or phosphomimetic (S262E, SE; purple) constructs that show complete, partial or no rescue of the mutant phenotype. qPCR results (mean ± s.d., n = 3; individual data points as grey dots) show the relative transgene expression in wild-type, qky-9 mutant and selected transgenic lines. 
h–j, Representative confocal images of six-day-old qky-9 pQKY::mCherry:QKY (WT QKY; n = 14 roots), qky-9 pQKY::mCherry:QKY(S262A) (phosphomutant; n = 21 roots), qky-9 pQKY::mCherry:QKY(S262E) (phosphomimetic; n = 15 roots) root epidermal cells of the meristematic zone. The punctate signal along the cell circumference shows the expected localization of the QKY protein. Arrows indicate punctate structures. Scale bars, 5 μm.
This file contains Supplementary Table 1: List of validated peptide spectra for uncertain proteins including mirror plots between experimental and predicted spectra. Supplementary Table 2: Position weight matrix logo plots for all identified serine, threonine and tyrosine phosphorylation motifs.
Supplementary Data 1: This file contains tissue sample names and a description of their growth conditions.
Supplementary Data 2: This file contains the tissue atlas expression values at the transcriptome, proteome and P-site levels.
Supplementary Data 3: This file contains a summary of expression evidence for gene isoforms and sORF sequences.
Supplementary Data 4: This file contains the results for 1D GO term enrichment analyses using transcript abundance or protein to RNA ratio (PTR) values as input.
Supplementary Data 5: This file contains a description of feature analysis parameters used in the protein abundance prediction model.
Supplementary Data 6: This file contains protein expression values for selected paralogous genes and information about their mutational phenotypes.
Supplementary Data 7: This file contains size exclusion chromatography protein elution profiles for flower, leaf and root tissue.
Supplementary Data 8: This file contains GO term enrichment analysis for WGCNA protein and trace modules and lists gene identifiers in protein and trace module intersections.
Supplementary Data 9: This file contains annotations of selected complexes and their identification in tissue atlas or size-exclusion chromatography experiments.
Supplementary Data 10: This file lists all identified serine, threonine and tyrosine phosphorylation motifs.
Supplementary Data 11: This file contains a list of all primer sequences and gene IDs used in this study.
About this article
Cite this article
Mergner, J., Frejno, M., List, M. et al. Mass-spectrometry-based draft of the Arabidopsis proteome. Nature 579, 409–414 (2020). https://doi.org/10.1038/s41586-020-2094-2