An atlas of substrate specificities for the human serine/threonine kinome

Protein phosphorylation is one of the most widespread post-translational modifications in biology1,2. With advances in mass-spectrometry-based phosphoproteomics, 90,000 sites of serine and threonine phosphorylation have so far been identified, and several thousand have been associated with human diseases and biological processes3,4. For the vast majority of phosphorylation events, it is not yet known which of the more than 300 protein serine/threonine (Ser/Thr) kinases encoded in the human genome are responsible3. Here we used synthetic peptide libraries to profile the substrate sequence specificity of 303 Ser/Thr kinases, comprising more than 84% of those predicted to be active in humans. Viewed in its entirety, the substrate specificity of the kinome was substantially more diverse than expected and was driven extensively by negative selectivity. We used our kinome-wide dataset to computationally annotate and identify the kinases capable of phosphorylating every reported phosphorylation site in the human Ser/Thr phosphoproteome. For the small minority of phosphosites for which the putative protein kinases involved have been previously reported, our predictions were in excellent agreement. When this approach was applied to examine the signalling response of tissues and cell lines to hormones, growth factors, targeted inhibitors and environmental or genetic perturbations, it revealed unexpected insights into pathway complexity and compensation. Overall, these studies reveal the intrinsic substrate specificity of the human Ser/Thr kinome, illuminate cellular signalling responses and provide a resource to link phosphorylation events to biological pathways.

Protein phosphorylation is one of the most widespread post-translational modifications in biology 1,2 . With advances in mass-spectrometry-based phosphoproteomics, 90,000 sites of serine and threonine phosphorylation have so far been identified, and several thousand have been associated with human diseases and biological processes 3,4 . For the vast majority of phosphorylation events, it is not yet known which of the more than 300 protein serine/threonine (Ser/Thr) kinases encoded in the human genome are responsible 3 . Here we used synthetic peptide libraries to profile the substrate sequence specificity of 303 Ser/Thr kinases, comprising more than 84% of those predicted to be active in humans. Viewed in its entirety, the substrate specificity of the kinome was substantially more diverse than expected and was driven extensively by negative selectivity. We used our kinome-wide dataset to computationally annotate and identify the kinases capable of phosphorylating every reported phosphorylation site in the human Ser/Thr phosphoproteome. For the small minority of phosphosites for which the putative protein kinases involved have been previously reported, our predictions were in excellent agreement. When this approach was applied to examine the signalling response of tissues and cell lines to hormones, growth factors, targeted inhibitors and environmental or genetic perturbations, it revealed unexpected insights into pathway complexity and compensation. Overall, these studies reveal the intrinsic substrate specificity of the human Ser/Thr kinome, illuminate cellular signalling responses and provide a resource to link phosphorylation events to biological pathways.
Phosphorylation of proteins at serine, threonine, tyrosine and histidine residues controls nearly every aspect of eukaryotic cellular function 1,2,5,6 . Misregulation of protein kinase signalling commonly results in human disease 7 . Deciphering the cellular roles of any protein kinase requires the elucidation of its downstream effector substrates. However, the majority of kinase-substrate relationships that have been published to date involve a relatively small number of well-studied protein kinases, while few, if any, substrates have been identified for the majority of the approximately 300 human protein Ser/Thr kinases within the human kinome [8][9][10] . This lack of knowledge of kinase-substrate relationships limits the interpretation of large mass spectrometry (MS)-based phosphoproteomic datasets, which to date have collectively reported over 200,000 Ser and Thr phosphorylation sites on human proteins 3,4,[11][12][13] . The specific kinases that are responsible for these phosphorylation events have been reported for less than 4% of these sites 3 , substantially limiting the understanding of cellular phosphorylation networks.
Well-studied Ser/Thr kinases are generally known to recognize specific amino acid residues at multiple positions surrounding the site of phosphorylation [14][15][16][17] . This short linear motif, which is characteristic of a given protein kinase, ensures fidelity in signalling pathways regulating phosphorylation at a given Ser or Thr residue. Knowledge of kinase-recognition motifs can facilitate the discovery of new substrates, for example, by scanning phosphoproteomics data for matching sequences. However, to date, phosphorylation-site sequence motifs are known for only a subset of the human protein Ser/Thr kinome. We have previously described combinatorial peptide library screening methods that enable the rapid determination of specificity for individual kinases Article based on phosphorylation of peptide substrates 18,19 . Here we apply those methods to experimentally determine the optimal substrate specificity for the large majority of the human Ser/Thr kinome, characterize the relationship between kinases on the basis of their motifs, and computationally use these data to identify the protein kinases that are likely to phosphorylate any site identified using MS or other techniques. Finally, we show how this information can be applied to capture complex changes in signalling pathways in cells and tissues after genetic, pharmacological, metabolic and environmental perturbations.

Profiling kinase substrate specificity
Substrate-recognition motifs across the human Ser/Thr kinome were determined by performing a positional scanning peptide array (PSPA) analysis. We used a previously reported combinatorial peptide library that systematically substitutes each of 22 amino acids (20 natural amino acids plus phosphorylated Thr and phosphorylated Tyr) at nine positions surrounding a central phospho-acceptor position containing equivalent amounts of Ser and Thr 19 (Fig. 1a). Using purified recombinant kinase preparations, we successfully obtained phosphorylation-site motifs for 303 Ser/Thr kinases, covering every branch of the human Ser/Thr kinase family tree as well as a collection of atypical protein kinases (Fig. 1b, Supplementary Fig. 1 and Supplementary Tables 1 and 2). Profiling of the large majority of these kinases, including 83 understudied 'dark' kinases, was lacking 8 .
Position-specific scoring matrices (PSSMs) derived from quantified PSPA data were analysed by hierarchical clustering to compare kinase substrate motifs across the kinome ( Fig. 2 and Supplementary Table 2). As expected, kinases sharing substantial sequence identity displayed a high degree of similarity in their optimal substrate motifs. However, we found many cases in which clustering by PSSM did not strictly recapitulate the evolutionary phylogenetic relationships between kinases inferred from their primary sequences (Fig. 2). Instead, members of most of the major kinase groups were distributed throughout the dendrogram, reflecting numerous examples in which kinases with low overall sequence identity have converged to phosphorylate similar optimal sequence motifs. For example, we found that a number of distantly related kinases (in the YANK, CK1 and 2, GRK and TGFβ receptor families) converged to phosphorylate similar sequence motifs despite their disparate locations on the kinome tree ( Fig. 2 (cluster 3)).
Overall inspection of sequence motifs associated with various branches of the motif-based dendrogram revealed that approximately 60% of the Ser/Thr kinome could be represented by simple assignment to one of three previously observed motif classes: selectivity for basic residues N terminal to the phosphorylation site (cluster 1; Fig. 2); directed by a proline residue at the +1 position (cluster 2); or a general preference for negatively charged (acidic and phosphorylated) residues at multiple positions (cluster 3) 15,20,21 . Notably, more than half of all of the reported phosphorylation sites observed by MS could be assigned to one of these three signatures (Extended Data Fig. 1). However, each of these motif classes could be further subcategorized on the basis of selectivity both for and against distinct sets of residues at other positions, reflecting considerable diversity within these clusters (Extended Data Figs. [2][3][4]. The remaining approximately 40% of the Ser/Thr kinome comprised many smaller groups that displayed unique sequence determinants ( Fig. 2; clusters 4-17). For example, motifs for the PIKK family kinases (ATM, ATR, DNA-PK and SMG1) clustered into a group that primarily selected a Gln residue at the +1 position (cluster 5), consistent with previous studies 22,23 . Notably, several clusters displayed shared consensus motifs that have not been well recognized previously, such as the group including the IRAK, IRE, WNK, SNRK and RIP kinases (cluster 13), of which the substrate motifs contained basic residues both N and C terminal to the phosphorylation site with dominant selection for aromatic residues at the +3 position. As another example, the kinases LKB1, CAMKK, PINK1 and PBK (cluster 14) primarily recognized hydrophobic residues N terminal to the phosphorylation site in combination with selection for turn-promoting residues (Gly or Asn) at the +1 position. Structural modelling of kinase-peptide complexes revealed complementary features within the kinase catalytic clefts that are probably responsible for the recognition of these motifs (Extended Data Fig. 5a,b).
An important and less generally recognized feature that dominated the clustering was strong negative selection against either positively or negatively charged residues at distinct positions within a motif, suggesting that electrostatic filtering strongly influences kinase substrate selection throughout the kinome 24 . We identified additional classes of amino acids, such as hydrophobic residues, that are selected against by a variety of kinases. These trends suggest that substrate avoidance has a fundamental role in dictating correct kinase-substrate interactions 25,26 .
Unexpectedly, we observed that many kinases (129 out of 303) selected either a phosphorylated Thr or a phosphorylated Tyr as the preferred amino acid in at least one position within the motif (Supplementary Fig. 1; where selectivity for phosphorylated Ser was assumed to be equivalent to phosphorylated Thr). In addition to kinases of which the dependence on phospho-priming was previously known (GSK3, CK1 and CK2 families; cluster 3), this phenomenon was particularly evident for the GRK-and YANK-family kinases (Extended Data Fig. 4), both of which have complementary basic residues within their catalytic domains (Extended Data Fig. 5c,d). Notably, individual GRK-family members showed unique and specific selection for the location of the phosphorylated Thr or phosphorylated Tyr residue within their substrate peptides. GRKs are best known for their role in the desensitization of G-protein-coupled receptors (GPCRs), whereby multisite phosphorylation induces the binding of arrestin proteins to inhibit signalling 27,28 . Our findings suggest that the capacity for only seven GRKs to differentially regulate 800 distinct GPCRs probably involves a complex interplay between initial sequence-specific phospho-priming of GPCRs by other Ser/Thr and Tyr kinases, followed by a second level of specificity resulting from GRK-dependent phosphorylation and subsequent recognition by a small number of β-arrestins.
Features of substrate-recognition motifs across the entire kinome could be structurally rationalized on the basis of the presence of specificity-determining residues at particular positions within the kinase catalytic domain [29][30][31][32] , leading to both expected and unexpected findings. For example, we found that half of the kinases display some degree of selectivity for either a Ser or a Thr as the phospho-acceptor residue (Extended Data Figs. 6 and 7). Consistent with our previously published observations 33 , Ser or Thr phospho-acceptor site preference strongly correlated with the identity of the 'DFG+1' residue (that is, the residue immediately after the canonical Asp-Phe-Gly motif within the kinase activation loop), with bulky residues (Phe, Trp, Tyr) at this position in Ser-selective protein kinases and β-branched residues (Val, Ile, Thr) at this position in Thr-selective kinases. However, for some DFG+1 residues, Ser versus Thr selectivity was unexpectedly context dependent. For example, a Leu residue at the DFG+1 position was observed in both Ser-selective and dual-specificity kinases, whereas a DFG+1 Ala residue resulted in a preference for Thr phosphorylation in the context of some kinases (for example, the mitogen-activated protein kinase kinase kinases (MAP3Ks)), but a preference for Ser specificity in others (the IκB kinases). These observations, notable only within the context of the complete Ser/Thr kinome, indicate that additional residues beyond the previously established DFG+1 position can influence Ser/Thr specificity in a context-dependent manner.

Annotation of the human phosphoproteome
Comprehensive knowledge of human Ser/Thr kinase specificity has the potential to 'deorphanize' the large number of reported phosphorylation sites with no associated kinase. To do so, we generated Nature | Vol 613 | 26 January 2023 | 761 a kinome-wide annotation of the human Ser/Thr phosphoproteome, comprising a previously curated set of 82,735 sites that were detected using high-throughput MS 4 plus an additional 7,017 sites that were identified using only low-throughput methods 3 . Using approaches adapted from previously published research, we computationally ranked these 89,752 sites against each Ser/Thr kinase motif 34,35 (Fig. 3a and Supplementary Table 3). Notably, approximately 99% of these phosphorylation sites ranked favourably for at least one kinase that we profiled (that is, the site scored in the top 10% of all sites in the human phosphoproteome for that kinase). These annotations were strongly concordant with sites for which protein kinases involved have been previously identified. For phosphorylation sites of which the upstream kinase has been previously verified by at least three independent reports, encompassing 969 sites and over one third of the kinome, our motif-based approach yielded a median percentile of 95% (that is, the reported site received a higher score than 95% of all putative phosphorylation sites in the phosphoproteome for its established kinase) (Extended Data Fig. 8a). Furthermore, when we back-mapped the motifs of all 303 profiled kinases onto the literature-reported phosphorylation sites, our approach yielded a median reported kinase percentile of 92% (that is, the reported kinase scored more favourably than 92% of all profiled kinases in our atlas for its established substrate) (Extended Data Fig. 9a). These rankings further improved when we considered kinasesubstrate pairs with higher numbers of previous reports (Extended Data Figs. 8 and 9), suggesting that, in a large majority of cases, the linear sequence context of phosphorylation sites contributes substantially to kinase-substrate relationships.
Notably, motif predictions alone successfully identified numerous prominently studied kinase-substrate relationships. For example, phosphorylase kinases PHKG1 and PHKG2 emerged as the top two hits (out of 303 kinases) for phosphorylating Ser15 of glycogen phosphorylase (Fig. 3b). This phosphoregulatory event, the very first to be discovered 36 , opened up the entire field of phosphorylation-dependent signal transduction. The most highly cited kinase-substrate interaction reported to date is the phosphorylation of the tumour suppressor p53 at Ser15 by the DNA-damage-activated kinase ATM, which scored among the top-ranking kinases associated with that site (Fig. 3c). Notably, other kinases reported to phosphorylate the same site-ATR, SMG1 and DNAPK-scored within the top four predicted kinases 3 .
Our approach could also correctly identify kinases for phosphorylation events driven by substrate co-localization or non-catalytic docking interactions, for which we expected less dependence on the phosphorylation-site motifs of their kinases. For example, we correctly identified both the mitochondrial-localized phosphorylation of pyruvate dehydrogenase by the pyruvate dehydrogenase kinases (Extended Data Fig. 10a) and the docking-driven phosphorylation of the MAP kinase ERK by MEK 37 (Extended Data Fig. 10b). Notably, the phosphorylation site on ERK was selected against by nearly every human protein kinase that we profiled except for MEK, explaining how ERK can be exclusively regulated by MEK while avoiding phosphorylation by the kinome at large. Finally, our approach could tease apart kinase subfamilies with similar motifs and correctly assign them to their established substrates. For example, we could distinguish between the CDK family kinases that assume classical roles in cell cycle progression (that is, CDK1, CDK2, CDK3, CDK4 and CDK6) and the subset of CDKs that govern gene transcription (that is, CDK7, CDK8, CDK9, CDK12, CDK13 and CDK19) 38,39 (Extended Data Fig. 11).
Functional annotation of the human phosphoproteome enabled us to examine global trends in kinase-substrate interactions. We found that most phosphorylation sites could be assigned to a small number of putative kinases (that is, BRAF-MEK1, ATM-p53 and CDK4-Rb; Fig. 3d and Supplementary Table 3). However, approximately one-third of all sites lacked unique negative sequence-discriminating features and, instead, matched well to the optimal phosphorylation motifs for a greater number of kinases 21,40,41,42 (that is, Ser119 of CREB, Ser9 of GSK3B and Thr1079 of LATS1; Fig. 3d). This could suggest the importance of other kinase-determining factors (scaffolds, localization and so on) for proper kinase substrate recognition, or may indicate that these specific phosphorylation sites are points of convergence for multiple signalling pathways. For example, cAMP response element binding protein (CREB) is canonically phosphorylated at Ser119 by cAMP-dependent protein kinase (PKA); however, numerous previous reports demonstrate that a broad range of cellular stimuli and drug perturbations impinge on the phosphorylation of this site by no less than ten distinct kinases 3 . Taken together, these findings suggest that the presence of negative-selectivity elements flanking a putative phosphorylation site can be used to insulate a substrate from inappropriate phosphorylation by dozens of related kinases, whereas the absence of such negative Article selectivity can enable protein kinases in distinct pathways to converge on the same target.

Motif-enrichment analysis
Cell signalling networks are complex and dynamic. Perturbation of kinase signalling pathways by genetic manipulations, treatment with growth factors and ligands, environmental stress or small-molecule inhibitors reshapes the phosphoproteome through both direct and indirect effects as a consequence of secondary signalling responses and/or off-target effects from the experimental treatment 43 .
Owing to the interconnected and dynamic nature of phosphorylation networks, distinguishing between initial signalling events and those that result from the subsequent activation of additional signalling pathways is a common and challenging problem. We reasoned that kinases underlying both primary and secondary phosphorylation events could potentially be revealed by a global motif-based analysis of changes in the corresponding phosphoproteome. To test this idea, we used publicly available MS datasets from cells collected in the absence or presence of various perturbations and scored all phosphorylation sites with our atlas of Ser/Thr kinase motifs. Kinase motifs that were significantly enriched or depleted after experimental treatment were then represented as volcano plots of motif frequencies and adjusted P values (Fig. 4a).
Using this approach, we found that sequence motifs corresponding to the most direct target of a genetic or chemical perturbation were among the most significantly regulated, as seen, for example, for the genetic deletion of the secreted primordial casein kinase FAM20C (Fig. 4b). When quantitative phosphoproteomics data from HepG2 cells lacking FAM20C 44 were analysed using our kinome-wide dataset, the most downregulated kinase-recognition motif corresponded to that of FAM20C. Similarly, when skeletal-muscle-like myotube cells were stimulated for 30 min with isoproterenol 45 , the most upregulated phosphorylation motifs corresponded to multiple isoforms of cAMP-dependent protein kinase (PKA)-canonical effector kinases downstream of the β 1 and β 2 adrenergic receptors (Fig. 4c). Notably, PKA motifs are highly similar to those of several other basophilic kinases, yet we could identify their enrichment in this scenario. Moreover, our comprehensive Ser/Thr kinome motif collection elucidated secondary signalling events in a dataset from HeLa cells arrested in mitosis using the PLK1 inhibitor BI 2536 (Fig. 4d) 46 ; here, in addition to observing a notable downregulation of substrates containing the optimal PLK1 motif, we also noted upregulation of substrates phosphorylated by ATM and ATR. This finding is in good agreement with previous reports that PLK1 can suppress DNA damage signalling in mitotic cells 47,48 .
Our motif-based analysis could also be used to reveal key signalling events resulting from more complex interventions. For example, we examined phosphoproteomic data from A549 cells treated with 6 Gy of ionizing radiation 49 (Fig. 4e). Our analysis revealed the up-and downregulation of numerous signalling pathways, including upregulation of canonical kinases involved in the DNA-damage response (ATM, ATR, DNA-PK) and downregulation of canonical kinases involved in cell cycle progression (CDK1, CDK2, CDK4 and CDK6), consistent with G1/S and G2/M arrest. Furthermore, we found up-and downregulation of less-appreciated DNA-damage-responsive kinases (MAPKAPK2 50,51 , PLK3 52 and LRRK2 53 ).
The full collection of Ser/Thr kinome motifs also enabled the temporal dynamics of signalling to be resolved from time-resolved phosphoproteomic datasets. For example, motif-based analysis of phosphoproteomic data from insulin-treated 3T3-L1 adipocytes 54 revealed rapid activation of the phosphoinositide 3-kinase signalling pathway within 1 min after insulin stimulation followed by subsequent activation of the MAPK pathway, together with downregulation of AMP-activated protein kinases within 60 min (Fig. 4f). Similarly, phosphoproteomic data analysis from lipopolysaccharide-stimulated dendritic cells 55   . .

ATM PHKG1
Most experimentally supported phosphorylation V-E-P-P-L-S-Q-E-T-F  Article and suppression of the MAPKs in parallel with the downregulation of their upstream MAPK3Ks (MEKK1, MEKK2 and ZAK) 56 , suggestive of a negative-feedback loop (Fig. 4g). Thus, comprehensive motif-based approaches, when applied to time-resolved phosphoproteomics experiments, can decipher the distinct temporal dynamics of different groups of kinases.

Discussion
Here we present the full spectrum of substrate motifs of the human serine/threonine kinome and provide an unbiased comprehensive framework to further explore their cellular functions. Globally, these motifs are substantially more diverse than expected, suggesting a broader substrate repertoire of the kinome. Hierarchical clustering of this dataset reorganized the kinome into at least 38 motif classes and introduced several shared motif features ( Fig. 2 and Extended Data Figs. 2-4). The Ser/Thr kinases that we profiled were, almost without exception, strongly discriminatory against specific motif features. These findings suggest that fidelity in kinase signalling pathways is largely achieved through selective pressure on substrates to avoid phosphorylation by the majority of irrelevant kinases, and that this may occur by tuning the amino acid sequences surrounding the phosphorylation sites to be disfavoured by non-cognate kinases. As this negative selection contributes substantially to proper substrate recognition, accurate identification of kinase-substrate relationships requires a comprehensive knowledge of kinase phosphorylation motifs-not only for an individual kinase of interest, but also for all other kinases in the human kinome that might compete for the same substrate pool.
When this kinome-wide dataset was used to predict the specific kinases that are responsible for substrate phosphorylation solely based on the amino acid sequence surrounding the phosphorylation site, the results were highly accurate at identifying correct kinase-substrate relationships, even without knowledge of tissue specificity, scaffolding effects or subcellular localization. Including such additional information will probably further improve these predictive approaches 57,58 . A limitation of using first-order peptide arrays in these experiments is that they do not directly measure the contributions of interpositional contacts within the substrate peptides, which we have previously shown can affect substrate selection for some tyrosine kinases 32 , albeit less so for Ser/Thr kinases 59 . Moreover, we were unable to differentiate between positional selection of Ser or Thr residues and direct phosphorylation of neighbouring residues (for example, peptides containing more than one phospho-acceptor). Structural modelling approaches guided by kinase substrate motif data will potentially decipher this additional information to further improve predictions 60,61 .
The examination of MS phosphoproteomic datasets using this global collection of motifs yielded potential biological insights and putative kinase substrates (Fig. 4). For example, in cells undergoing exposure to ionizing radiation (Fig. 4e), ATM was predicted to target 37 of the phosphorylation sites that were upregulated, most of which have never been associated as substrates for ATM (Supplementary Table 4).   46 ). e, A549 cells 2 h after exposure to 6 Gy of ionizing radiation 49 . f, 3T3-L1 adipocytes after serum starvation and then 1 min and 60 min treatment with 100 nM insulin 54 . g, C57BL/6J mouse bone-marrow-derived dendritic cells after 30 min and 4 h treatment with 100 ng ml −1 lipopolysaccharide (LPS) 55 . The enrichments in b-g were determined using one-sided exact Fisher's tests and corrected for multiple hypotheses using the Benjamini-Hochberg method. Fully annotated versions of these plots are presented in Supplementary Fig. 2.
As the application of phosphoproteomics to human clinical samples and disease model systems continues to advance, our comprehensive motif-based approach will be uniquely equipped to unravel the complex signalling that underlies human disease progressions, mechanisms of cancer drug resistance, dietary interventions and other important physiological processes. In summary, we foresee that this will provide a valuable resource for a broad spectrum of researchers who study signalling pathways in human biology and disease.

Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-022-05575-3.

Cell lines
Expi293 (Thermo Fisher Scientific) and HEK293T (ATCC) cells were obtained directly from vendors that perform short tandem repeat genotyping for authentication of human cells and were verified to be mycoplasma-free. Sf9 insect cells were obtained from Thermo Fisher Scientific.

Plasmids
For expression and purification from bacteria, the DNA sequences for the human Ser/Thr kinases, binding partners and chaperones listed below were codon-optimized for expression in Escherichia coli using GeneSmart prediction software (GenScript). Optimized coding sequences were synthesized as gBlocks (Integrated DNA Technologies) carrying 16 bp overhangs at the 5′ and 3′ ends to facilitate in-fusion cloning (Clontech) into pET expression vectors (EMD Millipore). pCDFDuet1 constructs were as follows: HSP90AA1-His 6 (full length), hereafter referred to as 'HSP90'; untagged HSP90(full length); His 6  For enhanced expression in mammalian lines cells, the DNA sequences of His 6 -GST-SBK1(full length) and Flag-His 6 -WNK3(1-434) were optimized for expression in Homo sapiens using GeneSmart (Gen-Script) and synthesized as gBlocks (Integrated DNA Technologies) carrying 16 bp overhangs to facilitate in-fusion cloning into digested pCDNA3.4 (Thermo Fisher Scientific).

Expression and Purification from bacteria
Transformations were performed using BL21 Star cells (Thermo Fisher Scientific) unless specified otherwise. Antibiotic concentrations used were as follows: carbenicillin (100 mg l −1 ), kanamycin (50 mg l −1 ), spectinomycin (25 mg l −1 ) and chloramphenicol (25 mg l −1 in ethanol, prepared fresh). Transformed cells were grown in 1 l Terrific Broth by shaking at 190 rpm at 37 °C until the optical density (λ = 600 nm) reached 0.7-0.8, at which point 1 mM IPTG was added to induce expression. The cells were then transferred to a refrigerated shaker and shaken at 220 rpm at 18 °C for 16-20 h. Cells were centrifuged at 6,000g, and the pellets were snap-frozen in liquid nitrogen and stored at −80 °C.
All of the steps for protein purification were performed at 4 °C. Cell pellets were solubilized in lysis buffer (described below) and lysed by probe sonication. The lysate was centrifuged at 20,000g for 1 h and the supernatant was combined with affinity purification resin, nickel NTA (Qiagen) or glutathione Sepharose (GE Health) that had been rinsed in base buffer. The supernatant-bead slurry was agitated for 30 min. Resin was washed with 1 l base buffer and eluted in 10 bed volumes of elution buffer. Eluted protein was concentrated using the Ultra Centrifugal Filter Units (Amicon), supplemented with 1 mM DTT and 25% glycerol, and snap-frozen in liquid nitrogen and stored at −80 °C.

Expression and purification from mammalian cells
Expi293F cells (Thermo Fisher Scientific) were cultured in 500 ml Expi293 Expression Medium (Thermo Fisher Scientific) in 2 l spinner flasks on a magnetic stirring platform at 100 r.c.f. at 36.8 °C under 8% CO 2 . For transfection, 500 μg of expression constructs was diluted in Opti-MEM I Reduced Serum Medium (Thermo Fisher Scientific). ExpiFectamine 293 Reagent (Thermo Fisher Scientific) was diluted with Opti-MEM separately then combined with diluted plasmid DNA for 10 min at room temperature. The mixture was then transferred to the cells (3 × 10 6 cells per ml) and stirred. Then, 20 h after transfection, ExpiFectamine 293 Transfection Enhancer 1 and Enhancer 2 (Thermo Fisher Scientific) were added to the cells. Two days later, the cells were centrifuged at 300g for 5 min, snap-frozen in liquid nitrogen and stored at −80 °C (3 days after transfection).
All of the steps for protein purification were performed at 4 °C. Cell pellets were solubilized in lysis buffer and lysed by dounce   [80][81][82][83][84]. After completion of the reaction, the solutions were spotted onto streptavidin-conjugated membranes (Promega, V2861), where the peptides tightly associated through their C-terminal biotinylation. The membranes were rinsed and then imaged using the Typhoon FLA 7000 phosphorimager (GE) to measure the extent of peptide phosphorylation. Raw data (GEL file) were quantified using ImageQuant (GE) to generate densitometry matrices (Supplementary Table 2). For the kinase ALPHAK3, spots were normalized to the surrounding background, owing to spatial variation in background signal. PDHK1 and PDHK4 showed dual specificity for serine and tyrosine. For these kinases, we used a customized peptide substrate library devoid of tyrosine residues at randomized positions.
In total, 283 human kinase motifs, one motif from a mouse kinase orthologue (CDKL5), one motif from a rat kinase orthologue (KIS) and one motif from an arthropod Pediculus humanus corporis kinase orthologue (PINK1), were combined with 17 human kinase motifs that we previously published, including AKT1 85 , SRPK1 35  For the zero-control experiments in Extended Data Fig. 6, biotinylated peptides were synthesized containing only serine or threonine as the phospho-acceptor, where all nine surrounding positions contained degenerate mixtures of the 20 natural amino acids excluding serine, threonine, tyrosine and cysteine.

Matrix processing
The densitometry matrices were column-normalized at all positions by the sum of the 17 randomized amino acids (excluding serine, threonine and cysteine), to yield PSSMs (Supplementary Table 2). PDHK1 and PDHK4 were normalized to the 16 randomized amino acids (excluding serine, threonine, cysteine and additionally tyrosine), corresponding to the uniquely customized peptide library that profiled these kinases. The cysteine row was scaled by its median to be 1/17 (1/16 for PDHK1 and PDHK4). The serine and threonine values in each position were set to be the median of that position. The ratio of serine versus threonine phospho-acceptor favourability (S 0 and T 0 , respectively) was determined by summing the values of the serine and threonine rows in the densitometry matrix (S S and S T , respectively), accounting for the different serine versus threonine composition of the central (1:1) and peripheral (only serine or only threonine) positions (S ctrl and T ctrl , respectively), and then normalizing to the higher value among the two (S 0 and T 0 , respectively, Supplementary Note 1).

Matrix clustering
The dendrogram in Fig. 2 was generated using the normalized matrices with the 20 unmodified amino acids, as well as phosphothreonine and phosphotyrosine. The linkage matrix was computed using the SciPy package in Python (v.3.7.6), using the Ward method. Results were converted to the Newick tree format and plotted using FigTree (v.1.4.4).

Substrate scoring
For scoring substrates, the values of the corresponding amino acids in the corresponding positions were multiplied and scaled by the probability of a random peptide (Supplementary Note 2).
For the percentile score of a substrate by a given kinase, we first computed the a priori score distribution of that kinase PSSM by scoring a reference Ser/Thr phosphoproteome comprising 82,735 identified sites 4 using the method discussed above (Fig. 3a). The percentile score of a kinase-substrate pair is defined as the percentile ranking of the substrate within the score distribution of each kinase 34 . This value was used when analysing all of the detected phosphorylation sites for kinase enrichment.

Kinase enrichment analysis
The single phosphorylation sites (not including multi-phosphorylated peptides) in the analysed phosphoproteomics studies were scored by all of the characterized kinases (303 Ser/Thr kinases), and their ranks in the known phosphoproteome score distribution were determined as described above. For every non-duplicate, singly phosphorylated site, kinases that ranked within the top 15 kinases for the Ser/Thr kinases were considered to be biochemically favoured kinases for that phosphorylation site. For assessing kinase motif enrichment in phosphoproteomics datasets, we compared the percentage of phosphorylation sites for which each kinase was predicted among the upregulated/ downregulated (increased/decreased, respectively) phosphorylation sites (sites with |log 2 [fold change]| equal or greater than the log[fold change] threshold), versus the percentage of biochemically favoured phosphorylation sites for that kinase within the set of unregulated (unchanged) sites in this study (sites with |log 2 [fold change]| less than the log[fold change] threshold). The log-transformed fold change threshold was determined to be 1.5 for all panels in Fig. 4, except for Fig. 4e, in which the threshold was set to 0.5 owing to the low range of the log[fold change] in the data. Contingency tables were corrected using Haldane correction (adding 0.5 to the cases with zero in one of the counts). Statistical significance was determined using one-sided Fisher's exact tests, and the corresponding P values were adjusted using the Benjamini-Hochberg procedure. Kinases that were significantly enriched (adjusted P ≤ 0.1), or depleted (log 2 [frequency factor] < 0) for both upregulated and downregulated analysis were excluded from downstream analysis. Then, for every kinase, the most significant enrichment side (upregulated or downregulated) was selected on the basis of the adjusted P value and presented in the volcano plots.

Sequence logos
Sequence logos were made using logomaker package in Python 90 . For individual kinases, the normalized matrix was used, where the height of every letter is the ratio of its value to the median value of that position. The serine and threonine heights in the central position (position zero) were set to the ratio between their favourability. For clustered groups of kinases, the average matrix was calculated and presented as sequence logo as described above.

Comparative analyses between amino acids in the kinase domains and their substrate specificities
For Extended Data Fig. 7, kinases were sorted by their log 2 [S 0 /T 0 ] values. For the sequence logo, kinase domains of 290 available kinases were obtained from previously aligned kinase sequences 91 . The alignments to residues Met1-Leu296 in CDK2 (Protein Data Bank (PDB): 1QMZ) were obtained for each kinase, and the frequencies of amino acids in increments of 15-kinases were calculated and plotted as a sequence logo.

Known kinase-substrate pairs
Experimentally validated kinase-substrate relationships were obtained from PhosphoSitePlus ( July 2021). The number of reports for each pair was determined by the sum of the in vivo and in vitro reports.

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
The data generated (RAW files) and analysed in this study are provided in this paper. All plasmids generated in this study have either been deposited at Addgene or available on request.