Introduction

Active transport between the nucleus and cytoplasm is an essential regulatory mechanism for many cellular proteins. As a major nuclear export factor, chromosome maintenance protein 1 (CRM1; or exportin-1, XPO1) mediates nuclear export of hundreds of distinct cargo proteins by recognizing short sequence motifs called Nuclear Export Signals (NES)1,2,3. CRM1 shuttles between the nucleus and the cytoplasm, binds cargo molecules at high RanGTP levels inside the nucleus, traverses the nuclear pore complex (NPC) as a ternary cargo–CRM1–RanGTP complex, and releases cargo into the cytoplasm upon hydrolysis of the Ran-bound GTP4. Since spatial re-localization of oncoproteins and tumor suppressor proteins is important in cancer cells, a better understanding of the NES can aid both basic research on this process and the discovery of anticancer agents5.

Classical NES motifs in the early studies were referred to as a cluster of hydrophobic residues, mostly leucines (hence also called Leu-rich NES), within a 10–15 residue-long sequence motif1,6,7. Many years of research on various export cargoes and randomization-and-selection screens showed that more residue types, such as Ile, Val, Met, and Phe, are also allowed at the hydrophobic positions of the CRM1-dependent NES signals8,9. These hydrophobic residues (Φ) are spaced with various patterns following the consensus Φ1-(x)2–3-Φ2-(x)2–3-Φ3-x-Φ4, where x denotes any amino acid. Later, structural studies of the CRM1 bound to NES peptides revealed another hydrophobic pocket in CRM1 that can bind to one more hydrophobic amino acid (Φ0)10,11. This site is less restricted to hydrophobic residues compared to others. Until recently, the existing 11 consensus patterns were defined by the peptide library-based study9 and structural analyses of CRM1-NES complexes11,12,13,14. They consist of four to five hydrophobic residues (Φ0-Φ4; generally, L, I, V, M, and F) which are bound to the corresponding hydrophobic pockets (P0-P4) in CRM1. Based on the pattern of these Φ’s and spacing sequences, the NES motifs are classified as class 1a, 1b, 1c, 1d, 2, 3, and 4. Additionally, compared to these classes, some peptides bind in the opposite (−) direction, making their Φ3-Φ4 positions bound to P0-P1 (class 1-reverse)13. Until recently, X-ray crystal structures of CRM1 bound to NES peptides of the 1a, 1b, 1c, 2, 3, 4, and 1a-reverse classes have been solved. Depending on the classes, the NES peptides showed distinct backbone conformations binding to the central portion of the hydrophobic groove of CRM1. One turn helix in the middle is remarkably conserved among all classes maintaining a hydrogen bonding with the Lys residue (Lys568) in human CRM114.

Modeling short motifs or patterns like NES is a major research area in bioinformatics. Since NES motifs are essential regulators of the subcellular location of proteins in relation to cancer, cell cycle, cell differentiation and other important aspects of molecular biology, prediction of the NES motif is of great interest but still remains a challenge. Until now, more than 300 experimentally identified protein cargoes are recorded in databases such as validNESs15 and NESdb16 and over 1000 putative CRM1 cargoes were identified in a recent proteomics study17. Based on the ever-growing repertoire of the protein cargoes of CRM1, many attempts were tried to employ machine learning approaches to decide whether a given sequence has a CRM1-dependent NES motif or not. Several computational tools, such as NetNES8, NESsential18, NESmapper19, LocNES20, Wregex21, and NoLogo22 have been developed to predict NES motifs. Most of them are sequence-based predictors based on consensus pattern matching and calculation of biophysical properties such as disordered propensity, secondary structure components, and solvent accessibilities. To capture the diversity of the NES sequences, the consensus patterns were generally applied in the form of regular expression or position-specific scoring matrix (PSSM). Unfortunately, NES patterns are quite commonly observed in a large portion of the proteome so that the prediction based on these consensus patterns results in a high false positive rate. Since a functional NES needs to be solvent-exposed and not buried in a globular fold, Kırlı et al. applied these criteria and pattern matching to identify NES motifs in a set of validated, new CRM1 cargoes and found that functional NES motifs still could not be identified in a significant portion of them17. Moreover, sequences of functional NES motifs appear to be more diverse than previously appreciated. 
A large portion of experimentally defined NES regions does not match the current consensus patterns17. To reduce the high false positive rate, other biophysical features such as disorder propensity, secondary structure content, and evolutionary conservation were incorporated into machine learning algorithms such as support vector machines (SVM) or neural networks8,20. However, the false positive rates remain high. Besides the ever-expanding set of NES patterns, which produces many false positives in NES prediction, the limited information about direct CRM1 binding of the annotated NES regions hampers the development of accurate predictors from the available data sets. Therefore, predicting NES motifs from protein sequence information alone appears to have limitations, and combining it with structure-based prediction could be a new strategy to distinguish NES motifs from false positives.

In this study, using validated cargo protein sequences in NESdb and validNES, we provide a comprehensive look-up table which contains the location of the NES consensus patterns with the disorder propensity plots, conserved domain information, and the predicted secondary structure. This information could be useful for determining the most plausible NES region in the context of the whole protein sequence and for suggesting possibilities for some non-binders of the annotated NES regions. In addition, for the first time, we adopted the structure-based prediction of the NES sequences bound to the CRM1’s NES binding groove, using multiple crystal structures of CRM1-NES peptide as templates. For several experimentally validated NES peptides and false positive ones, we calculated the relative binding energy of the sequence segments at the CRM1’s binding pocket, and the prediction reliability of these binding energies was validated by the experimental binding affinities. Combining sequence-based and structure-based predictions, we suggest the novel and more straight-forward approach to identify NES sequences that bind directly to CRM1.

Results and Discussion

Deducing NES consensus pattern-matching sequences in candidate cargo proteins

Using the validated cargo protein sequences in NESdb and validNES (which have Leptomycin B (LMB)-sensitive data as evidence of CRM1-dependency), we extracted the NES consensus pattern-matching sequence segments based on the modified version of the Kosugi consensus16,20 as summarized in Fig. 1. All possible consensus patterns are recorded and prioritized by the empirical class priority (see Methods for details). Based on these criteria, 4226 consensus-matching segments were extracted from 318 cargo protein sequences. Among them, 463 segments were treated as candidate NES motifs because they occur in regions that overlap with experimental evidence, and 3763 were treated as false positives (FPs). The experimental NES regions of 54 cargo proteins do not match the current consensus and are not considered in this study. Also excluded are four cargo proteins with no reported NES regions and five cargoes with long reported NES regions (>25 residues) that do not have specific residues annotated. Among the consensus patterns, class 1a is the most abundant class (41%), as expected. In particular, class 1a is observed more than twice as often in the candidate NES sequences as in the false positive sequences. Classes 1c, 2, and 3 follow with 14–15% each, class 1a-reverse accounts for 8.6%, and classes 1b, 1d, 4, and 1c-reverse are quite rare (Fig. S1).

Figure 1

NES consensus patterns used in this study. At the hydrophobic positions, Φ1–4 are Leu, Ile, Val, Met, or Phe; Thr or Ala is allowed at one of the Φ1 and Φ2 positions. Φ0 is not restricted to hydrophobic amino acids. In the reverse classes, the criteria are applied in the opposite direction, and one of Φ0 or Φ1 should be Leu, Phe, or Met. The spacer residues (x) can be any amino acid, with several exceptions. The spacers in Φ2 [X]nΦ34 (or Φ01[X]nΦ2 in reverse classes) cannot be Pro or Trp. For class 4, at least one of the spacer residues in Φ3XXXΦ4 should be Pro to make a turn (as observed in the X-ray crystal structure of the CRM1-X11L2 peptide complex).

A comprehensive look-up table of NES patterns in NES cargo proteins

For an NES motif to be accessible to CRM1, it should not be located within a compactly folded protein domain. The NES motif may be located at the N-terminus, at the C-terminus, or within an unstructured region of an export cargo11. Therefore, for precise prediction of export signals, it is crucial to consider the motifs’ locations with respect to protein domains and disordered regions. For all possible NES consensus patterns that we extracted from the cargo proteins, we analyzed their relationship to ordered/disordered regions, known domains, and predicted secondary structures, and we provide the results as a comprehensive online table. For a given full protein sequence, we plotted the disorder propensity, the locations of the known domains, the predicted secondary structures, and all possible NES consensus regions (Fig. 2). For a given entry, the information annotated in NESdb or validNES, such as evidence of CRM1-dependency, mutation data, and functional sequences or sites, is listed alongside. The locations of all NES consensus-matching segments are marked together with the experimentally validated regions (Fig. 2A, the bottom of the plot). The reference databases (NESdb, validNES, and UniProt), a protein visualization tool (ProViz)23, and a structure and model database (SWISS-MODEL repository)24 are linked for user convenience, and a filter for easy look-up is also provided. This table could be useful for determining the most likely NES region in the context of a whole protein sequence. The online table is accessible via: http://prodata.swmed.edu/nes_pattern_location/.

Figure 2

Location of the NES consensus patterns in Snurportin-1. (A) Disordered propensity, conserved domain information, predicted secondary structure, and the location of the consensus patterns are plotted together. The defined ordered region (by the cutoff value of 0.1; gray dashed line) is represented by the sky-blue box at the top. The regions of the conserved domains annotated in smart, Pfam, NCBI-curated, and CDD are marked in the middle. The predicted secondary structures (SS) were colored by red, black, and blue for α-helix, coil, and β-strand, respectively. The gradient of the color corresponds to the confidence level of the prediction. For the NES regions, experimentally validated regions are displayed in blue (with mutation data annotated in NESdb) and cyan (annotated as a functional sequence in NESdb or as a site in validNES). All the consensus pattern matching segments are located at the bottom. Segments not in the ordered regions and without β-strand predictions in the middle are highlighted in yellow. The red boxes are the pattern-matching segments overlapping with experimental evidence. (B) The crystal structure of CRM1-SNUPN complex structure (PDB id: 3GB8)12. SNUPN is displayed by the cartoon, and the validated NES motif, Snurportin1 domain, and Snurportin-1_C domain are colored in red, green, and orange, respectively. CRM1 is represented by a white surface. (C) The list of the pattern-matching sequences in SNUPN. In the ‘candidates’ column, NES candidates and false positives are annotated with “cand” and “fp,” respectively. If the segment is located in the disordered or boundary region, it is flagged with “_D” while in the ordered region, it is flagged with “_O.” If the segment’s β-strand content is over 0.5, it is flagged with “_beta.” In the ‘sequence’ column, hydrophobic positions are colored in red, and the positions with the experimental evidence are marked with ‘*’ (mutation) and ‘+’ (functional sequence in NESdb or sites in validNES). 
The values in ‘diso,’ ‘spotd,’ and ‘iup’ are the average disordered propensity for the segment calculated by DISOPRED3, SPOT-Disorder, and IUPRED2A, respectively. The locations with respect to disordered/ordered region or conserved domains are listed in the ‘loc_DISO’ and ‘loc_CDD’ columns. ‘beta’ is for the β-strand content in the middle of the segment.

NES candidates in the disordered or ordered regions

Even if a sequence motif can be fitted to the NES consensus, a motif that is located deep in the globular fold can hardly bind to CRM1 unless the region unfolds. In some cases, it may be possible to unfold and bind, but we assume that these cases would be very limited. Also, short linear interaction motifs like NES motifs have been proposed to be locally disordered to facilitate dynamic interactions with their binding partners, and the NES prediction algorithms have used disorder context to help distinguish correct NES motifs from false predictions18,20. However, NES motifs do not necessarily have to locate in the fully disordered region. Indeed, we have observed that some NES candidates are located in the fully disordered regions, but others are located next to ordered or “boundary” regions. Therefore, we employed the disorder propensity as a pre-filter to remove the segments located in the “highly” ordered regions.

Various computational tools have been developed for analyzing potential intrinsic disorder of protein sequences and have been quite successful owing to the clear association between disorder propensity and sequence features such as low complexity or high aromatic composition. We utilized DISOPRED325 and SPOT-Disorder26, which use alignment-based profiles of homologous sequences to detect disordered regions, and IUPred2A27, which is much faster since it does not rely on sequence alignment. For some proteins, the predicted disordered regions differ considerably between programs. In order to define ordered and buried regions with high confidence, we applied a strict cutoff value (~0.1) to decide the order/disorder border lines (note that the cutoff value for disordered regions in most of the programs is ~0.5). If a residue’s disorder propensities predicted by both DISOPRED3 and SPOT-Disorder are below 0.1, the residue is defined as being in a highly ordered region (note that the values predicted by IUPred2A are also recorded for reference).
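The residue-level rule above (both DISOPRED3 and SPOT-Disorder below 0.1) can be sketched in a few lines. This is only an illustrative sketch: the function names are hypothetical, and the segment-level rule (a segment is "ORD" if every residue is highly ordered, "DISO" if none is, and "boundary" otherwise) is an assumption made for the example, not a definition taken from the text.

```python
ORDER_CUTOFF = 0.1  # strict cutoff quoted in the text


def ordered_mask(disopred, spot):
    """Per-residue flag: True when BOTH predictors score below the cutoff."""
    return [d < ORDER_CUTOFF and s < ORDER_CUTOFF for d, s in zip(disopred, spot)]


def classify_segment(mask, start, end):
    """Classify residues [start, end) (0-based, end exclusive).

    Assumed rule for illustration only: all residues ordered -> "ORD",
    no residue ordered -> "DISO", mixed -> "boundary".
    """
    window = mask[start:end]
    if all(window):
        return "ORD"
    if not any(window):
        return "DISO"
    return "boundary"


# Hypothetical per-residue propensities for a 20-residue toy protein.
disopred = [0.05] * 10 + [0.60] * 10
spot = [0.02] * 10 + [0.70] * 10
mask = ordered_mask(disopred, spot)
print(classify_segment(mask, 0, 8), classify_segment(mask, 6, 14), classify_segment(mask, 12, 20))
```

Requiring agreement between two predictors at a low cutoff keeps the "highly ordered" label conservative, which matches the stated intent of using it only as a pre-filter.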

As shown in Fig. 3A, 55% of the NES candidate motifs are located in the disordered region, and 37% are found in the boundary region between the ordered and disordered parts. Only 8% of the NES candidate motifs are located in the highly ordered region. Among the 463 candidate motifs, 37 segments (from 20 cargo proteins) are located in the highly ordered region and are therefore less likely to be accessible for CRM1 binding. For example, HDAC1 (UniProt ID: Q13547) has a reported NES motif with mutation data (L158A/L161A/L164A) for nuclear export28. This region can be fitted to classes 1c, 2, or 3, but it is located in the highly ordered region. The crystal structure of HDAC1 (PDB ID: 4bkx) showed that this segment is buried in the globular domain and seems unlikely to be accessed by CRM1 (Fig. 4A). Note that in its homolog HDAC5, the candidate NES motif (1081EEAETVSAMALLSVGA1096, class 1a) is located in the disordered region after the conserved Hist_deacetyl domain and was found to bind directly to CRM1. The corresponding region (after the Hist_deacetyl domain) in HDAC1 (358YLEKIKQRLFENLRMLP374, class 1c) could also be considered a possible NES motif of HDAC1. Table S1 lists the NES candidate motifs located in the highly ordered region, and Fig. 4A,C shows examples of these segments in the available 3D structures.

Figure 3

Location of the candidate NES and false positive sequences. (A) Location with respect to the disordered or ordered regions. DISO: located in the disordered region; boundary: located at the end of the highly ordered region; ORD: located in the highly ordered region. (B) Location with respect to the known domains annotated in CDD. MID: located in the middle of the domain; boundary: located at the end of the domain; small: located in the small domain (<50 residues); NA: located in the region with no annotated information.

Figure 4

Examples of possible non-binders to CRM1. (A) Segments located in the ordered globular fold. (B) Segments with β-strands in the middle. (C) Segments located in the ordered region and have β-strands in the middle. The hydrophobic residues are colored in red or green in the sequences and displayed as sticks in the structures.

Among the false positives, 19% of the segments are located in the highly ordered region, a larger percentage than for the candidate NES motifs (note that far fewer segments fall in the ordered region than in the disordered region because we use a stringent cutoff for defining the ordered region). The false positives in the disordered and boundary regions account for 31% and 51%, respectively.

CDD domains and NES locations

To analyze the candidate NES motifs’ location with respect to the conserved regions, we extracted the conserved domain information for the cargo protein sequences using four different databases, i.e., SMART, Pfam, NCBI-curated, and the Conserved Domain Database (CDD). As shown in Fig. 3B, only 33% of the candidate NES regions are located in the middle of the CDD domains, and 40% are in the boundary region. It seems that NES regions are not necessarily located within protein domains. Rather, the known domains are often considered to form folding units, masking possible motifs from binding other proteins. Among the false positives, more than half are located in the middle of the known domains. This may be because hydrophobic residues are commonly located in the protein core or domains.

Secondary structure components of the NES peptides

Crystal structures of CRM1-bound NES peptides have been resolved for the classes 1a, 1a-reverse, 1b, 1c, 2, 3, and 4. They showed distinct backbone conformations that match their hydrophobic positions to the corresponding hydrophobic pockets in CRM1. Structural analysis, as well as secondary structure prediction of NES motifs, suggest that most NES motifs contain α-helices or helix-to-extended conformation12,13,14. The class 1d is also expected to have helix-strand, and other reverse (−) classes are likely the reverse of their (+) counterparts14. The common feature of the backbone conformations among the classes is one turn of helix at the region from Φ2 to Φ314.

In our analysis of the 463 candidate motifs, 36 segments (from 23 cargoes) have a β-strand conformation in the middle (β-strand content of the middle part is >50%) (Table S2). Among them, 11 segments were confirmed to have β-strands in the available X-ray or solution structures. For example, NPM has two reported NES regions, but both of them are predicted to form β-strands in the middle of the segments. As shown in Fig. 4B, the two segments are both β-strands located in the middle of the jelly-roll fold. Indeed, both regions were also reported to be quite weak binders of CRM129, and the sequence of 42–61 failed to bind CRM1 in a GST-pulldown assay (Chook Lab, unpublished results; annotated in NESdb). The candidate NES region in TDP-43 is also located in β-strands within a folded globular RRM domain, and it was recently shown to be a non-binder of CRM1 that is instead exported by passive diffusion30. For six segments, there is no experimentally determined structure, but homology models showed β-strands for the segments. For 17 segments, no structural information is available. For two segments, the conformation in the modeled structures (with sequence identities of 79% and 98%, respectively) was found to be helical, reflecting the limitations of secondary structure prediction.

Evaluation of the stability of the NES peptides at the CRM1 binding groove based on structure modeling

Recent structural works of CRM1 complexed with various cargo sequences expand the possible consensus patterns13,14. Also, the NES-binding site in RanGTP-bound CRM1 is found to be quite rigid, and the peptides display CRM1-dependent NES activity only if their backbone conformations can place a sufficient number of the hydrophobic residues into the CRM1’s binding groove11. The adapting conformation of the peptides can be efficiently analyzed by structure-based modeling methods so that the application of the structural information can advance more accurate NES prediction.

Using the reported NES peptides with experimental binding affinities14,31 as a benchmarking set (Table 1), we evaluated the binding energy (Ebind) for a given peptide sequence at the CRM1 groove (see Methods for details). Binding energy can be assumed as relative stability of the protein(CRM1)-peptide(NES) complex structure compared to the protein itself and free peptide. The lower the binding energy, the higher the possibility for the peptide segments to bind at CRM1. Multiple crystal structures of CRM1-NES peptide (super PKI and MVM-NS2 for classes 1a; FMRP-1b for class 1b; SNUPN for class 1c; FMRP and SMAD4 for class 2; HIV-Rev for class2-rev type; X11L2 for class 4; and CPEB4 for class 1a-reverse; class 1a templates can be used to fit class 3 NES peptides) were utilized as templates. The model generation and energy calculation process are summarized in Fig. 5A.

Table 1 Peptide sequences of the validated NES motifs or false positives used in the structure-based modeling.
Figure 5

Structure-based prediction of the stability of CRM1-NES peptide complex. (A) CRM1-NES peptide complex model generation and Ebind calculation procedure. (B) Generated models for the complex structures of CRM1-NES peptides with lowest Ebind. Class 1a peptides are displayed with CRM1 (in white) at the top. The hydrophobic (Φ) residues of these NES peptides (shown in the sticks) occupy the corresponding hydrophobic pockets (P0-P4) in CRM1. Peptides of other classes are shown at the bottom with the hydrophobic residues shown in the sticks.

Final model structures showed that all classes were predicted well, with their Φ residues bound to the corresponding hydrophobic pockets (Fig. 5B). The calculated Ebind selected the right template for each class, so it can be used to find the most plausible class when multiple consensus patterns are found in one segment. The calculated Ebind values correlated quite well with the experimental KD values (Fig. 6, left; R2~0.63; Pearson’s r~0.79 with p = 2e-6). However, for the two PKI mutant peptides, which have extremely low binding affinities, the Ebind scores are not clearly distinguishable from those of weak binders such as SNUPN, SMAD4, and HPV-E7. For the PKI double mutant peptides, we found a large cavity at the binding interface with CRM1 (Fig. S3A), but this feature, which is clearly detrimental to binding, is not well reflected in the modeling process or energy calculation. To penalize the interface cavity of the complex structure, residue solvent accessibility (RSA) for key interface residues (Fig. S3B) was calculated using the NACCESS program32 and treated as another scoring term. The RSA-corrected Ebind score (EbindRSA) is obtained as EbindRSA = Ebind + w∙RSA, where w is the weight for the RSA term and is optimized to maximize the correlation (Fig. 6, middle). EbindRSA gave an improved correlation (Fig. 6, right; R2~0.73; Pearson’s r~0.86 with p = 5e-8).
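The RSA correction and the weight optimization described above can be illustrated with a small grid scan. The sketch below is not the authors' code: the function names, the weight grid, and all numerical values are hypothetical, and the toy data are constructed so that the optimum falls at w = 0.35 purely for demonstration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def corrected_scores(ebind, rsa, w):
    """EbindRSA = Ebind + w * RSA for each peptide."""
    return [e + w * r for e, r in zip(ebind, rsa)]


def scan_weight(ebind, rsa, ln_kd, weights):
    """Return (best w, R^2 at best w), maximizing R^2 against ln(KD)."""
    def r2(w):
        return pearson_r(corrected_scores(ebind, rsa, w), ln_kd) ** 2
    best = max(weights, key=r2)
    return best, r2(best)


# Hypothetical toy values (NOT data from the study).
ebind = [-10.0, -8.0, -6.0, -4.0, -2.0]
rsa = [5.0, 1.0, 4.0, 2.0, 8.0]
ln_kd = [e + 0.35 * r for e, r in zip(ebind, rsa)]
weights = [i * 0.05 for i in range(21)]  # assumed grid: 0.00 to 1.00

best_w, best_r2 = scan_weight(ebind, rsa, ln_kd, weights)
print(round(best_w, 2), round(best_r2, 3))
```

Because w is fitted on the same peptides used to report the correlation, a scan of this kind is best read as a calibration of the score rather than an independent validation.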

Figure 6

Correlation between the binding energies and the experimental KD values. The binding scores are averaged over five independent runs (<Ebind>5runs; <EbindRSA>5runs for the RSA-corrected values) and compared to the logarithm of the KD values (lnKD). The CRM1-binders with KD values are shown as filled markers with error bars, which represent the standard deviation over the five runs. The false positives are shown as orange empty markers. In the middle, the relationship between R2 and the weight for RSA in the Ebind correction is shown. A weight of 0.35 was applied for the calculation of EbindRSA.

For comparison, several false positive sequences that can be fitted to the NES consensus but are experimentally validated as non-binders (determined by pull-down binding assay)13,33 were subjected to modeling with the same procedure. Interestingly, these false positives showed significantly higher Ebind scores, reflecting their low binding affinities at the CRM1 binding groove. Notably, peptides such as COMMD1 (164DEVKVNQILKTLSEVEES181) and ELF3 (111RLVFGPLGDQLHAQLR126) were not fitted to the right template (i.e., the lowest Ebind complex is not the class 1a-R structure). This suggests that these sequences could be energetically unstable when their backbone conformations are arranged to fit their hydrophobic residues into the CRM1 hydrophobic pockets. For the false positive peptides fitted to the right template (Fig. 7), the backbone conformation and the Φ residues may appear quite similar to those of the true positives; however, they showed inferior binding energies. In some cases, such as Cyclin D1 (Fig. 7A, middle) or FGF1 (Fig. 7C, right), the backbone conformation does not seem to be well maintained when the Φ side chains are presented into the pockets.

Figure 7

Comparison of the structural models and binding energies of the CRM1-binding NES motifs (blue) and false positive sequences (orange). CRM1 structure is colored in white. The hydrophobic residues are colored in red in the sequences and displayed as sticks in the structures. The spacer residues are represented as lines.

We expect the merit of this structure-based, energy-based method is to discriminate true positive and false positive with similar sequence patterns, by analyzing energetic differences at the CRM1 binding site via full-atom modeling. This atomic-level energetic analysis cannot be deduced by using the only sequence. In this perspective, our method would suggest novel approaches to find the CRM1-binding NES motifs. We cannot ignore the fact that the interaction between CRM1 and a whole cargo protein can be more than that of the CRM1-NES peptide10; however, it is extremely difficult to consider extra contacts between CRM1 and cargo’s whole structure which may be different depending on each cargo. Based on our previous result describing the strength of the CRM1-NES peptide interaction correlated to the nuclear export activity31, we assume that the energy prediction between CRM1 and NES peptide is a practical strategy.

To evaluate the performance, we compared our results to those of other sequence-based methods, i.e., NetNES8, NESmapper19, and LocNES20 (Figs S4–S20). Using the whole sequences of the 17 proteins in Table 1, we extracted 19 positive cases (regions annotated as NES motifs in the NESdb or validNES database with mutational evidence) and 341 negative cases (non-NES regions matching a consensus pattern). As shown in Table S3, the Ebind score performs the same as LocNES in terms of recall (both predict 17 true positives out of 19 experimentally verified NES cases). On the other hand, Ebind outperforms LocNES in terms of specificity and false positive rate. Ebind recorded 23 false positives, while LocNES predicted nearly twice as many (40 cases). NetNES showed better specificity (true negative rate (TNR): 0.988) than our method (TNR: 0.933). However, its recall (sensitivity or true positive rate (TPR): 0.474) was much lower than that of our method (TPR: 0.895). Our method therefore compares favorably with these available methods. It effectively decreases false positives while maintaining a high recall, showing the best performance with respect to the balance of precision and recall (F1 score) and effectiveness (diagnostic odds ratio, DOR).
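The rates quoted above follow directly from the confusion counts given in the text (19 positives, 341 negatives, 17 true positives for both Ebind and LocNES, and 23 versus 40 false positives). The short sketch below recomputes them; the function name is hypothetical, and the F1 and DOR values it prints are derived here from those counts rather than copied from Table S3.

```python
def metrics(tp, fn, fp, tn):
    """Standard binary-classification summary statistics."""
    tpr = tp / (tp + fn)              # recall / sensitivity
    tnr = tn / (tn + fp)              # specificity
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    dor = (tp * tn) / (fp * fn)       # diagnostic odds ratio
    return {"TPR": tpr, "TNR": tnr, "precision": precision, "F1": f1, "DOR": dor}


N_NEG = 341  # negative cases stated in the text

ebind = metrics(tp=17, fn=2, fp=23, tn=N_NEG - 23)
locnes = metrics(tp=17, fn=2, fp=40, tn=N_NEG - 40)

for name, m in (("Ebind", ebind), ("LocNES", locnes)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

With equal recall, the difference between the two methods comes entirely from the false positive count, which is why precision, F1, and DOR all move in the same direction.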

Possibility of non-binders to CRM1 among the NES-annotated regions

Databases such as validNESs15 and NESdb16 provide valuable information for NES research; however, defining CRM1-dependent NES regions is still a difficult task. The expanding NES patterns result in many false positives. Also, the lack of information showing direct CRM1 binding for many annotated NES regions prevents the development of accurate predictors from the available data sets. Most published experimental studies focused on showing that a protein is an export cargo, by deleting the whole region encompassing a candidate NES or by mutating all the suspected hydrophobic residue positions. These perturbations are drastic and may affect structural stability and cause defects in functions other than CRM1 binding and nuclear export. Therefore, one should interpret the experimental data carefully to identify the CRM1-binding NES location, and it is always possible that regions annotated as experimentally validated are not in fact functional NES motifs. Indeed, some of the annotated NES regions were found in buried (highly ordered) protein domains (Fig. 4A,C). Others can form β-strands in the middle of the segment (Fig. 4B,C), which would be rare in real NES sequences. Candidate segments that form β-strands and are located in the ordered region are observed in three cargoes: FAK (91RSEEVHWLHVDMGVSS106), MoKA (190KIQTLHLVGVNVPE203), and Sirt1 (423DEVDLLIVIGSSLKVRP439). We suggest that these segments are highly likely to be non-binders of CRM1 unless they unfold or change their conformations under specific conditions. Some cargo proteins might be exported following other events, such as binding to an NES-containing adaptor protein.

Even if a segment fits the NES consensus and also satisfies the location criteria, these criteria are still not enough to locate the real NES segments in the whole protein sequence (see yellow highlighted segments in the online table). We applied the Ebind calculation to all possible segments of the natural cargo proteins listed in Table 1. If a segment cannot form an energetically stable complex at CRM1’s NES binding groove, it is likely a non-binder of CRM1. As shown in Fig. 8, the NES candidates tend to have lower Ebind scores than the other, false positive segments. Among the seventeen cases, eleven have the NES candidate motif with the lowest Ebind, and four have the NES region with the second lowest Ebind, where the difference between the lowest and second lowest is usually marginal (less than 2). Although the data set used in the structure-based modeling is quite small, the resulting binding energy values can discriminate between CRM1 binders and false positives. This structure-based prediction method can be utilized as one of the features for finding real CRM1-dependent NES peptides in the pool of numerous false positive sequences.
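The per-protein comparison described above amounts to sorting the pattern-matching segments by Ebind and checking where the annotated candidate lands. The following is a minimal sketch of that bookkeeping; the function name, the dictionary layout, and every segment name and score are hypothetical, and the margin of 2 simply mirrors the "less than 2" difference mentioned in the text.

```python
def rank_by_ebind(segments, margin=2.0):
    """Rank segments by Ebind (lowest first) and report the candidates.

    Returns (name, rank, near_lowest) for each candidate segment, where
    near_lowest is True when its Ebind is within `margin` of the lowest score.
    """
    ordered = sorted(segments, key=lambda s: s["ebind"])
    lowest = ordered[0]["ebind"]
    report = []
    for rank, seg in enumerate(ordered, start=1):
        if seg["candidate"]:
            report.append((seg["name"], rank, seg["ebind"] - lowest < margin))
    return report


# Hypothetical segments of one cargo protein (names and scores invented).
segments = [
    {"name": "seg_A", "ebind": -21.4, "candidate": False},
    {"name": "seg_B", "ebind": -20.2, "candidate": True},
    {"name": "seg_C", "ebind": -12.7, "candidate": False},
]
print(rank_by_ebind(segments))
```

Reporting both the rank and the distance from the lowest score makes it easier to see cases where the candidate is second-ranked only by a marginal difference.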

Figure 8

Distinguishing CRM1-binding NES motifs and false positives by Ebind. Location of NES consensus patterns and their binding energies in (A) Snurportin-1 (O95149), (B) MEK1 (Q05116), and (C) FMRP (Q06787). The description of the plots is the same as in Fig. 2. The calculated Ebind scores for the important segments (pattern-matching segments which are not located in the highly ordered region and do not have a β-strand conformation in the middle; highlighted in yellow) are displayed together. The Ebind scores of the candidate NES motifs are underlined and marked in red. The classes of the consensus patterns are given in parentheses.

Conclusion

In summary, we analyzed the structural prerequisites for CRM1-dependent NES motifs, i.e., accessibility (by locating disordered/ordered regions), adapting conformation (by predicting secondary structures), and stability at the binding site (by applying structure-based modeling to calculate binding energies). The comprehensive table, which includes all the possible consensus patterns together with the disorder propensity plot, conserved domain information, and the predicted secondary structures, provides valuable information for determining or correcting the most probable NES regions.

In light of the currently resolved crystal structures of CRM1 bound to NES peptides of diverse classes, we modeled CRM1-NES peptide complex structures and calculated the stability of the NES peptides at the CRM1 binding groove. The resulting binding energies correlate well with the experimental binding affinities, and they allow us to distinguish real NES motifs from false positives even though both match NES consensus patterns. Also, we do not rely on the pattern of the input sequence, but rather use the energy function to select the most energetically favorable class template. Therefore, if multiple patterns exist in one peptide segment, this energy calculation can be used to predict the conformation the peptide adopts when it binds to CRM1. Although the method can still be improved, this study provides a starting point for predicting NES motifs by combining sequence-based and structure-based approaches. Because our method relies on template-based modeling, it is difficult to adequately model NES motifs of classes other than those of the templates. Since newly discovered NES motifs often deviate from the established consensus patterns, more structural information is needed, not only to understand new consensus patterns and the NES-CRM1 binding mechanism but also to predict NES motifs more accurately.

Methods

Extraction of the NES consensus sequences

For the cargo proteins whose CRM1 dependency is supported by LMB sensitivity data annotated in NESdb16 and validNES15, the NES consensus-matching sequence segments were extracted using a modified version of the Kosugi consensus16,20 (Fig. 2): Φ1-X1,2,3-Φ2-[^PW]2-Φ3-[^PW]-Φ4; Φ1-X2,3-Φ2-[^PW]3-Φ3-[^PW]-Φ4; or Φ1-X2-Φ2-X[^PW]2-Φ3-[^PW]2-Φ4 ([^PW] is any of the 20 amino acids except Pro and Trp; Ala or Thr can be used only once, at Φ1 or Φ2; X stands for any amino acid). If one segment, or several segments in a similar region (difference between the starting residue numbers of two segments <5), can be fitted to multiple patterns, all the possible patterns are recorded but prioritized based on the following observations: (i) the class 1a pattern is the most frequently observed class in the validated NES sets, suggesting that it interacts more preferentially with CRM1 than other classes9,16,22; (ii) in the current NES databases, class 3 sequences are as prevalent as NES motifs of classes 1c and 213; (iii) classes 1b and 1d are found in only a few NES sequences, and the majority of the class 1d sequences overlap with the class 1a pattern in the validated NES sets9,13; and (iv) the reverse (−) forms of classes 3 and 4 appear to lack β-strands to hydrogen bond with the Lys residue and may not be ideal NES motifs14. The empirical class priority is defined as follows: (i) class 1a with five Φs (c1a-5) as priority 1; (ii) class 1a with four Φs (c1a-4) and classes 1a-R, 2, 3, and 4 as priority 2; (iii) classes 1a/1c with Thr or Ala in one of their Φ1 or Φ2 positions as priority 3; (iv) classes 1b, 1d, 1c-reverse, and classes 2/3 with Thr or Ala in one of their Φ1 or Φ2 positions as priority 4; and (v) classes 1b/1d with Thr or Ala in one of their Φ1 or Φ2 positions as priority 5. The extracted regions span from one residue before Φ0 to two residues after Φ4 (or are shorter if located at the protein N- or C-terminus).
If the Φ2-Φ4 portion of the extracted region overlaps with experimental evidence (annotated as “mutations that affect nuclear export,” “mutations that affect CRM1 binding,” or “functional export signal” in NESdb, or annotated as “sites” in validNES), the region is considered a candidate NES. If not, it is deemed a false positive.
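The pattern-matching step can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes Φ is restricted to L, I, V, M, and F (the single Ala/Thr allowance at Φ1 or Φ2 is not modeled), it does not assign classes or priorities, and the function names and pattern labels are hypothetical. Each variable-length spacer is expanded into fixed-length regular expressions so that every spacing is reported, and a lookahead is used so that overlapping segments are not missed.

```python
import re

PHI = "[LIVMF]"    # hydrophobic positions; the single Ala/Thr allowance is NOT modeled here
NOT_PW = "[^PW]"   # any residue except Pro and Trp

def build_patterns():
    """Expand the three consensus forms into fixed-spacing regexes (labels are hypothetical)."""
    raw = []
    # Form 1: Phi1-X(1,2,3)-Phi2-[^PW]2-Phi3-[^PW]-Phi4
    for n in (1, 2, 3):
        raw.append((f"form1_x{n}", f"{PHI}.{{{n}}}{PHI}{NOT_PW}{{2}}{PHI}{NOT_PW}{PHI}"))
    # Form 2: Phi1-X(2,3)-Phi2-[^PW]3-Phi3-[^PW]-Phi4
    for n in (2, 3):
        raw.append((f"form2_x{n}", f"{PHI}.{{{n}}}{PHI}{NOT_PW}{{3}}{PHI}{NOT_PW}{PHI}"))
    # Form 3: Phi1-X2-Phi2-X[^PW]2-Phi3-[^PW]2-Phi4
    raw.append(("form3", f"{PHI}.{{2}}{PHI}.{NOT_PW}{{2}}{PHI}{NOT_PW}{{2}}{PHI}"))
    # Wrapping each body in a lookahead lets overlapping segments be reported.
    return [(name, re.compile(f"(?=({body}))")) for name, body in raw]

def find_consensus_segments(sequence):
    """Return (pattern label, 1-based start of Phi1, matched Phi1..Phi4 segment) tuples."""
    hits = []
    for name, regex in build_patterns():
        for match in regex.finditer(sequence):
            hits.append((name, match.start() + 1, match.group(1)))
    return hits

if __name__ == "__main__":
    # Toy input used only to exercise the function.
    for hit in find_consensus_segments("NSNELALKLAGLDINK"):
        print(hit)
```

Extending each hit by the residues around Φ0 and after Φ4, and checking its overlap with annotated evidence, would follow as separate steps.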

Calculation of disorder propensity and definition of ordered regions

The disorder propensity of the cargo protein sequences is calculated using three different programs: DISOPRED325, SPOT-disorder26, and IUPred2A27. For the DISOPRED3 and SPOT-disorder calculations, which are based on multiple sequence alignment, the uniref90_2015_0134 database is used to find homologs during the PSI-BLAST search35. In order to define ordered regions with high confidence, we applied strict cutoff values (~0.1) to decide the order/disorder borders (note that the default cutoff values for disordered regions in these three programs are ~0.5). If the disorder propensities of a residue predicted by both DISOPRED3 and SPOT-disorder are below 0.1, the residue is defined as ordered (“O”). If not, the residue is recorded as potentially disordered (“D”). The values predicted by IUPred2A are also recorded for reference. The location of a sequence segment is determined by scanning the proportion of “D” or “O” marks in the segment and its flanking residues (20 residues on each side) (Fig. S2A). If the proportion of “O” marks is more than 90% for the segment and flanking regions, the location of the segment (loc_DISO) is defined as an ordered region (“ORD”). If “D” is more than 90%, the location is defined as a disordered region (“DISO”). The other segments are considered to be located in a “boundary” region. Segments in boundary regions can be found at the ends of ordered regions, or they can lie within ordered regions where some portion (>10%) has a higher disorder propensity than the cutoff value.
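The location assignment can be sketched in a few lines of Python. This is an illustrative reading rather than the authors' code: it assumes that a window dominated by “O” marks is labeled “ORD” and one dominated by “D” marks is labeled “DISO”, that the window is simply truncated at the protein termini, and that per-residue scores from the two predictors are already available as lists. The function names are hypothetical.

```python
def label_residues(disopred_scores, spot_scores, cutoff=0.1):
    """Mark a residue 'O' only if BOTH predictors fall below the strict cutoff, else 'D'."""
    return [
        "O" if d < cutoff and s < cutoff else "D"
        for d, s in zip(disopred_scores, spot_scores)
    ]

def locate_segment(labels, start, end, flank=20, threshold=0.9):
    """Classify a segment given 0-based [start, end) indices into the label list.

    The window covers the segment plus `flank` residues on each side,
    truncated at the sequence termini (an assumption of this sketch).
    """
    lo = max(0, start - flank)
    hi = min(len(labels), end + flank)
    window = labels[lo:hi]
    frac_ordered = window.count("O") / len(window)
    frac_disordered = window.count("D") / len(window)
    if frac_ordered > threshold:
        return "ORD"
    if frac_disordered > threshold:
        return "DISO"
    return "boundary"

if __name__ == "__main__":
    # Toy scores: 60 ordered residues followed by 60 disordered residues.
    diso = [0.02] * 60 + [0.8] * 60
    spot = [0.03] * 60 + [0.7] * 60
    labels = label_residues(diso, spot)
    print(locate_segment(labels, 10, 25))    # deep in the ordered half
    print(locate_segment(labels, 55, 70))    # straddles the border
    print(locate_segment(labels, 95, 110))   # deep in the disordered half
```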

Extraction of the conserved domain information of the cargo proteins

The conserved domain information for the cargo protein sequences was extracted using the Batch CD-search tool36. Four different databases, i.e., CDD (cdd v3.16), NCBI_Curated (cdd_ncbi v3.16), Pfam (oasis_pfam v3.16), and SMART (oasis_smart v3.16), were searched with an expect value threshold of 0.01. The results were retrieved in Concise mode.

Prediction of secondary structure

Secondary structures of the cargo protein sequences are predicted by PSIPRED Version 3.2137. The uniref90_2015_0134 database is used during the PSI-BLAST search35 to find homologs. In the online table, the confidence level of the prediction is shown as a color gradient from dark (high confidence) to light (low confidence).

Relative binding energy (Ebind) prediction

Ten crystal structures of CRM1 bound to various NES peptides, including MVM-NS2 (PDB ID: 6CIT31), super PKI (unpublished data), FMRP-1b (5UWO14), SNUPN (3GB812), FMRP (5UWJ14), SMAD4 (5UWU14), HIV-Rev (3NBZ11), X11L2 (5UWS14), and CPEB4 (5DIF13), were utilized as templates. For the CRM1 part, we extracted residues 479 to 655 (numbered as in scCRM1) to reduce the computation time. For potential NES peptides, the positions from Φ0–1 to Φ4 + 2 were modeled (or a shorter segment if the sequence used in the experimental KD measurement is shorter). A given peptide sequence is fitted to the backbone coordinates of every template structure. Using the Rosetta backrub module38, the backbone conformations of the fitted NES peptide and the surrounding helices in CRM1 are sampled to generate 50 models (50,000 backrub Monte Carlo trials/steps were run for each model). Among them, the ten complex structures with the lowest energy are selected and then optimized by the Rosetta relax module39,40, which searches the local conformational space around the starting structure. The relaxation was carried out 50 times for each model (i.e., the total number of models for a given peptide sequence is 10 × 50 = 500 models) with the ‘-use_input_sc -ex1 -ex2’ flags for a more rigorous search. The backrub-modeled backbone conformation was constrained during the relaxation by applying the ‘-constrain_relax_to_start_coords’ flag. Structures of the CRM1 protein itself and of the free peptide are also modeled separately by the same process. The all-atom energy function REF15 in Rosetta v.3.9 was utilized for all calculations.

The binding energy (Ebind) is calculated as Ecomplex − Eprotein − Epeptide. The values for Ecomplex, Eprotein, and Epeptide are each the average of the lowest 10 energy values among the 500 models. For Epeptide, we used the lowest Epeptide among all the different backbone-fitted models. Among the various template-fitted models, the one with the lowest Ebind score is selected. The Ebind scores were corrected with a solvent accessibility term calculated by the NACCESS v.2.1.1 program32, which calculates the atomic accessible surface defined by rolling a probe of a given size around a van der Waals surface. To penalize cavities at the interface between CRM1 and low-affinity binders (such as the PKI double mutant), the RSA values for the hydrophobic residues at the interface (Fig. S3) were extracted and added to the Ebind scores with an optimized weight.
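The arithmetic of the Ebind score can be summarized with a short Python sketch. It is a simplified illustration under stated assumptions, not the pipeline used in this study: energies are assumed to be already collected as plain lists per template, the data layout and function names are hypothetical, and the solvent accessibility correction is omitted because its weight is not specified here.

```python
def mean_of_lowest(energies, n=10):
    """Average of the n lowest energies (all values if fewer than n are given)."""
    lowest = sorted(energies)[:n]
    return sum(lowest) / len(lowest)

def select_template_by_ebind(models, n=10):
    """Pick the template with the lowest Ebind = Ecomplex - Eprotein - Epeptide.

    `models` maps a template name to a dict with keys 'complex', 'protein'
    and 'peptide', each holding a list of model energies (hypothetical layout).
    """
    # Epeptide: the lowest averaged free-peptide energy across all template fits.
    e_peptide = min(mean_of_lowest(m["peptide"], n) for m in models.values())
    ebind = {}
    for template, m in models.items():
        e_complex = mean_of_lowest(m["complex"], n)
        e_protein = mean_of_lowest(m["protein"], n)
        ebind[template] = e_complex - e_protein - e_peptide
    best = min(ebind, key=ebind.get)
    return best, ebind

if __name__ == "__main__":
    # Toy energies with n=2 so the averages are easy to follow by hand.
    toy = {
        "A": {"complex": [-100, -98, -90], "protein": [-80, -78, -70], "peptide": [-10, -8, -1]},
        "B": {"complex": [-105, -103, -50], "protein": [-80, -80, -60], "peptide": [-12, -10, 0]},
    }
    print(select_template_by_ebind(toy, n=2))
```

Using the same Epeptide for every template means that the ranking of templates is driven only by the complex and protein terms, while the absolute Ebind values stay comparable across peptides.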