Introduction

As of March 2023, the World Health Organization reports that SARS-CoV-2, the pathogen responsible for COVID-19, has killed over 6.8 million people worldwide since the start of the pandemic1. This deadly virus infects the epithelial cells lining the respiratory tract by fusing its spike (S) protein with the Angiotensin Converting Enzyme 2 (ACE2) receptor on the host cell2,3,4. Although vaccines have been instrumental in curbing death rates, their efficacy has been observed to wane after 5–8 months and this, along with the rise of troubling new variants, creates a concern over the possibility of continuing infections of COVID-195. Given the S protein’s critical role in the initial phases of infection, it is one of the most important drug targets and so understanding its architecture and mechanics remains a relevant research topic. The S protein is thus composed of three identical chains where each chain is made up of two subunits (S1 and S2)2. While the S2 subunit forms the fusion peptide, the S1 subunit is in turn composed of four domains: the N-terminal domain (NTD), C-terminal domain 1 (CTD1), C-terminal domain 2 (CTD2) and the receptor binding domain (RBD)6,7. Binding between the S protein and ACE2 thus begins with a conformational change of the RBD domain from the “down” position to the “up” position. Previous studies investigating various aspects of the S protein’s motions and mechanics include work that identified a latch region on the S protein that appears to regulate these “down” to “up” conformations8. Another study has identified centripetal motion of the RBD domain upon binding with ACE2 along with specific residues that played an important role in those motions7. Other work has taken the approach of considering the effects of dihedral angles associated with “down” to “up” conformational changes with another study obtaining insight into the entropy of residues most amenable to binding, by similarly looking at dihedral angles9,10. Although these approaches have yielded very important information in the fight against COVID-19, a dedicated characterization of the dihedral angles themselves is still lacking that specifically probes phenomena pertinent to time-informed dihedral angle fluctuations, such as, frequencies of oscillations and identifying potential dominant frequencies throughout the S protein. Consequently, the precedent currently remains modest for the questions posed by our work which motivates these present efforts.

For this work, the objective is to characterize the frequency domain of steady-state dihedral angle oscillations of the S protein. For this reason, time regimes such as the microsecond time-scale were avoided in order to preclude non-steady-state phenomena such as sudden domain motions and instead focus on dihedral angle dynamics at smaller time scales. The approach therefore consists of analyzing dihedral angle data (phi and psi), as a function of time, obtained from molecular dynamics simulations from our previous study of the trimeric S protein (PDB ID: 6VSB)8. The RBD domain of chain A for this structure is in the “up” position. For every residue, we analyzed frequency data obtained from the time domain to look for any frequencies that have a dominant/majority presence in individual chains, among amino acids (i.e. among all ARGs), as well as the whole protein. We also curated this data to see if there were any frequencies that uniquely persisted for any of the twenty amino acids throughout the simulation period. Lastly, we attempted to identify correlations between each residue’s motility in Euclidean space and its dihedral angle fluctuation variations. Phi and psi angle data are thus analyzed separately.

Materials and methods

Molecular dynamics

The trajectory of protein motions was obtained from previous work8 which used explicit solvent molecular dynamics simulations of the SARS-CoV-2 S protein using the NAMD2 program11 to carry out the simulation for a total of 200 nsec after equilibration with trajectories being written every nanosecond. Using CHARMM-Gui12, the protein was thus explicitly solvated with TIP3P water molecules and the CHARMM36m force field was selected. Missing residues in the experimental structure were added and disulfide bonds and glycosylated sites were also included. The simulation was carried out using the NPT ensemble which maintains the number of simulated particles, pressure and temperature constant. We used the Langevin piston method to maintain a constant pressure of 1 atm and we employed periodic boundary conditions for a water box simulation volume along with the particle mesh Ewald (PME) method with a 20 Å cutoff distance between the simulated protein and the water box edge. The integration time step was 2 femtoseconds and our simulation was conducted under physiological conditions (37 °C, pH of 7.4, physiological ionic strength with NaCl ions, LYS and ARG were protonated, HIS was not).

Signal processing

Due to there being no sudden forces applied to the protein during the simulation that could result in steady state changes to the vibrational frequencies, for every residue, the Fourier transform was obtained from the phi vs. time and psi vs. time dihedral angle data (coming from the molecular dynamics simulations), using the fast Fourier transform algorithm from the JDSP library13 for the Java programming language (underlying calculation in Eq. 1). Positive absolute values of frequencies were thus calculated for the Fourier transform and the sampling frequency was once per nanosecond. This sampling frequency was deemed reasonable to avoid aliasing since information in the literature addressing protein dihedral angle timescales recognizes that, for example, chi dihedral angle changes associated with side chain rotations are on the order of nanoseconds14,15. Frequency domain data consequently spanned from 0 MHz to 496.094 MHz, just under the Nyquist/folding frequency of 500 MHz set by our fixed sampling rate (Nyquist frequency being half the sampling rate).

$$f^{o} \left( \omega \right) = \mathop \smallint \limits_{ - \infty }^{\infty } f\left( t \right)e^{ - i\omega t} dt$$
(1)

Equation 1: Fourier transform calculation to go from the time domain (t) to the frequency domain (\(\omega\)). f(t) represents the input time domain data (i.e. time domain data output from the simulation) to be transformed to the frequency domain. The exponential corresponds to the Fourier coefficients for representing the input function as a Fourier series. \(f^{o} \left( \omega \right)\) represents the output function in the frequency domain.

Data analysis

Steady-state regime of the protein’s fluctuations was confirmed by the Define Secondary Structure of Proteins (DSSP) algorithm, root mean square deviation (RMSD) and radius of gyration (RoG) metrics (Fig. 1)16. In order to identify frequencies that might be particularly abundant in the S protein, each time a peak appeared at a given frequency in the Fourier transform for an individual residue, it was tallied (the location of a peak at a given frequency in the individual residue’s spectrum being indicative that the dihedral angle possessed that frequency as a component of its oscillations). Peaks were identified as local maxima in the frequency domain data and so the DC component frequency at 0 MHz (typically a large value relative to the spectral peaks in our data) was disregarded. Such a tally was carried out as three independent analyses: 1) at the amino acid level, 2) at the individual chain level and 3) for the whole protein. Tally information was then plotted to visualize the abundance of the frequencies for each of the three cases. Next, root mean square fluctuations (RMSF) were calculated for every residue, and in order to quantify the tendency of dihedral angles to step outside of their respective mean (potentially a foreshadow to protein structural changes), population standard deviations were calculated for both phi and psi for each individual residue over the simulation period. These two metrics were then plotted (RMSF vs. phi/psi standard deviation) to see if there were any correlations observed between the two quantities. The standard deviations for phi and psi were then separately plotted as histograms to infer the spread of this variation. Lastly, in order to identify if any frequencies were unique to an amino acid (i.e. unique to, and persistent across all LYS in the protein), we carried out SQL inner table joins of the frequency data where peaks were specifically present, organized by amino acid (the table joins being responsible for identifying which frequency values were present across all the data for the amino acid being analyzed i.e. across all LYS). The complete data set is found in the electronic Supplemental Information material available at https://osf.io/n7tyg/.

Figure 1
figure 1

DSSP versus time, RMSD versus time and radius of gyration vs. time plots confirming steady-state regime of chain A. (a) DSSP versus time graphic of residues I100–Y200 for 60 nsec–200 nsec, for brevity, showing unchanging secondary structure over time (yellow highlights are beta strands and green highlight is alpha helix); data shown in increments of 10 nsecs, (b) RMSD versus time plot over the same 60 nsec–200 nsec displaying stable RMSD value over time, (c) radius of gyration vs. time for the same 60 nsec–200 nsec showing stabilized value over time (i.e. protein does not deform/change shape over time). Similar steady-state behavior was observed for chains B and C.

Results and discussion

Tallying all of the frequencies across the spectrum revealed that the most abundant frequency, throughout the entire protein, was 42.969 MHz, for both phi and psi dihedral angles, and the frequencies of 62.500 MHz and 27.344 MHz were observed as the second and third, respectively, most prevalent frequencies likewise for both phi and psi. Secondly, the profile of the plotted tally counts also showed that there was a regular alternation of high counts followed by low counts along the frequency spectrum and this difference in counts gradually became less pronounced as the frequency increased. This profile behavior was also observed for the plotted tally counts for just the individual chains, and to a lesser degree, for the amino acid level as well, within a given chain (see Fig. 2). The third observation was that for amino acids that had a relatively low abundance in the protein (HIS, MET and TRP), at the amino acid level within any given chain, it was observed that there were frequencies at which the dihedral angles (both phi and psi) never oscillated (see Fig. 3). There were between 9 and 15 of each of these three low abundance amino acids per chain, where each chain had a total of over 1000 residues thus establishing the low representation/abundance of HIS, MET and TRP in the protein. These un-accessed frequencies consequently appear as gaps in those plotted tallies. At the level of the whole protein (i.e. when tallies from all chains were combined), however, HIS in particular was observed to have populations of frequency counts that covered the whole spectrum for both phi and psi (Fig. 4). MET and TRP on the other hand still had un-accessed frequencies/gaps at the whole protein level for the phi dihedral angle, which were found at the lowest frequencies, but were able to cover the entire spectrum with psi (Fig. 4). Interestingly, despite the fact that the three constituent chains are identical in sequence, the un-accessed frequencies were not identical across the three chains.

Figure 2
figure 2

Tallied frequencies for all residues across all three chains for: (a) the phi dihedral angles showing bar chart in top panel and line plot in bottom panel (for ease in seeing profile), (b) the psi dihedral angle showing bar chart in top panel and line plot in bottom panel (for ease in seeing profile). (c) Tallied phi frequencies for all residues in the A chain and d) tallied phi frequencies for all threonines in the A chain (as an example of tally profile at the amino acid level).

Figure 3
figure 3

Frequency tallies showing un-accessed frequencies for: (a) all histidines in the A chain, phi and psi in the top and bottom panels respectively, (b) all methionines in the B chain, phi and psi in the top and bottom panels respectively, (c) all tryptophans in the C chain, phi and psi in the top and bottom panels respectively.

Figure 4
figure 4

Tallied frequencies across all three chains for: (a) all histidines, phi and psi tallies in the top and bottom panels respectively, (b) all methionines, phi and psi tallies in the top and bottom panels respectively, (c) all tryptophans, phi and psi tallies in the top and bottom panels respectively.

For each individual chain, plotting all of the residue’s RMSFs as a function of the standard deviations of their psi dihedral angles revealed that there was a larger quantity of residues with relatively large RMSF values corresponding to the larger values of the psi standard deviations (over 90°) compared to phi (Fig. 5). This was in contrast to the results observed for the phi standard deviations where there were fewer residues at these higher values of the phi standard deviations and what residues were present, they had relatively lower RMSF values (Fig. 5). The identity of the residues for both of these cases for standard deviations greater than 90° is thus shown in Table 1, specifically, for those residues with the top twenty RMSF values (phi, however, consistently did not have a full set of 20 amino acids at this upper range of standard deviations). Histogram inspection of the distribution of the values of the standard deviations, for both phi and psi, also revealed that both dihedral angles had a positive skew profile in their values (Fig. 6). Looking at each amino acid throughout each individual chain (i.e. all LYS in chain A, then in chain B, then chain C separately) revealed that there were no frequencies whatsoever that persisted across, or were unique to, any given amino acid throughout the simulation.

Figure 5
figure 5

RMSF versus dihedral angle standard deviation for: (a) chain A, phi and psi in the top and bottom panel respectively, (b) chain B, phi and psi in the top and bottom panel respectively and (c) chain C, phi and psi in the top and bottom panel respectively.

Table 1 Identity of residues with the top 20 RMSF values for dihedral angle standard deviations over 90° for: a) chain A phi, b) chain A psi, c) chain B phi, d) chain B psi, e) chain C phi and f) chain C psi.
Figure 6
figure 6

Dihedral angle standard deviation distributions showing positive skew profile for: (a) chain A, phi and psi in the top and bottom panel respectively, (b) chain B, phi and psi in the top and bottom panel respectively and (c) chain C, phi and psi in the top and bottom panel respectively. All histogram distributions done with 50 bins.

Several groups have previously identified residues important to either drug binding, allosteric effects, or both, for the S protein. According to work by Alvarado et al., LYS369 and PHE377 appeared to have important allosteric properties since binding of luteolin to these residues in the S protein induced an intense allosteric effect17. Verkhivker et al. reported that GLU406, ASN439, LYS417 and ASN501 were centers of allosteric interactions implicated in mediating long-range communications in the binding of the S protein with ACE218. Of these residues, LYS417, ASN501 along with the additional residue of GLU484, were further identified by Verkhiker et al. to be important interacting centers that provide mutants at these positions with better binding affinity to ACE2. Xue et al. additionally identified PHE329 and PHE515 as residues critical to centripetal motions of the RBD domain upon binding with ACE27. Drew and Janes also identified a druggable pocket and thus reported the amino acids lining this pocket19. Our own data showed that the dihedral angle vs. time profiles (phi, psi or sometimes both) for these very same residues, which were identified by the above previous groups, happened to have a distinctive profile that consisted of either a dihedral angle rapidly assuming a value far from the baseline, then immediately returning back to baseline, or the baseline would change very suddenly (Fig. 7). We note there were no distinctive frequency domain profiles for these residues, however. These unique time domain profiles were in contrast to those observed for most other residues whose dihedral angle values instead oscillated within a single fixed bandwidth of dihedral angle values (Fig. 7). Consequently, we present a list of some other residues which were identified in our dynamic data as having similar profiles, for further investigation as potential residues of interest to the drug design community (Fig. 8 and Table 2). Solvent accessible surface area (SASA) data for each residue is also provided to give preliminary insight into each residue’s potential accessibility to a drug-like compound (SASA data reported is for the conformation/trajectory-snapshot with the greatest SASA total for the whole protein). The complete list of suggested amino acids is in the electronic Supplemental Information. For completeness, we also present residues known to be important mutation sites which also happened to have distinctive time-domain profiles (Table 3)18,20. We also note that the above residues reported from the literature with distinctive time-domain profiles were not the same residues that had large dihedral angle standard deviations, suggesting that a high degree of angle rotatability does not necessarily foreshadow or correlate with allostery or pocket formation as originally speculated. Lastly, it is important to mention that although our workflow does not presume to replace existing methods to identify allosteric sites, binding pockets, etc.… our data might provide a motivation for possibly including dihedral angle time-domain data as a consideration in identifying allosteric sites, binding pockets, etc.…

Figure 7
figure 7

Dihedral angle values versus time for residues identified in literature as either important to allostery or the formation of a druggable pocket: (a) PHE 37717 chain C phi, (b) PHE 3297 chain B psi, (c) ASP 19819 chain C psi, (d) ASN 43918 chain A psi. Dihedral angle values versus time for most other residues: (e) PRO 39 chain A phi, (f) ILE 332 chain B psi.

Figure 8
figure 8

Dihedral angle versus time data for additional residues from our data set with time domain profiles similar to those residues previously identified as relevant to allostery and drug pocket formation: (a) ASN 211 chain A phi, (b) LEU 858 chain A psi, (c) THR 124 chain B phi, (d) THR 1077 chain B psi, (e) PHE 347 chain C phi, (f) PRO 82 chain C psi.

Table 2 Additional residues with time domain profiles potentially indicative of some relevance to additional allostery or drug pocket formation.
Table 3 Residues important to SARS-CoV-2 mutations with distinct time-domain profiles.

Conclusions

Given the importance of the S protein as a drug target for COVID-19, in this study, we characterized the oscillatory behavior of the protein’s phi and psi dihedral angles in order to identify the dominant frequency components of oscillations. The findings suggest that, for the SARS-CoV-2 spike protein: 1) both phi and psi dihedral angles are dominated by a few discrete relatively low frequencies in the 23–63 MHz range with 42.969 MHz being the most prevalent frequency sampled by the oscillations, 2) the protein appears to possess some sort of “harmonic series” type property with the frequencies sampled, as observed by the alternating high-low–high profile of the tallied frequencies suggesting the protein is more “in tune” with some frequencies compared to others (we refer here to the harmonic series of acoustics), 3) frequency components sampled by dihedral angles may be a function of position in the primary structure since amino acids in relatively low abundance in the protein never oscillate at certain frequencies in contrast to relatively abundant amino acids (i.e. the more positions an amino acid is found in, the more frequencies it can sample), 4) sampled frequencies also appear to be impacted by the micro-environment from non-bonded atoms since even though the three protein chains are identical in sequence, un-accessed frequencies for low-abundance amino acids were not identical across the three chains, possibly due to the chains’ spatial orientation differences given one chain is in the “up” position, 5) highly motile residues in Euclidean space also appear to have a wider degree of psi angle rotatability as observed by the positive correlation between high RMSF values and large standard deviation values for psi compared to phi and 6) specific residues identified in the literature as constituting the inside of a druggable pocket as well as other residues identified as allosteric sites, or important mutation sites are observed in our data to have distinctive time domain profiles. This last observation motivated us to propose other residues from our dynamic data with similar time domain profiles, which could possibly be associated with additional pocket formation or allostery phenomena since it has been observed that druggable pockets, for example, are dynamic and transient in nature21. Naturally, future dedicated work would be needed to confirm the significance of our proposed residues as well as whether distinct time-domain profiles are correlated with allostery, mutation sites, etc.… for generalized cases. In all, our present work therefore suggests that there is structured behavior hidden in the seemingly random time domain dihedral angle fluctuation data of the S protein. We postulate, however, that many of these observations were likely made possible due to the large size of the protein which permitted a wide sampling of frequency components in the dihedral oscillations for both phi and psi. Nevertheless, further studies are needed to confirm if this behavior is universal to all proteins, regardless of size, or if it is unique to the SARS-CoV-2 S protein. Additional studies involving a plethora of different proteins with a wide size distribution would also be needed to determine the physical significance of the most prevalent frequency components (high or low) as they relate to protein size and composition, for instance. We also note that additional studies would be needed to identify if the frequency domain (such as the DC component in our data) is impacted by actual physics in the system, such as the effects of the quantity of ions and water and temperature, or if phenomena, such as the DC component, are purely artifactual (we call to mind how vertical measurements due to gravity in accelerometers such as those found in smartphones can cause large DC components in those systems as well). Our present work would thus serve as a reference for such a future study.