LIM domain-wide comprehensive virtual mutagenesis provides structural rationale for cardiomyopathy mutations in CSRP3

Cardiomyopathies are a severe and chronic cardiovascular burden worldwide, affecting a large cohort in the general population. Cysteine and glycine-rich protein 3 (CSRP3) is one of key proteins implicated in dominant dilated cardiomyopathy (DCM) and hypertrophic cardiomyopathy (HCM). In this study, we device a rapid in silico screening protocol that creates a mutational landscape map for all possible allowed and disallowed substitutions in the protein of interest. This map provides the structural and functional insights on the stability of LIM domains of CSRP3. Further, the sequence analysis delineates the eukaryotic CSRP3 protein orthologs which complements the mutational map, but provide limited information of amino acid exchanges. Next, we also evaluated the effect of HCM/DCM mutations on these domains. One of highly destabilising mutations—L44P (also disease causing) and a neutral mutation—L44M were further subjected to molecular dynamics (MD) simulations. The results establish that L44P substitution affects the LIM domain structure by altering secondary structure and due to loss of hydrophobic interaction with Phenylananine 35. The present study provides a useful perspective to our understanding of the role of mutations in the CSRP3 LIM domains and their evolution. This study provides a novel computational screening method for quick identification of key mutation sites for specific protein structures that can reduce the burden on experimental research.

www.nature.com/scientificreports/ position was individually substituted by 19 other amino acid residues. This mutational landscape map can be used to identify potentially deleterious mutations that have not been experimentally verified earlier. This approach coupled with sequence and structural analysis expands our overall knowledge on disease causing mutations and their effects on protein structures, which traditionally can be challenging to achieve by experimental approaches.
To consolidate this, we selected a highly destabilising mutation-L44P and the neutral mutation-L44M. Many studies have mentioned role of L44P mutation in cardiomyopathy disease 4,18,19 . Mehroz Ehsan and group showed that L44P mutation leads to CSRP3 depletion by proteasome action 4 . Separately, a LIM domain study in chicken CSRP1 has reported that Leu 44 is a conserved residue involved in packing of the CCHC and CCCC modules of LIM1 via a hydrophobic interface 15 . Our analysis of the disease-causing mutation-L44P showed that this is a critical residue for the structure. A substitution to proline at this position leads to alteration of the domain's hydrophobic core and hydrogen bonding interactions. We further performed MD simulations, residue conservation analysis and created correlation maps to infer the significance of this leucine position (L44). Subsequently, we also highlight the conservation of CSRP3 protein in representative eukaryotes, and differences in LIM1 and LIM2 that augment the disease-causing mutations and non-overlapping functional roles of the two LIM domains.

Results
CSRP3 structure, SNPs, and stability. CSRP3 disease mutations are deleterious. HCM and DCM mutations of CSRP3 obtained from the HGMD and large-scale study by Roddy and group 16 were used in the study. This collection sums to a total of seventeen missense mutations in CSRP3 (Fig. 1A). These mutations were found to be frequent in LIM domains, predominantly in the LIM1 (Fig. 1B). For mutations A50T, S46R, A51D, C58G 4,18 , L44P 18,19 , R64C, R64H, T47M, Y57S and Y66C, the NMR structure, PDB ID: 2O10 (total 19 conformers) was utilized and for V127I mutation, PDB ID: 2O13 (total 20 conformers) was selected. For the sequence-based approach, the Polyphen2 and PROVEAN sequence predictions showed that the majority of reported HCM/DCM mutations are destabilising. Of these 17 mutations, Polyphen2 prediction revealed that 14  Mapping missense mutations on the CSRP3 structure revealed that most mutations lie in the LIM1 domain. FoldX analysis highlighted that these mutations were either destabilising or neutral (Fig. 1C). Further analysis showed 6 of these 11 (54.55%) mutations as destabilising and 5 (45.45%) of these as neutral mutations. Since LIM domains used in this study are NMR solved models, to rule out our conformation selection bias on FoldX prediction, we performed the same analysis on the remaining conformers as well. Our prediction was found to be similar in all the conformers used and did not have a bias towards representative conformer selection. Therefore, we selected the first conformer in each case as the best representative structure for further analysis. Interestingly, two known important mutations L44P 18,19 and V127I 16 were highly destabilising mutations (Fig. 1C,D).
Mutational landscape reveals highly conserved and substitutable positions. As previously identified LIM domains are the hotspots of HCM/DCM mutations. We carried out all-versus-all substitution (original amino acid replaced with 19 other amino acid residues) at each position of LIM domains. This approach gave a comprehensive map of allowed and disallowed substitutions at every position of the LIM domain. Further, as a control, we mutated a representative set of native residues in LIM1 domain to themselves in order to rule out possibility of an artifact. This analysis showed that self-mutation does not alter energy dramatically and all selfsubstitutions were neutral (Supplementary Information Table S1). Figure 2A Fig. S1A). In addition, to understand the effect of each mutation on the total compactness of the LIM domains, we calculated the solvent accessible surface area (SASA). It was evident that surrounding regions of prominent residue positions showed more changes in SASA as compared to the position in question itself (Supplementary Information Fig. S2A). Surprisingly, LIM2 domain was less tolerant than LIM1, as 11/57 (19.29%) residue positions showed all 19 substitutions as deleterious (Fig. 2B). In addition, there were 11/57 (19.29%) residue positions with relaxed criteria (additional single substitution) as deleterious. Moreover, we observed only 5/57 (8.77%) residue positions with all substitutions as neutral ( Supplementary Information Fig. S1B). The map revealed that, besides Cysteine and Histidine at zinc-binding site, many other residues like Leucine at position 44 are essential for LIM1 ( Fig. 2A). Similarly, SASA was calculated for LIM2 domain. Figure S2B highlights that many positions in LIM2 domain show significant changes in solvent accessibility. To summarize, the above www.nature.com/scientificreports/ results indicate that LIM domains are composed of both conserved and substitutable residues that allow it to have fixed topology and a wide range of functional interactions simultaneously. The mutational map underscores that all amino acid substitutions at residue position 44 were destabilising except hydrophobic residue L44M ( Fig. 2A). Since the previous results in this study highlighted that L44P was one of the highly destabilising HCM/DCM mutations, we focused on position 44 for further analysis. Methionine was also observed to be a suitable substitution at this position in LIM1 domain apart from the conserved leucine. To further investigate the roles of L44M and L44P in the structural stability, we carried out MD simulations of WT, L44P and L44M for a 100 ns simulation interval. MD simulations provides information on changes in protein conformation at given conditions, such as temperature and pressure, across time. We introduced L44P and L44M mutations using Maestro (Schrödinger Release 2020: Maestro, Schrödinger, LLC, New York, NY, 2020).
Correlation map analysis of LIM1 highlights differential dynamics of WT, L44M and L44P. We carried out cross-correlation dynamics in the WT, L44M and L44P mutant trajectories. The inter-residue crosscorrelation map revealed that WT shows visible differences compared to L44P, whereas the cross-correlation map of WT and L44M bear higher resemblance (Fig. 3). Focused analysis for residues 43, 44 and 45 shows substantial variations in the cross-correlation. It should be noted that mutation of leucine to proline at position 44 imposes a greater loss of contacts in distant regions of LIM1 domain as compared to the wildtype and L44M replacement (Fig. 3). The inter residue contact of WT, L44M and L44P were projected on two-dimensional map ( Supplementary Information Fig. S3). We observed no stark difference in the contacts of WT, L44M and L44P trajectories. These findings suggest that L44P mutational effect on residue interactions is not extreme in the vicinity of L44 but rather has long range effects in the LIM1 domain.  Figure 4A reveals that extended β-strand conformation in the region 42-44 is lost in L44P mutation, while is still intact in the WT throughout the 100 ns timeline. In addition, there is a significant loss of isolated bridge conformation at position 43 in the L44P trajectory. Figure 4A also highlights the significance of the more acceptable substitution of L44M. In the case of the L44M neutral mutant, whilst the extended β-strand conformation is disrupted, the isolated bridge conformation is preserved throughout the trajectory timeline. In LIM1 domain structure, hydrophobic and hydrogen bond interactions are affected in the L44P substitution (Fig. 4B). The Leu 44 (WT) residue forms hydrophobic interactions with Phe 35, Val 49, and Ile 56. The hydrophobic interactions were observed to be retained in L44M too. In contrast, when Leu 44 is replaced with Pro, hydrophobic interactions with Phe 35 are lost. Since hydrophobic interactions form the core of the compact structure of LIMs, the loss of these affects the stability of LIM1 domain. Leu 44 maintains a striking balance between hydrophobic and hydrophilic interactions that get lost due to L44M or L44P substitution. Secondary structure elements (SSE) and intramolecular hydrogen bonds are hallmarks of protein structure stability, and the analysis shows that mutations affect these parameters. Ultimately, though the LIM1 mutational landscape map suggests that L44M substitution is feasible, the side chain sulphur of methionine may affect the inter-residue hydrophobic interactions and torsional constraints. The statistical parameters such as RMSD, RMSF and ROG showed that L44 substitution has a localised effect and doesn't lead to misfolding of the protein structure (Supplementary Information Figs. S4-S6, Table S2). The results indicate that RMSD in all trajectories (WT, L44P and L44M) stabilise after 70 ns. In addition, WT, L44P and L44M demonstrated similar RMSF fluctuations indicating that these substitutions exert their effect in the neighbourhood and not in the whole protein. Taken together, these results suggest that LIM1 domain does not undergo any unfolding through the simulation period. However, there is loss of secondary structural conformation and essential hydrophobic interactions that occur in the neighbourhood of Leu 44 as a consequence of mutation to either Pro or Met.
CSRP3 sequence conservation and ancestry. CSRP3 sequence orthologs complements the mutational map. Earlier stability result indicates that LIM1 and LIM2 domain have differential mutational tolerance. We carried out sequence conservation analysis in eukaryotes to specify LIM1 and LIM2 domains in CSRP3 protein. Orthologs of human CSRP3 protein from 30 representative eukaryotes were downloaded from NCBI and subjected to multiple sequence alignment (MSA) for sequence conservation analysis. These eukaryotes varied from fishes, amphibians, birds to primates (Supplementary Information Table S3). Alignment of these sequences highlighted the strongly conserved nature of the CSRP3 protein (Fig. 5A). The conserved positions have also been mapped on the three-dimensional structure of LIM1 and LIM2 (Fig. 5B). Mutational map and sequence map decipher structurally critical residues in the LIM domains. The majority of the residue substituted in eukaryotes CSRP3 LIMs were neutral on the mutational landscape (see Figs. 2A,B, 5A). In total, 31/48 (64.58%) substitutions were neutral, 14/48 (29.16%) belong to the destabilising category, and 3/48 (6.25%) of stabilising category. MSA analysis revealed that many of these substitutions were observed in fishes' taxa, namely K15R, I23M, S54A, R64K, P121S, K125Q, G136A, T143S, A148L, V170A and K174R. Interesting, many substitutions like K138Q, E155D, D161E and N175S were exclusive to Western-clawed frog. Importantly, Leucine was conserved in all sequences at position 44, indicating its importance in the LIM domain architecture. This analysis identified that many of the substitutions were localised to a particular taxon, therefore to assess evolution of CSRP3, we performed phylogeny analysis on the named sequences. www.nature.com/scientificreports/ CSRP3 evolution follows ancestral timescale. Maximum Likelihood (ML) phylogeny construction of representative eukaryotes revealed that CSRP3 evolved according to the evolutionary timescale in most of them (Fig. 5C). Fishes, birds, marsupials, rodents, placentals, carnivores and primates clustered separately and this signifies that CSRP3 evolved in the evolutionary timescale. However, it should be noted that the Burmese python clustered outside the reptiles' clade (Fig. 5C). This anomaly we believe may have risen due to uneven distribution of sequences across the reptile family. In addition, we separately explored the phylogeny of LIM1 and LIM2 across these representative eukaryotes to see the closeness in these domains ( Supplementary Information Fig. S7). In the phylogeny, LIM1 and LIM2 form two distinct cluster and the separation indicates that LIM1 and LIM2 have evolved distinctly.

Discussion and conclusions
Cardiomyopathies are predominant cardiovascular diseases primarily caused by mutations in sarcomeric proteins that affect muscle contraction-relaxation activity 11,12 . Hypertrophic cardiomyopathy and dilated cardiomyopathy are highly prevalent and estimated to affect 1:500 and 1:250 individuals among the general population 11,12 . Majority of the therapeutic efforts are directed towards providing only a symptomatic relief to the patients through administration of beta-and calcium blockers, blood thinners and heart rhythm drugs 24 . Mutations in CSRP3 protein are reported to cause HCM and DCM diseases 4,7,9,10 . CSRP3 is a scaffolding, dual role mechanosensor protein that shuttles between the nucleus and cytoplasm in cardiac myocytes 1,2 . It has been shown to interact with multiple proteins such as actin-binding proteins, titin-binding proteins and myogenic transcription factors 6-8 . www.nature.com/scientificreports/ Previous studies have shown that mutations in key positions affect CSRP3 protein-protein interactions and causes protein depletion mediated by proteosome action (for example: L44P, C58G and S54R/E55G) which resulted in hypertrophic cardiomyopathy in mice models 4,19 . Majority of the reported mutations lie in the LIM1 domain of CSRP3 protein 16,19,25 . Therefore, it is imperative to conduct a global study of mutational effects across the LIM domains. Seventeen mutations from HGMD database and a published study 16 were selected and 11 of these were predicted to be deleterious. We narrowed down to 11 mutations that were found to be specific to the LIM domains. Mutational landscape mapping of these 11 mutations through FoldX analysis revealed that 2 of these were highly destabilising, 4 were slightly destabilising and 5 were neutral. Out of the two highly destabilising mutations, L44P 18,19 and C58G, we focussed on structural importance of L44 position as this has been overlooked by several groups.  Protein structural stability and its evolutionary sequence conservation though appear distinct but in fact are entangled tightly with each other. While structural stability has biophysical constraints like stability, solubility, aggregation, interactions and function, that dictate evolution of protein, nevertheless it is natural selection that guides mutations, impacting the biophysical properties of protein 26 . A protein's amino acid conservation signifies the importance of residue position and communicates its localised evolution 27 . We performed phylogenetic analysis on orthologous sequences of CSRP3 from representative set of eukaryotic species. It was evident that LIM1 and LIM2 have evolved distinctly as they formed two distinct clusters 13 . Computational methods have accelerated translational and clinical research. We define a rapid in silico protocol that predicts critical residues for structural stability by creating a mutational landscape map for all possible substitution in the protein structure of interest. This approach adds a new perspective to cardiomyopathy structural study and CSRP3 in particular. Our integrative approach targets critical selective residues from a cohort of clinically important mutations. In the mutational landscape, at position 44, all amino acid substitutions were found to be destabilising except hydrophobic residue L44M. Using consensus from stability analysis, MSA and MD simulations, we demonstrate that leucine 44 is essential for LIM stability, and its substitution by other residues or even similar charge/hydrophobicity is intolerant, thereby distorting torsional angle constraints and intra-domain interactions. Together, topology and sequence conservation coupled with MD analysis can reveal the role of critical residues that are challenging to achieve by experimental methods. It is worth noting that this procedure is limited by the accuracy of FoldX. Protein stability predicted by FoldX is sensitive to the quality of crystal structure used in the analysis. Also, functional mutations are different from structural mutations that are generally missed by stability analysis 28 .
In conclusion, the methodology presented in this study provides a blueprint for targeting essential mutations in proteins that are key for structural stability pertaining to disease phenotypes. For example, L44P was identified as a key position for structural stability in CSRP3 involved in HCM/DCM. L44P substitution alters the packing of the CCHC and CCCC modules of LIM1 by affecting hydrophobic interface. Our analysis shows that Leu 44 hydrophobic interaction with Phe 35 is lost in L44P substitution where Phe 35 was crucial in connect CCHC and CCCC zinc binding motifs. Such a method can be employed as a prerequisite for future research on genetic analysis of important diseases such as Huntington's disease, cystic fibrosis, and other neurodegenerative disorders.

Materials and methods
Statement on data. Authors from this study reporting experiments on human data, human genome data and/or the use of human tissue samples confirm that all experiments were performed in accordance with the relevant guidelines and regulations.

Selection of HCM/DCM point mutations and prediction of deleterious mutations. The Human
Gene Mutation Database (HGMD) 25 database and the mutations reported from an earlier study 16 were used to retrieve cardiomyopathy point mutations observed in the CSRP3 gene. The above cardiomyopathy diseasecausing mutations list was checked whether a point mutation is expected to be benign or damaging. For this, we utilised the Protein Variation Effect Analyzer v1.1 (PROVEAN v1.1) and Polymorphism phenotyping 2 (Poly-phen2) algorithm tools 29,30 . These methods rely on either sequence information or structural information, or both, to predict the functional impact of SNPs.
Sorting tolerant from intolerant. The PROVEAN is a generalised trained computational tool to predict the functional effects of single or multiple amino acid substitutions in protein sequences 29 . The CSRP3 protein sequence, alongside the previously obtained cardiomyopathy mutations, were uploaded to the server, using the default settings. Mutations were assigned neutral or deleterious based on their alignment score value of more or less than − 2.5. PolyPhen 2 (Polymorphism Phenotyping v2) uses eight sequence-based and three structurebased predictive features to predict the impact of protein sequence variants 30 . It utilises naive Bayes posterior probability to provide accurate predictions with three possible outputs, namely-probably damaging, possibly damaging or benign. The CSRP3 protein sequence and cardiomyopathy SNPs were submitted as queries to the polyphen2 server. The outcome levels were qualitatively assigned benign, possibly damaging and probably damaging from the default output.
Structural stability of disease mutations. We used FoldX 4.0 23 for the stability analysis of cardiomyopathy causing mutations in CSRP3 protein. FoldX can estimate the effect of single substitutions on protein structure stability. So far, only LIM domains of CSRP3 have NMR structures available, so we selected the LIM domains (PDB IDs 2O10 and 2O13) for analysis 31 . The first conformer from the NMR ensemble was selected for each domain. RepairPDB command was executed for removing bad torsional angles from the original wide-type (WT) structure. Five replicates were generated for each cardiomyopathy mutation using the BuildModel command. The average ΔΔG of five output model structures was checked for stability analysis. Details of calculating ΔΔG are described in the supplementary file. FoldX values were corrected using the equation ΔΔG FoldX = − 0.078 + 1.14ΔΔ G Experimental formulated by Dan Tawfik group 32 . A negative ΔΔG value implies stabilising mutation, and a positive ΔΔG value depicts destabilising mutations. These were categorised into seven groups based on corrected ΔΔG value as follow: (i) highly stabilising (ΔΔG ≤ − 3 kcal/mol); (ii) stabilising (− 3 kcal/mol < ΔΔG ≤ − 2 kcal/mol); (iii) slightly stabilising (− 2 kcal/mol < ΔΔG ≤ − 1 kcal/mol); (iv) neutral (− 1 kcal/mol < ΔΔG ≤ 1 kcal/mol); (v) slightly destabilising (1 kcal/mol < ΔΔG ≤ 2 kcal/mol); (vi) destabilising (2 kcal/mol < ΔΔG kcal/mol ≤ 3); and were mutated to all other 19 amino acid residues (1083 = 57 residues × 19 possible mutations per residue) using FoldX 4.0 23 . FoldX was preferred to accomplish this high throughput task due to its relatively high accuracy among fast algorithms. The same FoldX procedure as mentioned above was followed in this case too including the mutations stability categorisation protocol. Separately, as a control, a representative set of amino acids in LIM1 domain were self-mutated to rule out the possibility of artifact in this comprehensive virtual mutagenesis protocol.

Molecular dynamics simulations of WT and mutant structures. Protein preparation. The first
conformer of CSRP3 LIM domain from the NMR structure (PDB ID 2O10) was used for structural analysis. Mutant structures (L44P and L44M) were generated from the original structure in the Maestro package (Schrödinger Release 2020: Maestro, Schrödinger, LLC, New York, NY, 2020). All structures (WT and mutants) were minimised at pH 6.8 using PROPKA from Protein Preparation Wizard. Each structure was restrain minimised using the OPLS3e force field 33 .
Protein solvation. Each restrain minimised structure was solvated with the TIP3P solvent system in the System builder from Desmond module of Schrodinger 34 . Orthorhombic box shape was used for boundary conditions having a buffer distance of 10Å and the box volume was minimised. The system was neutralised with either K + or Cl − ions, and additional 150 mM KCL salt was added. OPLS3e force field 33 was assigned for the run. The output generated by the System Builder was used for Molecular Dynamics production run using the built in Molecular Dynamics package.
MD simulations. The default relaxation protocol was followed for the solvated system from the previous step. After relaxation, production MD was executed in NPT constraint parameters using the OPLS3e force field 33 . The default settings of RESPA integrator 35 (2 femtoseconds time step for bonded or near non bonded interactions and six femtoseconds for far non bonded interactions) were incorporated for the simulation. The nose-Hoover thermostat algorithm was used to keep the temperature at 300 K 36 . Similarly, the pressure was kept at 1 bar using method Martyna-Tobias-Klein method 37 . The production MD was simulated for 100 ns in triple replicates.
Correlation analysis. Correlation analysis is a crucial method in MD analysis. This method can provide information about the impact of the amino acid substitution on the protein dynamics and specify which residues are involved in the structural changes and their role. Cross-correlation maps of the residues' motion were used to identify the regions moving in or out of phase in the MD trajectory (also see Supplementary information "Materials and methods" section). An inter-residue cross-correlation based on the number of contacts rather than molecular fluctuations was also analysed. The value of residue pair can vary from − 1 (completely anticorrelated motion) to + 1 (completely correlated motion).
Simulation event analysis. MD trajectories were analysed using simulation interaction diagram (SID) and simulation event analysis (SEA) modules in the Desmond package. The entire duration of simulation time was considered for all analyses. The root mean square distance (RMSD) and root mean structure fluctuation (RMSF) were calculated for each frame for the protein backbone by aligning them to the reference frame (0th frame) in the SID package. In similar way, Radius of Gyration (ROG) was calculated for each trajectory using the SEA package. Solvent accessible surface area of WT, L44P and L44M was obtained for each trajectory. The evolution of the SSE as a function of time along the MD simulation was generated by STRIDE 38 in VMD software 39 .
Sequence analysis and phylogeny of CSRP3 and LIM domains. We used the query term "CSRP3" in the NCBI Gene database (https:// www. ncbi. nlm. nih. gov/ gene/). Carefully selected CSRP3 gene orthologs from different organisms were selected using the Orthologs selection from the annotation pipeline link. The batch Entrez query server (https:// www. ncbi. nlm. nih. gov/ sites/ batch entrez) was utilised to download protein sequences for these gene IDs. The resulting sequences were aligned using MUSCLE 40 in the MEGA-CC software package 41 using the default parameters. MEGA-CC is an automated tool for molecular evolutionary genetics analysis. The generated multiple sequence alignment (MSA) was used to construct a phylogenetic tree and Dayhoff matrix model with Gamma distributed (G) Rates among Sites option was selected. Phylogenetic construction was performed using the Maximum Likelihood (ML) method, and the reliability of a phylogenetic tree was calculated with 1000 bootstrap iterations. The phylogenetic trees (bootstrap consensus) were visualised and imaged using iTOL 42 . We also explored the clustering of LIM domains of CSRP3 in the representative eukaryotes. The same protocol was followed for only LIM specific sequences. Consurf 43 is a popular tool for identifying regions of crucial biological function in proteins by analysing the evolutionary dynamics of amino acid substitutions among homologous sequences. Protein structure, MSA and phylogeny tree files constructed earlier were uploaded to the Consurf server (https:// consu rf. tau. ac. il/). ML methodology was selected for calculating position-specific evolutionary rates in the phylogeny with the Dayhoff model.