Introduction

NaV1.7 is responsible for the propagation of pain stimuli through the peripheral nervous system. It belongs to the family of voltage-gated sodium channel (VGSC) proteins, which are expressed in both prokaryotes and eukaryotes. The best-characterized biophysical activity is carried out by the \(\alpha\)-subunit, which in eukaryotes is formed by the three-dimensional arrangement of a single polypeptide chain of almost 2,000 amino acids. This chain is organized into four domains (DI–DIV), each composed of six transmembrane helices: the first four, S1–S4, form the voltage-sensing domain (VSD), while the last two, S5–S6, together with the extracellular linker (P-loop) between them, form the selectivity filter (SF) and the central aqueous pore domain (PD), as shown in Fig. 1. The response to changes in membrane potential is mediated by helix S4, thanks to the positively charged amino acids arginine (R) and lysine (K) located along the segment.

Figure 1

Snapshot of the Nav1.7 wild type. (a) Side view showing Domain I in detail: transmembrane helical segments S1, S2, and S3 are colored in green, segment S4 in yellow, segments S5 and S6, which constitute the Pore Domain, in blue, and the P-loop between S5 and S6, which helps to form the Selectivity Filter, in red. Thick dotted lines labelled Intra and Extra delimit the transmembrane region; (b) top view, where Domain I is displayed in full colors as in (a) and the arrangement of the four Domains is clearly visible. Shaded blue, green and orange identify Domains II, III, and IV, respectively.

Note that the positive charges are surrounded and stabilized by amino acids in helices S1 to S3 (Ref. 1). Depolarization of the membrane is believed to move the S4 segment of each domain toward the extracellular side, and this motion is transferred to the pore domain via the intracellular linkers between segments S4 and S5. The resulting conformational change opens the channel pore. Segments S5 and S6 form the scaffold of the pore, while the extracellular linkers form the selectivity filter for Na ions, which is highly conserved and composed of aspartate (D) in DI, glutamate (E) in DII, lysine (K) in DIII, and alanine (A) in DIV. These four residues form a ring that is the geometrically narrowest region of the ion pore2, see Fig. 3. In mammals there are nine isoforms of the \(\alpha\)-subunit, Nav1.1–Nav1.9, whose relative expression varies among tissue types. The isoform considered here, NaV1.7, is encoded by SCN9A and is highly expressed in nociceptors, where it plays a pivotal role in subthreshold membrane depolarization. Single amino acid substitutions found in patients diagnosed with inherited erythromelalgia (IEM), paroxysmal extreme pain disorder (PEPD), small fiber neuropathy (SFN) and painful diabetic neuropathy (PDN)3 have been shown to induce a gain-of-function of the channel.

A significant challenge in this scenario is the large number of possible case studies, together with the difficulty of performing electrophysiology studies at large scale. On the other hand, technologies such as artificial intelligence can be applied to gain insights from multiple data sources and rationalize them, improving clinical decision-making and patient outcomes. This is particularly true in neuroscience, where recent findings4 have shown the power of these tools both as classifiers and as predictors, thus envisaging the use of computational pipelines in workflows fully integrated into clinical applications. Indeed, it was recently suggested5 that the available computerized predictive algorithms “are not infallible” and that a more holistic approach is in order.

Building upon past work by our group6, in this study we provide one such computational pipeline combining homology modeling7, Residue Interaction Networks (RINs)8,9 and machine learning techniques10,11, to determine whether there is a common structural pattern linking gain-of-function pathogenic mutations to each other.

Starting from a given template, we identify a set of sequences representative of known gain-of-function pathogenic mutations (PAT), as well as neutral (NEUTRAL) variants, and use homology modeling to generate the corresponding three-dimensional structures. To make the comparison of these structures feasible, the topology of each structure is then used to map it into a corresponding graph using RINs, thus reducing the dimensionality of the problem and hence its complexity. This step allows a pairwise comparison of all the obtained networks via graph kernel techniques, and their classification through machine learning techniques able to identify common structural patterns. Note that here NEUTRAL refers to genetic variants selected either from annotated sequence substitutions not known to affect the protein’s function or to contribute to genetic disease, or from pseudo mutations between orthologous NaV1.7 proteins in closely related mammals.

Three different templates are presented as case studies. The first is the template used in past work (Ref.6) and serves as a benchmark for our analysis. The other two derive from two other homologous proteins with greater sequence identity to the original sequence. All three templates depict the sodium channel protein in its closed state and have different sequence identities. The comparative results from these three case studies provide useful information on the general potential of the proposed workflow, the optimal conditions for its use, and the limiting factors that need to be addressed to make it a useful tool in clinical applications.

In short, the main aim of our computational workflow is to predict whether a mutation is expected to be deleterious and can be distinguished from all others. We benchmark our method by applying it to the 85 SCN9A mutations reported in Fig. 4a and testing whether it can identify PAT mutations and discriminate them from NEUTRAL ones.

It is important to emphasize that the present work considers only gain-of-function mutations and relies upon the critical assumption that the functional impact of a point mutation is caused only by changes in the residue-residue interaction network, although a point mutation can induce several additional effects that affect the functionality of the channel. Indeed, ion channel function is determined by many factors besides shifts in gating kinetics. At least 30 mutations have been described in the literature as impairing Nav1.7 function12, either through haploinsufficiency or by hindering the selective permeability of the pore. However, these null mutations result in a lower quantity of functional Nav1.7 on the membrane and are associated with Congenital Insensitivity to Pain (CIP), a disease characterized by pain signaling defects13,14.

Results and discussion

Our computational pipeline is sketched in Fig. 2 for the specific case of the human protein Nav1.7. As in Ref.6, the NCBI sequence NP_002968.1 has been chosen as the Wild Type (WT) sequence for our study. It is classified as the first isoform of transcript variant 1, with identification code NM_002977.3 for the chromosomal sequence15. It is composed of four domains, DI–DIV, as pictured in Fig. 3 (top), located at positions 112–410 (DI), 715–978 (DII), 1169–1477 (DIII) and 1488–1784 (DIV). Fig. 3 (bottom) provides a pictorial representation along the sequence, and Fig. 1 a snapshot in both side (a) and top (b) views.

Figure 2

The computational pipeline. Starting from a template and 85 genetic variants, homology models are used to identify the corresponding three-dimensional structures, followed by energy minimization and quality assessment to refine them. RINs are then implemented to map them into their representing graphs and machine learning techniques are used to analyze them and identify patterns.

Figure 3

(Top) Primary structure and domain positions; (Bottom) Schematic illustration of the polypeptide chain structure and localization of pathogenic mutations (PAT) associated with pain conditions (left) and non-pathogenic variants (NEUTRAL) (right). Colored numbers (left) highlight the four different pathologies, color-coded according to the list shown in Fig. 4.

We then create 85 copies of the WT sequence, each with a single amino acid mutation, divided into pathogenic (referred to as PAT in the following) and non-pathogenic (henceforth NEUTRAL) variants. The first 31 PAT mutations have been shown by cell electrophysiology assays to be associated with a gain-of-function of the protein, see Fig. 4a. The additional control group of 54 NEUTRAL mutations is composed of 4 variants not causing biophysical abnormalities (nABN), 17 genetic variants from dbSNP, of uncertain significance or benign, that do not alter the biophysical properties of the channel, and 33 pseudo mutations between orthologous proteins in closely related species. The latter are missense variants already considered in Ref.6 and identified between SCN9A homologous genes from mammalian species sharing > 90% nucleotide sequence identity, as commonly accepted in the community16,17,18,19. See Tables S1 and S2 of the Supplementary Material.

Among the gain-of-function PAT mutations, 18 cause Inherited Erythromelalgia (IEM), 6 are associated with Small Fibre Neuropathy (SFN), 6 cause Paroxysmal Extreme Pain Disorder (PEPD) and 1 is related to painful neuropathy in diabetic patients (PDN). The list of all the considered genetic variants is shown in Fig. 4a, where the first 31 pathogenic variants are color-coded according to the associated disease (Refs. 20–58). As an example, Fig. 4b,c display a close-up of the three-dimensional structure around the I136V mutation, where the isoleucine at position 136 is mutated into a valine.
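The construction of single-point variants from a WT sequence can be illustrated with a minimal sketch. The helper name `apply_point_mutation` and the 10-residue toy sequence are assumptions introduced here for illustration only; they are not part of the original pipeline. Mutation names follow the convention used in the text (e.g. I136V, with 1-based positions).

```python
import re


def apply_point_mutation(sequence: str, mutation: str) -> str:
    """Return a copy of `sequence` carrying one substitution such as 'I136V'.

    Positions are 1-based, as in the mutation names used in the text.
    The original residue is checked against the sequence to catch
    numbering mismatches early.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"cannot parse mutation {mutation!r}")
    old, position, new = match.group(1), int(match.group(2)), match.group(3)
    index = position - 1
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[index] != old:
        raise ValueError(
            f"expected {old} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + new + sequence[index + 1:]


# Toy sequence for illustration; this is NOT the NaV1.7 sequence.
toy_wt = "MAILKVGRSE"
variants = {name: apply_point_mutation(toy_wt, name) for name in ("I3V", "K5R")}
```

With the real WT sequence in place of `toy_wt`, the same loop over the 85 mutation names would produce the 85 variant sequences to be passed to homology modeling.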

Figure 4

[Left] (a) PAT and NEUTRAL genetic variants. PAT mutations are further divided by disease and highlighted with different colors according to Fig. 3. Among the NEUTRAL variants, the 21 known human variants are highlighted in bold and the 4 not causing biophysical abnormalities are also starred. [Right] Mutation I136V: the initial Isoleucine (b) in position 136 is turned into a Valine (c).

Three different template structures, describing the closed state of the sodium channel, have been selected. We used Clustal Omega59 to assess the sequence identity between each template and the original WT. The first template, NavAb, is derived from the crystal structure of the orthologous protein from Arcobacter butzleri (PDB code 3RVY), which captures the channel in a closed-pore conformation with four activated voltage sensors at 2.7 Å resolution, and was modified as described in Ref.6. We shall refer to it as MOESM3 (Ref. 60) hereafter. The sequence identity of the Arcobacter butzleri sequence with the original WT is \(27\%\), while the derived MOESM3 template increases the identity to \(50.8\%\). The second template, NavPaS, was obtained from the eukaryotic American cockroach (Periplaneta americana) in complex with the spider neurotoxin Dc1a via cryo-electron microscopy (cryo-EM) (PDB code 6A90)61 (hereafter 6A90). Comparison of the ligand-free and neurotoxin-bound proteins shows no conformational differences except for VSD2. The template therefore refers to the closed conformation with VSD2 in the “up” conformation. This template shows a global sequence identity of approximately \(32\%\) with the WT sequence. Finally, the last template, hNav1.7, is derived from the eukaryotic protein NaV1.7 of Homo sapiens via cryo-EM (PDB code 6J8J)62 and will be referred to as 6J8J. It represents an inactivated state with all four voltage-sensing domains in the “up” conformation and the intracellular gate closed, meaning that the channel is closed and not available for activation. The sequence identity between the WT sequence and this template is \(97\%\). The detailed alignments of all templates are reported in the Supplementary Material.
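As an illustration of what a sequence identity value measures, the following sketch computes the percent identity of two already aligned sequences. It is not the Clustal Omega implementation: the function name and the choice of denominator (columns where neither sequence has a gap) are assumptions made here, and other conventions give somewhat different percentages.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity between two rows of a pairwise alignment.

    Assumption: columns containing a gap in either sequence are skipped,
    so the denominator is the number of residue-residue columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gapped column: not counted
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy alignment: 5 residue-residue columns, 4 of them identical.
identity = percent_identity("MKT-AIL", "MRTQA-L")
```

For the toy alignment the five ungapped columns contain four matches, so `identity` is 80.0.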

When the primary sequence and the function of the protein are known, we use Direct Coupling Analysis (DCA)63, homology modeling7, and energy minimization to obtain the three-dimensional structure of the mutated sequence. DCA provides an assessment of the quality of a given template and hinges upon the idea that the mutation of any amino acid is constrained by the need to conserve protein function, so that only mutations conserving the original physical properties are allowed. Homology modeling then allows the transfer of a protein’s three-dimensional structure from one sequence to a homologous one. The final step is energy minimization, which removes possible steric clashes and inconsistencies that might arise during this process. We used an in-house implementation of DCA, Swiss-Model64,65 for homology modelling and FG-MD66 for energy minimization. Once the structures have been generated, we use QMEANBrane67 and RAMPAGE68 to evaluate the quality of the obtained structures, including their Ramachandran angles \(\Phi -\Psi\) along the peptide sequence. Note that, in general, this is known to be an important step that cannot be bypassed69. See section “Methods” for details on the above techniques.

After the quality assessment, a set of 85 three-dimensional structures, representing the 85 genetic variants reported in Fig. 4, is available, and the structures need to be compared with each other to test for differences. This is in general a very challenging task when carried out in a full-fledged three-dimensional representation, where the comparison is performed residue by residue. A possible way around this is to map each structure into its graph representation. In this way we clearly lose the three-dimensional information on the spatial position of each amino acid, but we can still compare the relative topologies, which is crucial to identify common patterns. This step can be done via the Residue Interaction Network (RIN) technique, where nodes represent amino acids and edges represent their non-covalent interactions (see last step at the top of Fig. 2). Here, we use RING 2.0 (Refs. 8, 9), which is able to distinguish six different types of interactions: hydrogen bonds, Van der Waals interactions, ionic bridges, \(\pi -\pi\) stacking, \(\pi\)-cation interactions and disulfide bridges. See section “Methods” for more details.
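The graph representation described above can be sketched as a plain adjacency structure. The input format (triples of two residue positions and an interaction type), the type names and the function names are hypothetical choices made for this illustration; they do not reproduce the RING 2.0 output format.

```python
from collections import defaultdict


def build_rin(interactions):
    """Build an undirected residue interaction network.

    `interactions` is an iterable of (residue_i, residue_j, kind) triples,
    where residues are sequence positions and `kind` is a label for the
    non-covalent interaction (the labels used here are illustrative).
    Returns {residue: {neighbour: set_of_kinds}}.
    """
    graph = defaultdict(lambda: defaultdict(set))
    for i, j, kind in interactions:
        if i == j:
            continue  # ignore self-interactions
        graph[i][j].add(kind)
        graph[j][i].add(kind)
    return {node: dict(neighbours) for node, neighbours in graph.items()}


def restrict_to_kind(rin, kind):
    """Keep only the edges that carry the given interaction type."""
    restricted = {}
    for node, neighbours in rin.items():
        kept = {n: {kind} for n, kinds in neighbours.items() if kind in kinds}
        if kept:
            restricted[node] = kept
    return restricted


# Toy interaction list; positions and types are made up for the example.
toy_interactions = [
    (10, 14, "HBOND"),
    (10, 57, "VDW"),
    (14, 57, "IONIC"),
    (10, 14, "VDW"),
]
toy_rin = build_rin(toy_interactions)
hbond_only = restrict_to_kind(toy_rin, "HBOND")
```

Residues without any non-covalent interaction do not appear as nodes in this representation, and a sub-network for a single interaction type can be extracted with `restrict_to_kind`.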

The final step of the computational pipeline is to compare and classify the obtained networks, as shown in the lower part of Fig. 2. To accomplish these tasks we employ Graph Kernels10 to perform pairwise RIN comparison, and then both unsupervised11 and supervised70 machine learning to identify common patterns. We used two different kernels to compare RINs, the Vertex Histogram (VH) and Weisfeiler–Lehman (WL) subtree kernels, as implemented in the GraKeL Python library71. Both are based on the comparison of node labels, where the label of each node (i.e. amino acid) is set to its position in the protein sequence. The essential difference between the two methods is that the VH kernel makes a pairwise comparison of the node labels only, whereas the WL kernel also compares the corresponding subtrees, hence providing a more complete and general comparison.

As for the unsupervised learning, we consider the Dominant Set clustering method11, where a set of elements is organized into clusters in such a way that each cluster is internally homogeneous and distinct from the elements outside it. We also apply Average-Linkage hierarchical clustering72. Supervised learning has also been considered, using the Support Vector Machine (SVM) method70 as implemented in the scikit-learn library73.
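A dominant set is commonly extracted from a similarity matrix with replicator dynamics. The sketch below is a generic, minimal version of that approach and should not be read as the implementation used in this work; the tolerance, the support cutoff, the peeling strategy and all function names are assumptions.

```python
def dominant_set(similarity, tol=1e-9, max_iter=10000, cutoff=1e-5):
    """Extract one dominant set with discrete replicator dynamics.

    `similarity` is a symmetric, non-negative matrix (list of lists).
    The diagonal is ignored. Returns the indices whose weight in the
    converged vector exceeds `cutoff`.
    """
    n = len(similarity)
    a = [[0.0 if i == j else float(similarity[i][j]) for j in range(n)]
         for i in range(n)]
    x = [1.0 / n] * n  # start from the barycentre of the simplex
    for _ in range(max_iter):
        ax = [sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        xax = sum(x[i] * ax[i] for i in range(n))
        if xax == 0.0:
            break  # no similarity left to exploit
        new_x = [x[i] * ax[i] / xax for i in range(n)]
        change = sum(abs(new_x[i] - x[i]) for i in range(n))
        x = new_x
        if change < tol:
            break
    return [i for i, weight in enumerate(x) if weight > cutoff]


def peel_clusters(similarity):
    """Repeatedly extract a dominant set and remove it from the data."""
    remaining = list(range(len(similarity)))
    clusters = []
    while remaining:
        sub = [[similarity[i][j] for j in remaining] for i in remaining]
        members = dominant_set(sub)
        if not members:  # safeguard against an empty support
            members = list(range(len(remaining)))
        clusters.append([remaining[k] for k in members])
        chosen = set(members)
        remaining = [node for k, node in enumerate(remaining) if k not in chosen]
    return clusters


# Toy similarity matrix: items 0-2 and items 3-4 form two tight groups.
toy_similarity = [
    [1.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.1, 0.8, 1.0],
]
toy_clusters = peel_clusters(toy_similarity)
```

On the toy matrix the first extraction returns the three mutually similar items and the second returns the remaining pair, which illustrates the internal homogeneity and external inhomogeneity criteria mentioned above.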

Details on all the above calculations can be found in section “Methods” and in the Supplementary Material.

In short, our computational pipeline (Fig. 2) includes the following steps.

Step 1:

Reconstruction of the three-dimensional structures of the mutated sequences via DCA, homology modelling, and energy minimization. See Figs. 3 and 4;

Step 2:

Quality assessment of the obtained structures. See Figs. 5 and 6;

Step 3:

Mapping of each three-dimensional structure into a network, comparison between networks via Graph Kernels techniques, and identification of pathogenic mutations via machine learning techniques. See Fig. 7.

We stress the particular importance of obtaining high-quality structures from the homology modeling, because any flaw at this stage propagates to all downstream analyses, affecting the network representations and making comparison and classification questionable. As we shall see below, however, this step is quite delicate. Indeed, other templates, in addition to those reported here, could in principle have been selected. For instance, Huang and collaborators69 used the recently determined74,75 atomic structure of the rabbit voltage-gated calcium channel Cav1.1 as a template for Nav1.7 studies. Moreover, other human templates have also been proposed recently76. However, as detailed in the Supplementary Material (see Figures S7 and S8), they could not be used in our case, either because of the low resolution of the original crystal or because of their limited coverage of our WT.

While the reliability of MOESM3 has already been assessed6, that of 6A90 and 6J8J has not. As a preliminary step, we therefore performed a DCA analysis of the 6A90 and 6J8J templates, whose predictions are shown in Fig. 5.

Figure 5

DCA analysis for templates 6A90 and 6J8J. Green points represent the contact map of the original template and red points the corresponding DCA predictions. (a) and (b): full sequence; (c) and (d): interval 1181–1851.

In particular, Fig. 5 displays the comparison of the contact maps for the 6A90 and 6J8J templates (green points) with the corresponding DCA predictions (red points). Fig. 5a,b show this comparison for the full sequence of 6A90 and 6J8J, respectively. Here, the alignment contains a relatively small number of residues and a rather large fraction of gaps, and the predicted contact map is relatively noisy. On the other hand, if we restrict the analysis to the region of the alignment where the fraction of gaps is less than 30%, which mainly corresponds to the region from residue 1181 to residue 1851, then the predictions display a good agreement with the deposited structures 6A90 and 6J8J, as illustrated in Fig. 5c,d. Several reasons could explain the above discrepancy between the experimental deposited structures and the DCA predictions. Firstly, the predicted rearrangement could be due to an allosteric conformation of the protein different from the one seen in the experiments. Secondly, these interactions could be important for a large fraction of the sequences in our alignment, but not necessarily for the crystallized protein. Finally, the discrepancy could be due to crystal artifacts, which are not uncommon for membrane proteins.
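The restriction to well-covered alignment columns can be sketched as follows. The 30% threshold is the one quoted in the text; the function names and the toy alignment are assumptions, and the actual alignment used for DCA is of course far larger.

```python
def gap_fraction_per_column(alignment, gap="-"):
    """Fraction of gap characters in each column of an alignment.

    `alignment` is a list of equal-length strings, one per sequence.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows must have the same length")
    depth = len(alignment)
    return [sum(row[col] == gap for row in alignment) / depth
            for col in range(length)]


def well_covered_columns(alignment, max_gap_fraction=0.30):
    """0-based indices of columns whose gap fraction is below the threshold."""
    fractions = gap_fraction_per_column(alignment)
    return [col for col, frac in enumerate(fractions) if frac < max_gap_fraction]


# Toy alignment of four sequences and five columns.
toy_alignment = ["MK-LV", "M--LV", "MKALV", "-K-LI"]
kept_columns = well_covered_columns(toy_alignment)
```

In the toy alignment the third column is gapped in three of four sequences and is therefore discarded, while the other columns are kept.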

Bearing this information in mind, we next proceed to the quality analysis of the WT for template 6A90 with QMEANBrane and RAMPAGE, as shown in Fig. 6. The quality analysis for the other templates has also been performed and is reported in the Supplementary Material, see Figures S1 and S2. Moreover, for each template and each point mutation we report the quality value resulting from the QMEANBrane analysis, see Tables S3–S10. It is worth noting that the analysis in Fig. 6 is mainly relevant for the protein segments corresponding to the transmembrane regions, i.e. all the \(\alpha\)-helices (see Fig. 1), because this is the critical region, as previously discussed.

Figure 6a shows the predicted local similarity to the target structure, and highlights how the produced models have high quality within each domain and low quality in the inter-domain regions, where the reliability of the models is significantly lower. However, this is acceptable, as the inter-domain regions are formed by unstructured loops outside the membrane area (see Figs. 1 and 4), whereas all the considered mutations fall within the high-quality domain regions. As will be further elaborated below, the low-quality loop regions enter the network analysis without, however, jeopardizing the soundness of the results for the important high-quality domain regions.

Additional insights can be obtained by performing a Ramachandran plot analysis, to assess whether the obtained structures obey all characteristic constraints provided by both steric hindrance and quantum chemistry. This is done in Fig. 6b, and shows that indeed 97.9% of the produced amino acids fall within a favorable region, 2% fall within a permitted region and only one amino acid falls in the forbidden region (the white region).

Figure 6

Quality assessments for the 6A90 Wild Type. (a) Predicted local similarity along the sequence: regions with high similarity are circled and correspond to domains DI–DIV; (b) Ramachandran plot, where each point represents an amino acid and characteristic regions are highlighted in different colours. Regions colored in white are considered forbidden.

Armed with all this information, we now turn to the comparison of the different structures obtained. As anticipated, our strategy hinges upon mapping each three-dimensional structure into a corresponding network (univocally identified by its topology) and then comparing the networks with each other. We use RIN analysis to perform this mapping, and both the Vertex Histogram (VH) kernel and the Weisfeiler–Lehman (WL) subtree kernel with 5 iterations to compare the networks, as illustrated in section “Methods”. The similarity between all considered graphs can be visually assessed via the similarity matrices reported in Fig. 7 for the three templates MOESM3, 6A90 and 6J8J. Each similarity matrix in Fig. 7a–f (left panel) has rows and columns labeled according to the mutation ids 0–84 of Fig. 4. Accordingly, ids 0–30 correspond to PAT mutations while ids 31–84 are NEUTRAL variants. Each cell (i, j) of a matrix shows the similarity value between the i-th and j-th RINs, color-coded so that lighter colors correspond to RINs with a high degree of similarity (from yellow to blue). Clearly, the main diagonal always shows the lightest color, being the result of the comparison of a graph with itself.

Figure 7

[Left] Similarity matrices of the Weisfeiler–Lehman (5 iterations) (a)–(c)–(e) and Vertex Histogram (b)–(d)–(f) kernels applied to RINs resulting from MOESM3, 6A90 and 6J8J templates. [Right] Dominant set classification for the WL similarity matrix (c) of 6A90 template: first (g) and second (h) iterations.

Three main points emerge from the analysis of the results of Fig. 7a–f:

  • In both the MOESM3 and 6A90 templates two well-defined clusters are clearly visible. The first groups together PAT mutations (ids 0–29), which have a high degree of similarity with each other and a low degree of similarity with NEUTRAL mutations (ids 31–84). Likewise, NEUTRAL mutations are similar to each other but different from PAT ones.

  • No such division into clusters is visible for the 6J8J template, where there is no clear distinction between PAT and NEUTRAL mutations. Although a few genetic variants appear to display a more marked difference, the majority show a high degree of similarity with each other. The reasons for this marked contrast with what has been observed for templates MOESM3 and 6A90 are unclear. One possible explanation stems from the observation that 6J8J is a human protein template representing the native state of the NaV1.7 sodium channel in the closed state. The 6J8J primary sequence is therefore nearly identical to the WT sequence of NaV1.7, and hence to all 85 genetic variants considered (each variant differs from the WT by just one amino acid). Hence the models reconstructed via homology modeling and energy minimization will also yield nearly identical RINs, whose differences can hardly be captured by any pattern recognition algorithm such as that employed here. This is not the case for the two other templates, MOESM3 and 6A90, where the three-dimensional structures generated via homology modeling might differ significantly from one another, so that energy minimization might drive them into different local minima. See the Supplementary Material, Figure S3.

  • For all three templates, the WL and VH kernels show a similar pattern. This means that the pairwise comparison of the node labels alone, as done by the VH kernel, is already sufficient to capture the main features. The WL kernel performs a more general comparison, since the subtree structure of each node is also considered. Hence, while a pairwise comparison of the node labels is sufficient to discriminate the PAT and NEUTRAL classes, the subtree structure and labels are also found to be coherent with this classification.

We also examined the role of each interaction (H-bond, Van der Waals, and Ionic) separately and found the same pattern observed in the whole RINs. All these results can be found in the Supplementary Material, see Figures S4 and S5.

Further support for the above results can be obtained using unsupervised machine learning techniques. To this aim we apply the Dominant Set (DS) algorithm11 to the above similarity matrices. The results for matrix (c) of the 6A90 template are reported in the right panel of Fig. 7. Diagrams (g) and (h) are composed of 85 rows, labeled with the ids of the considered genetic variants, and two columns. In the first column (labeled correct clusters) the first 31 PAT mutations are color-coded in black, and the remaining NEUTRAL variants are color-coded in yellow. The second column (labeled DS clusters) reports the classification according to the DS algorithm. Diagram (g) is the result of one iteration of the DS algorithm. Remarkably, the algorithm correctly classifies 96.8% of the pain-related PAT mutations, color-coded in black. Only one pain mutation (T1596I, id 30) is not correctly classified. This result is not surprising: looking at matrix (c), it is evident that mutation 30 is very different from those in the PAT group and instead very similar to those in the NEUTRAL group. Among the NEUTRAL variants, color-coded in yellow, 13 variants are not classified correctly, 4 of which are human variants. An additional iteration of the DS algorithm on the variants originally classified in the black group further splits it into two clusters, as shown in diagram (h): the first cluster, shown in blue, contains the 13 neutral variants plus the deleterious mutation (A863P, id 11), and the second one, shown in black, contains all the other deleterious mutations. Note that the lower degree of similarity between mutation (A863P, id 11) and the rest of the PAT group is also visually evident from the similarity matrix (c) and may be due to a lower quality of the structure model.

Consistent results are obtained by applying average-linkage hierarchical clustering72 to the distance matrix derived from matrix (c), as presented in Fig. 8. Here, the leaves of the dendrogram are labeled with the ids of the considered genetic variants and the scale on the left shows the distance among clusters. Note that the PAT mutations are clearly separated from the NEUTRAL variants, as for the DS algorithm, but mutations 11 and 30 are further singled out, as suggested by the second iteration of the DS algorithm, diagram (h).
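For illustration, average-linkage clustering on a kernel-derived distance matrix can be sketched as below. How the distance matrix is derived from the similarity matrix is not specified in this section; the kernel-induced distance used here, \(d_{ij}=\sqrt{k_{ii}+k_{jj}-2k_{ij}}\), is one common choice and is an assumption, as are the function names and the toy kernel matrix.

```python
import math


def kernel_to_distance(kernel):
    """Kernel-induced distance d_ij = sqrt(k_ii + k_jj - 2 k_ij).

    Assumption: this is one standard conversion, not necessarily the one
    used to produce the dendrogram discussed in the text.
    """
    n = len(kernel)
    return [[math.sqrt(max(kernel[i][i] + kernel[j][j] - 2.0 * kernel[i][j], 0.0))
             for j in range(n)] for i in range(n)]


def average_linkage(distance, n_clusters):
    """Agglomerative clustering: merge the two clusters with the smallest
    mean pairwise distance until `n_clusters` clusters remain."""
    clusters = [[i] for i in range(len(distance))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(distance[i][j]
                            for i in clusters[a] for j in clusters[b])
                mean = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or mean < best[0]:
                    best = (mean, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(cluster) for cluster in clusters]


# Toy kernel matrix with two groups: {0, 1} and {2, 3}.
toy_kernel = [
    [4, 3, 0, 0],
    [3, 4, 0, 1],
    [0, 0, 4, 3],
    [0, 1, 3, 4],
]
toy_distance = kernel_to_distance(toy_kernel)
toy_groups = average_linkage(toy_distance, n_clusters=2)
```

Recording the merge distance at each step, rather than stopping at a fixed number of clusters, would give the heights needed to draw a dendrogram.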

Figure 8

Average-linkage hierarchical clustering of the 6A90 template. The leaves of the dendrogram are labeled with the ids of the considered genetic variants. The scale on the left shows the distance among variants. The input distance matrix has been derived from the WL kernel similarity matrix of the 6A90 template. The big red box highlights the cluster of the pain PAT mutations. The small red boxes highlight mutations ids 11 and 30.

The clustering results for templates MOESM3 and 6J8J can be found in the Supplementary Material, see Figure S6.

We also applied supervised machine learning in the form of the Support Vector Machine (SVM) method, with cross-validation to determine its predictive ability. Cross-validation is a standard technique for evaluating prediction models, see section “Methods” for a detailed description. It is particularly useful when only a small dataset is available, as in our case. The accuracy of the predictions for the three considered templates is shown in Table 1. The WL kernel predictions outperform the VH ones for all templates, and the accuracy results are in line with the kernel results: the ability of the VH and WL kernels to discriminate between pain-related mutations and non-pain ones is also reflected by the SVM learning method. In particular, the prediction accuracy is very good for templates MOESM3 and 6A90 and rather poor for 6J8J, in agreement with previous findings.
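The logic of cross-validation on a small, unbalanced dataset can be sketched with a stratified k-fold scheme. The sketch is generic: the number of folds, the random seed and the function names are assumptions, and the classifier is passed in as a callable so that an SVM (or any other model) can be plugged in. The class sizes, 31 PAT and 54 NEUTRAL, are those given in the text.

```python
import random
from collections import Counter


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds, preserving class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal out round-robin
    return [sorted(fold) for fold in folds]


def cross_validated_accuracy(labels, k, fit_predict, seed=0):
    """Mean accuracy over held-out folds.

    `fit_predict(train_idx, test_idx)` must train on the training indices
    and return one predicted label per test index.
    """
    correct = 0
    for test_idx in stratified_folds(labels, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        predictions = fit_predict(train_idx, test_idx)
        correct += sum(p == labels[i] for p, i in zip(predictions, test_idx))
    return correct / len(labels)


labels = ["PAT"] * 31 + ["NEUTRAL"] * 54


def majority_baseline(train_idx, test_idx):
    """Trivial reference model: always predict the most frequent class."""
    counts = Counter(labels[i] for i in train_idx)
    winner = counts.most_common(1)[0][0]
    return [winner] * len(test_idx)


baseline_accuracy = cross_validated_accuracy(labels, k=5,
                                             fit_predict=majority_baseline)
```

The majority baseline always predicts NEUTRAL and therefore scores 54/85, about 63.5%, which is the level any kernel-based classifier has to exceed on this dataset to be informative. With a precomputed kernel matrix, `fit_predict` would slice the matrix with the index lists it receives.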

A word of caution is in order here. At first sight, the contrast between the high sequence identity of the 6J8J hNav1.7 human template with the WT and the corresponding low accuracy of the prediction might appear striking. However, this is in line with past work, as anticipated earlier, and can be ascribed to the difficulties of homology modelling when comparing sequences with too high identity (see e.g. Ref.77), whereas it performs better when comparing different species. A good example is provided by a recent study by Huang et al.69, where the atomic structure of the rabbit voltage-gated calcium channel Cav1.1 provides a very good template for homology-based structural modeling of Nav channels notwithstanding a comparatively low value (\(21\%\)) of sequence identity. This holds all the more within our computational pipeline, because we perform a pairwise comparison of structures to identify emerging global patterns. Additional human templates with lower sequence identity have also been recently proposed76, but in that case the reduced identity can be ascribed to a significantly lower quality of the original deposited structure and to the much lower number of represented amino acids, as detailed in the Supplementary Material.

Table 1 Prediction accuracy of the SVM method for the three considered templates.

The general scenario emerging from the previous analysis strongly suggests the existence of some critical mutations that lead to deleterious effects. This is an important issue with far-reaching consequences in the case of the human genome78. Even at the level of globular proteins, it is well known that single mutations of specific amino acids can change folds and functions in a very controlled way79,80. However, this case appears to be more subtle. For instance, it is interesting to note that M1532I is a PAT genetic variant and M1532V is a NEUTRAL genetic variant, notwithstanding the fact that isoleucine (I) and valine (V) are both hydrophobic amino acids with quite similar chemical structures. On this basis, M1532I and M1532V were expected to behave similarly. Yet there is significant evidence in the literature that the former is a pathogenic variant81, whereas M1532V is reported in dbSNP82 without information about its association with a clinical condition or its minor allele frequency. The case of these two variants at the same position with different predicted impact highlights the importance of extending this study with new variants, in order to capture even very subtle differences between residues that can discriminate between high-impact and neutral variants. While this case is particularly striking, it is not the only one (see e.g. Huang et al.69). In fact, we did not find any well-defined correlation between the physicochemical properties of the mutated amino acids and their putative pathogenic impact.

We can build on this idea in our case by presenting two different analyses that provide concurring results supporting the above predicted scenario.

The first analysis is prompted by the ability of the VH kernel to discern between PAT and NEUTRAL mutations, and hinges on the node frequency analysis reported in section “Methods”. Recalling that each node represents an amino acid of the protein sequence, labeled by its position in the sequence itself, and that the VH kernel is based on the comparison of node labels, we compute the frequency of each node label within the PAT and the NEUTRAL groups. Consider a node with label l (i.e. the amino acid in position l in the protein sequence). We look at all RINs associated with the PAT group, which can originate from a mutation of any node \(l^{\prime }=1,\ldots ,2000\) including \(l^{\prime }=l\), and compute the fraction of RINs of that group in which l is involved. A frequency 1 then means that l has non-covalent bonds in all mutations belonging to the PAT group, whereas a frequency 0 indicates that l has no bonds except the covalent ones. The same analysis can then be performed for the NEUTRAL group, with a similar interpretation.

Figure 9

Relative frequency of nodes for pathogenic (PAT) and non pathogenic (NEUTRAL) mutations. Vertical dotted lines identify regions of different behaviour between the two groups, color-coded according to their specificity (see text).

Figure 9 shows the frequency of nodes along the protein sequence for the PAT and NEUTRAL groups. The ranges of the four domains are highlighted in orange along the sequence axis, and the dashed lines identify the intervals exhibiting a frequency variation in one of the two groups and not in the other. Note that the majority of nodes have frequency one in both groups, meaning that the corresponding RINs are very similar for large parts of the protein sequence. For the PAT group, this includes almost entirely the regions involved in the formation of the four domains, except for the terminal tracts of DII and DIV. Where the frequency is not one, the PAT and NEUTRAL groups appear to be characterized by two substantially different patterns: the PAT group shows two broad bands located in the regions 400–720 and 950–1150, while the NEUTRAL group is characterized by three more sharply peaked bands: 400–610, 850–1000, and from 1700 onwards. Both sets of bands are highlighted with thick dotted lines in Fig. 9, red for the PAT group and green for the NEUTRAL group.

Figure 10

Entropy profile of the full alignment: red dots highlight PAT mutations and blue dots the NEUTRAL ones. The horizontal line is the mean entropy value of the alignment in the mutated sites.

The second analysis is based on the entropy profile that can be obtained from the MSA computed within the DCA approach, as outlined in section “Methods”. The main idea is that low entropy is characteristic of conserved positions, so we expect low entropy values for PAT mutations and high entropy values for NEUTRAL mutations. Note that ’low’ and ’high’ are here defined with respect to the average value of the entropy in the mutated sites. Note also that here, unlike in the node frequency analysis, the entropy value refers to the l-th node (amino acid) that has been mutated.

As Fig. 10 shows, only four PAT mutations occur in amino acids with higher-than-average entropy, so the overwhelming majority of these mutations occur in conserved regions with low entropy. By contrast, most of the NEUTRAL mutations occur at positions with higher-than-average entropy, indicating that those mutations occur in regions with high mutation propensity. Not surprisingly, the PAT mutations with low entropy affect residues located in the DI (\(\approx 250\)), DII (\(\approx 800\)), DIII (\(\approx 1300\)), and DIV (\(\approx 1700\)) domains, in agreement with the intuitive fact that evolution has optimized sequences that try to avoid PAT mutations. When contrasted with the analysis of Fig. 9, we note that the second and the fourth of the above regions are again located in the VSD domain, but the other two are not.

It would be interesting to pursue a better characterization of those mutations that do not follow this simple rule. However, a much larger dataset than that available at the present time would be needed to properly discuss the pathogenicity of some particular mutation. Ideally a detailed characterization of the effect of all possible mutations for that particular wild type residue would be desirable.

Conclusions

In short, we have implemented a computational pipeline to analyze the effects of single-residue mutations in the human NaV1.7 channel. Our aim was to identify emerging patterns associated with gain-of-function mutations that lead to diseases by impairment of protein activity. The general workflow includes the following steps: start with a specific template and perform a preliminary test of its reliability via DCA analysis; use homology modeling and energy minimization to find the three-dimensional structures associated with each of the single amino acid variants; assess the quality of the obtained structures; map the three-dimensional structures into the corresponding topological networks via RIN creation; and apply machine learning to the similarity index of different RINs to identify emerging patterns.

We have applied this workflow to the specific case of different mutations of the protein NaV1.7, which is involved in the propagation of nociception signals. We examined 31 gain-of-function mutations (PAT) that have been shown by cell electrophysiology assay to cause a functional impairment of the channel, 21 variants from dbSNP that are benign or of uncertain significance, and 33 pseudo mutations identified among SCN9A homologous genes16,17,18,19 from mammalian species sharing >90% nucleotide sequence identity. In the context of inter-species differences, it is important to remark that wild-type (normal) rodent (rat) and human Nav1.8 display substantial biophysical differences, and these make DRG neurons carrying the human Nav1.8 more excitable than DRG neurons carrying the rat channel83. Compared to rat, the human wild-type Nav1.8 could be judged to be “pathogenic”. By contrast, the application of our computational pipeline to the set of the selected 85 mutations for a specific template unambiguously identified the class of PAT mutations as deleterious mutations and clearly distinguished them from the NEUTRAL ones. Comparing the patterns of each structure relative to the considered mutations, we were able to recognize those mutations having the same physiological meaning and sharing a common pattern, supporting the reliability of our computational pipeline as a predictive tool for deleterious mutations.

While the starting point was the MOESM3 template in the closed state, as in past work by some of us6, we extended it in several different respects. Firstly, we took full advantage of the arsenal of techniques recently implemented in the realm of artificial intelligence. This includes a full rather than point-like analysis of RINs, as well as both unsupervised and supervised machine learning analysis. Secondly, the specific analysis was expanded by including two additional templates, 6A90 and 6J8J, also for the closed state of the channel. Interestingly, we found a prediction accuracy of the order of \(95\%\) for NavAb MOESM3, which has a \(51\%\) sequence identity with the original WT, a prediction accuracy of the order of \(97\%\) for NavPaS 6A90, which has a \(31\%\) sequence identity with the original WT, and finally a prediction accuracy of \(63\%\) for the hNav1.7 6J8J template, which has a sequence identity of \(97\%\) with the original WT. This is in line with past studies69 and shows that a carefully selected non-human template with intermediate sequence identity is the optimal choice for the present computational pipeline. More generally, our findings indicate that the proposed computational pipeline might be accurate and precise also in other similar cases, thus paving the way toward the possibility of using this tool as a preliminary analysis to predict the pathogenicity of SCN9A mutations and to better direct candidate variants to confirmatory cell electrophysiology assays. Within this framework, it is worth emphasizing that human pain processing is very complex, involving higher levels of processing than in lower species such as bacteria. As a result, the human channel is so finely tuned that perhaps even the most sophisticated computational predictions fail to capture subtle and small but functionally important changes5,83.

In perspective, a number of shortcomings clearly emerge from our analysis. Firstly, the lack of a robust and extended dataset of human variants known to be unrelated to any pathology forced the use of different types of variants. Secondly, the variable resolution of the deposited experimental structures requires a preliminary analysis of the template in order to pursue a reliable analysis. This was clearly highlighted by our DCA analysis, which showed that all three templates used presented a significant number of gaps (i.e. missing residues). Finally, the use of so many different techniques, while instructive, may be cumbersome when applied to a large number of different inputs. A simpler and more direct workflow would clearly be desirable.

It would be extremely interesting to be able to extend the present study of the sodium channel protein by analyzing different conformational states besides the closed one, in order to understand if the ability to distinguish PAT mutations is preserved also after a structural change. We hope to be able to address this and other points in a future dedicated study.

In conclusion, this study represents a pilot application of prediction methods to separate deleterious from neutral genetic variants in the Nav1.7 sodium channel, supporting the selection of candidates for cell electrophysiology assay.

Methods

Direct coupling analysis (DCA)

Roughly speaking, the idea underlying DCA is that the mutation of any amino acid is constrained by the need to conserve protein function. In particular, a mutation of a residue involved in an important interaction with a second one can only be fixed during evolution if the new mutated residue has physical properties similar to those of the original one, or if the second residue is also mutated in a concerted way. This mechanism gives rise to patterns of correlated mutations that can be detected in a multiple sequence alignment (MSA) of proteins of the same family and used to infer pairs of residues that are most likely interacting in the tertiary structure. DCA has been shown to predict with good accuracy the contact map of proteins belonging to sufficiently large protein families.

For this study we used an in-house code84, which computes DCA using the pseudolikelihood approximation85. The inference was performed only for the regions of the alignment in which the fraction of gaps is less than 30%, which mainly correspond to the region from residue 1280 to residue 1840.

In particular, the DCA analysis was obtained by performing an initial alignment comprising the sequences Q15858 (Homo sapiens), Q62205 (Mus musculus), O08562 (Rattus norvegicus) and Q28644 (Oryctolagus cuniculus) from the Uniprot86 database. The alignment was used as a seed for searching homologous sequences in the Uniprot and Metaclust87 databases. A hidden Markov model profile of the seed was built with hmmbuild with default parameters. We searched the profile using hmmsearch with default parameters. We found 45447 matches on Uniprot and 104741 on Metaclust. We concatenated the sequences obtained from the two databases into a single MSA and filtered out all those sequences with more than 50% of gaps. The filtered MSA contained only 6637 sequences. In order to select a region of the protein for which we could obtain a lower fraction of gaps, we selected from the full MSA only those columns with less than 30% of gaps, and we again filtered out those sequences with more than 70% of gaps. This restricted MSA contains 19045 sequences.
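The two gap-filtering routes described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the in-house code: the MSA is represented as a list of equal-length strings with `-` as the gap symbol, and the function names and the toy alignment are hypothetical.

```python
def gap_fraction(symbols, gap="-"):
    """Fraction of gap symbols in an aligned sequence or in a column."""
    return symbols.count(gap) / len(symbols) if symbols else 0.0


def drop_gappy_sequences(msa, max_gap_fraction):
    """Discard sequences whose gap fraction exceeds the threshold."""
    return [seq for seq in msa if gap_fraction(seq) <= max_gap_fraction]


def keep_dense_columns(msa, max_gap_fraction):
    """Keep only columns whose gap fraction is below the threshold.

    Returns the restricted MSA and the indices of the retained columns.
    """
    n_cols = len(msa[0])
    kept = [j for j in range(n_cols)
            if gap_fraction("".join(seq[j] for seq in msa)) < max_gap_fraction]
    return ["".join(seq[j] for j in kept) for seq in msa], kept


# Toy alignment (hypothetical data, for illustration only).
toy_msa = ["MKVLA-", "MKILA-", "M-VLAG", "----A-"]

# Route 1: remove sequences with more than 50% of gaps.
filtered_msa = drop_gappy_sequences(toy_msa, 0.5)

# Route 2: from the full MSA keep columns with less than 30% of gaps,
# then remove sequences with more than 70% of gaps.
restricted_msa, kept_columns = keep_dense_columns(toy_msa, 0.3)
restricted_msa = drop_gappy_sequences(restricted_msa, 0.7)
```

Note that the second route starts again from the full alignment rather than from the output of the first, as in the text, which is why it can retain more sequences.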

The Uniprot database was downloaded on 2018/11/27, while we used Metaclust release 2018-06 for metagenomic data.

Homology modelling

Homology modeling relies on evolutionarily related structures (templates) to generate a structural model of the protein of interest (target). For this task we used Swiss-Model65, which is widely known and used in the literature. It takes a sequence S and a template T (in PDB format) as input. The output is a tentative three-dimensional (3d) structure (in PDB format) associated with the sequence S. Swiss-Model also includes an energy minimization step to resolve small structural distortions, unfavourable interactions or clashes introduced during the modelling process. Besides the 3d structure, the quality of the obtained model is also evaluated.

Energy minimization

The structures produced by homology modeling are often closer to the template than to their native structures. It is therefore necessary to perform an energy minimization step using dedicated tools. To this purpose we employed FG-MD66, which uses a multiple-template approach to reshape the energy landscape from golf-course-like to funnel-like and thus drive the energy minimization closer to the native state.

Quality assessment

QMEANBrane67 exploits the increasing availability of deposited high-resolution membrane protein structures to adapt knowledge-based methods to this class of proteins. It is known that the properties of membrane proteins are strongly influenced by their interaction with phospholipid tails, but a clear division into a membrane region and a soluble region does not adequately reflect the variation in molecular properties along the membrane axis. To capture these differences, QMEANBrane divides the structure into three parts: an interface zone consisting of all those residues whose \(C_\alpha\) atoms lie within 5 Å of the defined membrane planes; a membrane region comprising all those residues that lie between the two planes and more than 5 Å away from them; and finally, a soluble region consisting of the remaining amino acids.

Residue interaction networks (RIN)

Given a protein structure, its corresponding RIN is generated by inspecting its atoms and looking for non-covalent bonds between them. We used RING 2.08 to perform this task. Given the PDB representation of a protein, the RING 2.0 algorithm generates the graph in two steps. The first identifies a list of residue-residue pairs eligible to undergo an interaction based merely on distance measurements. The second characterizes every contact by identifying the specific type of interaction, which can be hydrogen bond, van der Waals, ionic bridge, \(\pi\)-\(\pi\) stacking, \(\pi\)-cation or disulfide bridge.

RING 2.0 allows the RIN generation according to various parameters that the user can select. All RINs produced in this study have been generated using the default options. In particular, for the Network policy parameter we used the Closest option, where all atoms of a residue-residue pair are considered to measure the distance; for the Interaction type we used the Multiple option, where RING reports multiple interactions per residue pair but only one interaction per interaction type; for the Distance threshold we used the Strict option, where each type of interaction has a specific threshold depending on its strength, as reported in Table 2.
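As an illustration of the first, purely distance-based step with the Closest policy, the following minimal sketch lists the residue pairs whose closest pair of atoms lies within a cutoff. It is not the RING 2.0 implementation: the data layout (a dictionary from sequence position to a list of atom coordinates), the function names and the cutoff value are assumptions made for the example, and the second step (assignment of the interaction type with type-specific thresholds) is not reproduced.

```python
import math
from itertools import combinations


def closest_atom_distance(atoms_a, atoms_b):
    """Smallest distance between any atom of one residue and any atom of the other."""
    return min(math.dist(p, q) for p in atoms_a for q in atoms_b)


def candidate_pairs(residues, cutoff):
    """Residue pairs (by sequence position) eligible for an interaction.

    residues: dict mapping sequence position -> list of (x, y, z) coordinates.
    cutoff:   hypothetical distance threshold in angstrom.
    """
    return [(i, j) for i, j in combinations(sorted(residues), 2)
            if closest_atom_distance(residues[i], residues[j]) <= cutoff]


# Toy coordinates (hypothetical data, for illustration only).
toy_residues = {
    10: [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    11: [(3.0, 0.0, 0.0)],
    50: [(10.0, 0.0, 0.0)],
}
pairs = candidate_pairs(toy_residues, cutoff=2.5)
```

The nodes of the resulting network are the sequence positions appearing in at least one retained pair, which is the property exploited by the node frequency analysis.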

Table 2 Threshold distance values for the Strict option of RING 2.0.

Graph kernels

Graph kernels can be intuitively understood as functions measuring the similarity of pairs of graphs. We used two different kernels to compare RINs, the Vertex Histogram (VH) and Weisfeiler–Lehman (WL) subtree kernels, as implemented in the GraKeL Python library71. Both of them are based on the comparison of node labels, where the label of each node is the position of the corresponding amino acid in the protein sequence. The essential difference between the two methods is that the VH method makes a pairwise comparison of the node labels only, whereas the WL method also compares the corresponding subtrees, hence providing a more complete and general comparison. A formal introduction to graph kernels can be found e.g. in88,89,90. Here follows a brief description of the VH and WL kernels:

  • Vertex Histogram kernel: given a graph G, its vertex histogram is defined as a vector counting, for each possible node label, the number of nodes in G having exactly that label. Hence, all input graphs on the same set of node labels can be compared through their vertex histograms, and the kernel function summarizes the result of such a comparison;

  • Weisfeiler–Lehman subtree kernel: the key idea of the Weisfeiler–Lehman algorithm is to replace the label of each vertex with a multiset label consisting of the original label of the vertex and the sorted set of labels of its neighbors. The resulting multiset is then compressed into a new, short label. This new label reflects the knowledge of the node and its neighborhood. The relabeling process is then repeated for h iterations. By performing this procedure simultaneously on all input graphs, it follows that two vertices from different graphs will get identical new labels if and only if they have identical multiset labels. The kernel function in this case compares the node labels of the graphs resulting after each iteration and summarizes the comparison with a real number. It can be shown that this is equivalent to comparing the number of shared subtrees between the two input graphs (the kernel considers all subtrees up to height h).

Clustering methods

We apply two different clustering methods: Dominant Set (DS)11 and Average-Linkage72. The DS method tries to answer the problem of organizing a set of elements in clusters, in such a way that each group satisfies an internal homogeneity and expresses an external inhomogeneity between the groups. The algorithm behind this approach does not require any assumption underlying the data representation and does not require prior knowledge on the number of clusters to be obtained, as it is able to determine them in sequence.

Average Linkage is a well-known clustering method that creates clusters by proceeding in a bottom-up way. Given as input the distance matrix of a set of objects, it starts by assigning each object to its own cluster and then proceeds by joining at each step the two most similar clusters. In Average Linkage hierarchical clustering, the distance between two clusters is defined as the average distance between each object in one cluster and every object in the other cluster. We used the R Package91 implementation of Average Linkage clustering and displayed the result as a dendrogram where all input objects are shown on the leaves and the scale represents the distance between clusters.
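The bottom-up procedure can be sketched in a few lines. This is a naive illustration of the Average Linkage rule under stated assumptions, not the R implementation used here: the input is a symmetric distance matrix given as a list of lists, and the function name and the toy data are hypothetical.

```python
def average_linkage(dist, n_clusters=1):
    """Naive agglomerative clustering with the Average Linkage rule.

    dist: symmetric distance matrix (list of lists).
    Returns the final clusters and the list of merges (cluster, cluster, height),
    which is the information displayed in a dendrogram.
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []

    def cluster_distance(a, b):
        # Average of all pairwise distances between the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the pair of clusters at the smallest average distance.
        height, p, q = min(
            (cluster_distance(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        merges.append((clusters[p], clusters[q], height))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters, merges


# Toy example (hypothetical): four objects placed at 0, 1, 5 and 6 on a line.
points = [0, 1, 5, 6]
toy_dist = [[abs(a - b) for b in points] for a in points]

two_clusters, _ = average_linkage(toy_dist, n_clusters=2)
_, all_merges = average_linkage(toy_dist, n_clusters=1)
merge_heights = [height for _, _, height in all_merges]
```

The merge heights correspond to the scale of the dendrogram: the two tight pairs are joined at distance 1, and the two resulting clusters at their average distance of 5.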

Support vector machine

In the basic configuration, a Support Vector Machine (SVM) works with two classes and considers linearly separable problems, that is, data points that can be separated by a hyperplane. Considering data as n-dimensional labeled points, the SVM searches for the hyperplane that separates points of different classes while maximizing the distance between the boundary and the data points. As for any supervised learning method, the model’s prediction accuracy must be evaluated. To this aim we use the cross-validation technique, as implemented in the scikit-learn library73.

Cross-validation is one of the standard techniques to check the prediction ability of the SVM supervised learning method. Roughly speaking, given a dataset of known data (the training set), cross validation divides the dataset into k partitions and, for k times, one partition at a time is used as validation set, while the rest of the dataset remains as training set. At each iteration a value of prediction accuracy is measured and, at the end, the mean value of such measures will be the final prediction accuracy result. In our case the training set consists of (the representation of) the 85 considered genetic variants and we set \(k = 10\), that is, we repeat the partition into validation set and training set 10 times.
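The k-fold procedure can be sketched as follows. This is a minimal illustration and not the scikit-learn implementation used in this study: folds are taken as consecutive blocks without shuffling or stratification, k is assumed not to exceed the number of samples, and the `fit`/`predict` callables, together with the toy threshold classifier standing in for the SVM, are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Split the indices 0..n_samples-1 into k consecutive, nearly equal folds."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(samples, labels, fit, predict, k=10):
    """Mean validation accuracy over k folds."""
    accuracies = []
    for validation in k_fold_indices(len(samples), k):
        held_out = set(validation)
        train = [i for i in range(len(samples)) if i not in held_out]
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, samples[i]) == labels[i] for i in validation)
        accuracies.append(correct / len(validation))
    return sum(accuracies) / len(accuracies)


# Hypothetical stand-in classifier: threshold halfway between the class means.
def fit_threshold(xs, ys):
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


toy_samples = list(range(20))
toy_labels = [0] * 10 + [1] * 10
toy_accuracy = cross_validate(toy_samples, toy_labels,
                              fit_threshold, predict_threshold, k=5)

# With 85 variants and k = 10, the folds contain 9 or 8 variants each.
fold_sizes = [len(fold) for fold in k_fold_indices(85, 10)]
```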

Cross-validation is useful to avoid overfitting problems in accuracy evaluation. The boundary found by the SVM to divide two classes is determined by a small subset of points, called support vectors. These points are the closest to the boundary, so adding more data does not affect predictions unless the new data are considerably closer to the boundary than the support vectors. Since we do not have a large dataset, the use of cross-validation helps to perturb the boundary. In fact, at each step the support vectors are composed of different points. If cross-validation produces a good result for accuracy, it means that the problem is highly linearly separable and the model has a good level of generalization. Even with a large dataset, cross-validation gives more reliable results, which are not strictly tied to a particular partition into training and testing datasets.

Node frequency analysis

This analysis takes into consideration the nodes of the RINs. We recall that each node represents an amino acid of the protein sequence, labeled by its position in the sequence itself. The idea is then to compute the frequency of each node, with the aim of highlighting the role of the nodes in pathogenic and non-pathogenic mutations.

In particular, we computed the frequency of each node label within the pathogenic mutations (PAT group) and, separately, within the normal variants (NEUTRAL group). For instance, a node label l (i.e. the amino acid in position l in the protein sequence) that is present in all RINs of the pathogenic mutations has frequency one for the PAT group. If the same node label l is also present in all RINs of the normal variants, its frequency is one for the NEUTRAL group as well. By comparing the two frequencies we could then argue that l is not a sensitive node position with respect to the PAT and NEUTRAL genetic variants.
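A minimal sketch of this computation is given below. It assumes that each RIN is reduced to the set of node labels (sequence positions) it contains; the function names, the tolerance parameter and the toy data are hypothetical and serve only to illustrate the comparison between the two groups.

```python
def node_frequencies(rins, n_positions):
    """Fraction of RINs in a group that contain each node label 1..n_positions.

    rins: list of sets of node labels, one set per RIN of the group.
    """
    counts = {pos: 0 for pos in range(1, n_positions + 1)}
    for nodes in rins:
        for pos in set(nodes):
            counts[pos] += 1
    return {pos: count / len(rins) for pos, count in counts.items()}


def differing_positions(freq_a, freq_b, tol=0.0):
    """Positions whose frequency differs between the two groups by more than tol."""
    return [pos for pos in freq_a if abs(freq_a[pos] - freq_b[pos]) > tol]


# Toy groups (hypothetical data, for illustration only).
pat_rins = [{1, 2, 3}, {1, 2}, {1, 3, 4}]
neutral_rins = [{1, 2, 3, 4}, {1, 2, 3, 4}]

freq_pat = node_frequencies(pat_rins, n_positions=5)
freq_neutral = node_frequencies(neutral_rins, n_positions=5)
sensitive = differing_positions(freq_pat, freq_neutral)
```

In this toy case position 1 has frequency one in both groups and position 5 has frequency zero in both, so neither discriminates between the groups, whereas positions 2, 3 and 4 do.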

Entropy analysis

The MSA calculated for the DCA analysis can also be used to calculate the entropy profile of the protein, with the idea of shedding light on which amino acids along the chain are most likely to change during evolution. The entropy profile \(S_i\) of our MSA has been calculated by first computing the frequency \(f_i(\alpha )\) of each non-gap symbol \(\alpha\) at a given position i, and then summing \(S_i=-\sum _{\alpha }\;f_i(\alpha )\ln (f_i(\alpha ))\) according to the Shannon formula. A high entropy is indicative of a high mutation propensity and, vice versa, low entropy is characteristic of conserved positions.
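The entropy profile can be sketched as follows. The MSA is assumed to be a list of equal-length strings with `-` as the gap symbol; normalizing the frequencies over the non-gap symbols of each column is an assumption of this sketch, since the normalization is not specified in the text, and the function names and toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy S_i = -sum_a f_i(a) ln f_i(a) over the non-gap symbols."""
    symbols = [c for c in column if c != gap]
    if not symbols:
        return 0.0  # convention for an all-gap column (assumption)
    n = len(symbols)
    return -sum((count / n) * math.log(count / n)
                for count in Counter(symbols).values())


def entropy_profile(msa):
    """Entropy of every column of the alignment."""
    return [column_entropy([seq[i] for seq in msa]) for i in range(len(msa[0]))]


# Toy alignment (hypothetical data, for illustration only).
toy_alignment = ["AKV", "ARV", "AK-", "AR-"]
profile = entropy_profile(toy_alignment)
```

Here the first and third columns are fully conserved (entropy zero), while the second column, with two equally frequent symbols, has entropy \(\ln 2\). Mutated sites can then be classified as low or high entropy by comparison with the mean of the profile over those sites.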