In silico screening and analysis of nonsynonymous SNPs in human CYP1A2 to assess possible associations with pathogenicity and cancer susceptibility

Cytochrome P450 1A2 (CYP1A2) is one of the main hepatic CYPs involved in metabolism of carcinogens and clinically used drugs. Nonsynonymous single nucleotide polymorphisms (nsSNPs) of this enzyme could affect cancer susceptibility and drug efficiency. Hence, identification of human CYP1A2 pathogenic nsSNPs could be of great importance in personalized medicine and pharmacogenetics. Here, 176 nsSNPs of human CYP1A2 were evaluated using a variety of computational tools, of which 18 nsSNPs were found to be associated with pathogenicity. Further analysis suggested possible association of 9 nsSNPs (G73R, G73W, R108Q, R108W, E168K, E346K, R431W, F432S and R456H) with the risk of hepatocellular carcinoma. Molecular dynamics simulations revealed higher overall flexibility, decreased intramolecular hydrogen bonds and lower content of regular secondary structures for both cancer driver variants G73W and F432S when compared to the wild-type structure. In case of F432S, loss of the conserved hydrogen bond between Arg137 and heme propionate oxygen may affect heme stability and the observed significant rise in fluctuation of the CD loop could modify CYP1A2 interactions with its redox partners. Together, these findings propose CYP1A2 as a possible candidate for hepatocellular carcinoma and provide structural insights into how cancer driver nsSNPs could affect protein structure, heme stability and interaction network.

The human CYP1A2 gene is located on the long arm of chromosome 15 (15q24.1) that spans seven exons. The CYP1A2 protein is exclusively expressed in liver and plays an important role in metabolism of heterocyclic and aromatic amines, caffeine and melatonin [14][15][16][17][18] . This enzyme is also responsible for hepatic metabolism of many clinically used drugs such as tacrine, zolpidem, clozapine, theophylline and so on [19][20][21][22][23] , while the other two enzymes of this family do not have significant role in drug metabolism.
The human CYP1A2 gene encodes a heme-binding protein composed of 516 residues. The three-dimensional (3D) structure of the protein covering residues 34-513 has been determined in complex with the inhibitor α-naphthoflavone, while the structure of the N-terminal transmembrane helical domain is lost in this crystal structure 14 . According to the crystal structure (PDB ID: 2HI4 14 ), CYP1A2 holds fifteen α-helices and five β-sheets 14 . The iron atom of the heme prosthetic group is coordinated by the Cys458 of the protein moiety which belongs to the consensus signature of cytochrome P450 proteins (PROSITE signature PS00086) 24 . In addition, arginine 137 (R137) from C helix is hydrogen bonded to the heme propionate oxygen and further stabilizes its position.
CYPalleles is a web page which was developed to standardize the nomenclature of human cytochrome P450 alleles (http://www.cypal leles .ki.se/ 9 ). It also provides genetic information and the molecular effect of the variants on the enzyme activity. More than 20 alleles have been reported for CYP1A2 gene in CYPalleles, among them, CYP1A2*6 (R431W), CYP1A2*8 (R456H), CYP1A2*11 (F186L), CYP1A2*15 (P42R) and CYP1A2*16 (R377Q) are the most studied alleles [25][26][27][28][29] . Nevertheless, structural or functional consequences of the vast majority of nsSNPs for CYP1A2 recorded by the NCBI dbSNP database have not yet been determined. Since CYP1A2 is one of the main hepatic CYPs involved in the bioactivation of carcinogens and metabolism of clinically used drugs, SNPs of this enzyme could affect cancer susceptibility or drug efficiency. Therefore, the identification and evaluation of CYP1A2 pathogenic nonsynonymous SNPs (nsSNPs) are of major importance. This is also helpful in personalized medicine and optimization of drug treatment to achieve the most efficiency and least side effects. In this study, nsSNPs of CYP1A2 gene were evaluated by computational tools to identify pathogenic nsSNPs. We also performed MD simulation to assess how these nsSNPs affect the protein structure.

Methods
Data collection. The human CYP1A2 protein sequence was obtained from UniProt database 30 (UniProt ID: P05177). SNP data for CYP1A2 gene were retrieved from NCBI dbSNP 31 build 150. All nucleotide positions were related to GRCh37.p13 (hg19) annotation release 105. The three-dimensional structure of the CYP1A2 protein (PDB ID: 2HI4 14 ) was downloaded from Protein Data Bank (https ://www.rcsb.org 32 ).
In silico evaluation of nsSNPs. In silico evaluation of CYP1A2 nsSNPs was performed using a variety of computational tools in a stepwise fashion where the output of each step was served as the input for the next one. SIFT 33 44 and MutPred2 45 are sequence-and structure-based tools which utilize the userprovided sequence information and the self-extracted structural features to predict if SNPs are associated with functional effects.
We categorized the tools into three groups (Table 1). SIFT 33 46 were employed to assess the likelihood that a variant is pathogenic. CHASM-3.1 47 was used to identify possible cancer driver variants. All prediction scores were received directly from their own web servers except for VEST-4 and CHASM-3.1 which  53 . The crystal structure of the CYP1A2 protein (PDB ID: 2HI4 14 ) was used as the starting structure for wild-type (WT) protein after removing the ligand (alpha-naphthoflavone) atomic coordinates. The initial structure of the variant proteins was generated from WT structure using mutate tool of Swiss-Pdb Viewer v4.1.0 54 . The proteins were immersed in a cubic box of TIP3P water molecules. An adequate number of water molecules was replaced by counter ions to neutralize the systems. Each neutralized system was then subjected to steepest descent energy minimization until the maximum force fell below 500 kJ mol −1 nm −1 . In order to equilibrate the solvent and ions around the proteins, two position-restrained MD simulations were carried out. The temperature and pressure of the systems were controlled at 300 K and 1 bar by V-rescale thermostat 55 and Parrinello-Rahman barostat 56 , respectively. After equilibration, each system was subjected to 200 ns (ns) unrestrained MD simulation considering the similar conditions as two previous position-restraint simulations. The LINCS algorithm 57 was used to constrain the bonds with hydrogen atoms and the particle mesh Ewald method 58 was employed for long range electrostatic interactions. The Cut-off distance for the Lennard-Jones, short-range and long-range electrostatic interaction was set to 12 Å. A time step of 2 fs was used for integrating Newton's equations of motion.

Trajectory analysis and visualization.
Most of the trajectory analyses reported in this study were performed by built-in utilities of GROMACS package version 5.0.5 52 . The root mean square deviation (RMSD), root mean square fluctuation (RMSF), radius of gyration (Rg), and intramolecular hydrogen bonds were analyzed using gmx rms, gmx rmsf, gmx gyrate and gmx hbond of GROMACS package, respectively. The secondary structure content of the proteins was calculated as a function of time using the DSSP program 59 . The principal component analysis (PCA) was conducted using gmx covar and gmx anaeig. To perform free energy landscape (FEL) analysis, all-atom RMSD with respect to the average structure and radius of gyration were initially obtained for the analyzed time frames and then were employed by gmx sham module of GROMACS for calculation of Gibbs free energy as well as construction of FEL. A conformation with minimum free energy was extracted as the representative structure for visualization. The three-dimensional structures of the proteins were visualized using Chimera 1.11 60 . The CaPTURE program 61 was used to explore cation-pi interactions of the snapshots extracted from the MD trajectories.

Results
The SNP dataset. The nsSNPs of human CYP1A2 gene were retrieved from the NCBI dbSNP database 31 build 150. The nsSNPs which met at least one of the following criteria in the validation method were entered to the evaluation: (1) sequenced in 1000Genome project (1000G), (2) validated by multiple independent submissions to the refSNP cluster, (3) validated by frequency or genotype data, (4) genotyped by HapMap project, (5) validated by submitter confirmation, and (6) observed in at least two chromosome apiece. The nsSNPs which have no information on validation method (did not have any of the mentioned criteria) were excluded. Among them, there were four known alleles of CYP1A2 which were listed in CYPalleles including P42R (CYP1A2*15), S212C (CYP1A2*12), R377Q (CYP1A2*16) and N397H (CYP1A2*18). Since these alleles have been frequently studied, we made an exception for these nsSNPs and included them in our analyses. Totally, 176 nsSNPs were prepared for analysis (Supplementary Table S1). More than half of the nsSNPs occured in exon 2 (n = 94) and the others were mapped in exons 3 (n = 10), 4 (n = 8), 5 (n = 17), 6 (n = 14) and 7 (n = 33). The G to A transition is the most frequent nucleotide substitution (29.5%) found among all analyzed variations followed by C to T (23.9%), A to G (6.8%) and C to A (6.8%). At the protein level, the most common amino acids as the reference and missense were Arg (n = 45) and Leu (n = 17), respectively. The replacements of Arg with Trp (n = 12, 6.8%), Gln (n = 12, 6.8%), Cys (n = 7, 4.0%), and His (n = 7, 4.0%) and substitution of Asp by Asn (n = 7, 4.0%) are the most frequent amino acid substitutions (Fig. 1).
In silico evaluation of nsSNPs. As shown in Fig. 2, a total of 176 nsSNPs for human CYP1A2 gene were evaluated in a multi-step framework. A variant must be voted by all of the tools to proceed to the next step of the analysis. Firstly, all nsSNPs were evaluated by SIFT, PROVEAN, MutationAssessor, LRT, FATHMM-MKL, EFIN, CADD, PolyPhen2 and SNAP2 to identify functional nsSNPs. As a result, 38 nsSNPs were agreed to be associated with functional effects by all of the used methods (Supplementary Table S2). Subsequently, the isolated nsSNPs were subjected to pathogenicity evaluation using SuSPect, MutPred2, PMUT, PhD-SNP and  Table 2). The evolutionary conservation profile was calculated for the amino acid position of these pathogenic variants using ConSurf 49,50 web server. The conservation scores calculated by this server range from 1 to 9, and discriminate between highly variable and highly conserved positions, respectively. The results included one position (136)  The filtered pathogenic variants were further analyzed with CHASM-3.1 47 to assess possible association with cancer susceptibility. CHASM-3.1 consists of cancer-specific classifiers which allow to predict cancer driver variants depending on a particular cancer type. Since CYP1A2 is a hepatic enzyme, we selected liver-viral (hepatocellular carcinoma) to compute the cancer driver scores. The results reported in Table 3 revealed a possible association with hepatocellular carcinoma for G73R, G73W, R108Q, R108W, E168K, E346K, R431W, F432S and R456H variants (P-value < 0.05).

Evaluation of nsSNPs occurring in transmembrane helix. The CYP1A2 is a membrane-bound
protein which is anchored to the endoplasmic reticulum membrane through an N-terminal transmembrane helix. However, to date, no complete structure for CYP1A2 including this region has been determined. Hence, the sequence of CYP1A2 protein was submitted to TMHMM server v2.0 to predict the transmembrane helix. According to the server's estimation, the transmembrane helix includes residues 7 to 28. Nine substitutions including S10L, L15F, S18C, S18Y, A19P, F21L, F25C, F25S and V27M have occurred in the transmembrane helical region, none of which were found to be associated with pathogenicity. Moreover, evolutionary conservation analysis of the nsSNPs located in this transmembrane helix did not found any conserved amino acid position other than Ser10 (Fig. 3).

Molecular dynamics simulation.
In order to determine which of the cancer driver nsSNPs should be subjected to MD simulation, we used all evaluation tools with stringent threshold of effectiveness/deleteriousness (Fig. 4). As a result, two cancer driver nsSNPs G73W and F432S voted by all the tools were selected for the structural evaluation by MD simulation. The structure of CYP1A2 (PDB ID: 2HI4 14 ) after removal of the ligand (alpha-naphthoflavone) was used as the wild-type (WT) protein. The initial structure of the G73W and F432S variants was obtained by substitution of the corresponding residues in the WT structure. Finally, variant and WT structures were subjected to 200 ns MD simulation to explore possible impacts of the substitutions on protein structure. www.nature.com/scientificreports/ Root mean square deviation (RMSD) of the alpha carbon (Cα) atoms for each frame with respect to the starting (1D-RMSD) and to all other frames (2D-RMSD) as well as radius of gyration (Rg) along the simulation time were calculated (Fig. 5). By comparing the 1D-RMSD trend it was found that the G73W (1.95 ± 0.20 Å) behaves more or less similar to the WT (1.94 ± 0.21 Å), whereas the F432S demonstrates minor deviation in the Cα atom positions (2.24 ± 0.28 Å). The 2D-RMSD plots indicate that WT and G73W variant converged to relatively stable conformations after about 40 ns of simulations (Fig. 5C), while for F432S variant, such a stable conformation was achieved after about 80 ns, suggesting that F432S has experienced more structural changes before running out into a stable structure (Fig. 5C). The measurement of Rg as a function of the simulation time also implied that the structures converged after about 80 ns (Fig. 5B). Taking these findings together and to be statistically comparable, the analyses were focused on those trajectories obtained from the last 120 ns of simulations (from 80 to 200 ns) for all the three proteins.
In order to gain more insight into the local structural changes around substitution sites, we extracted a conformation with minimum free energy as the representative structure using free energy landscape analysis from the last 120 ns of each MD simulation. Gly73 is located in a short loop just after A helix (residues 61-72). Substitution of this residue by tryptophan renders the indole ring of Trp to be captured by the positive charge of the guanidinium group of Arg90. As a result, a cation-pi interaction formed between Trp73 and Arg90 after about 39 ns of the simulation. The distance between the indole ring and guanidinium group is maintained at about 4 Å for 70 ns, after that the magnitude of the fluctuations increased (Fig. 6A). In this regard, 161 snapshots were extracted at every 1 ns from 40 to 200 ns of the simulation time and explored for cation-pi interactions by CaPTURE program 61 . The results taken from CaPTURE also confirmed the formation of cation-pi interaction between Trp73 and Arg90, although it was attenuated after 120 ns (Fig. 6B). On the other hand, analysis of the secondary structure showed the C-terminal end of the A helix became unstable after establishment of cation-pi interaction between Trp73 and Arg90 (Fig. 6C). The next substitution site, Phe432, is located in a 3 10 helix flanked by helices K' and K". In WT structure, most of the residues located within a radius of 5 Å from side chain of Phe432 are nonpolar residues, of which, Ala370, Leu373, Trp421, Ile440, Leu444 and Met448 have been shown to be involved in van der Waals interactions with the aromatic moiety of Phe432. By comparing WT and F432S representative structures, it was cleared that substitution of serine residue with a small polar side chain for F432 has led to disappearance of these interactions. A total of 176 nsSNPs were entered into the analysis by a variety of computational tools. In first step, 38 out of 176 nsSNPs were found to be associated with functional effects. Among them, 18 nsSNPs were predicted as pathogenic. Finally, 9 nsSNPs (G73R, G73W, R108Q, R108W, E168K, E346K, R431W, F432S and R456H) were also found to be cancer drivers. G73W and F432S cancer driver variants were then subjected to 200 ns molecular dynamics simulation. Conformations with minimum free energy were extracted as the representative structure for wild-type, G73W and F432S CYP1A2 proteins using FEL analysis and were visualized and superimposed using UCSF Chimera 1.11 (www.cgl.ucsf.edu/chime ra). We also conducted further analyses to explore the overall structural changes upon substitutions. The secondary structure content of the proteins was also measured during the analyzed time window. Both variants showed a decrease in the β-sheet (Fig. 7A) and α-helical content (Fig. 7B). The average number of β-sheet forming residues was reduced from 40 ± 2 in WT to 34 ± 3 in G73W variant and 33 ± 3 in F432S variant. The average number of residues participating in α-helix was also decreased from 220 ± 4 in WT to 212 ± 7 and 213 ± 6 in G73W and F432S, respectively. Detailed analysis of secondary structure elements revealed disruption of β-sheet 3′ in G73W variant (Fig. 7C) and β-sheet 4 in F432S variant (Fig. 7D). The results implied that no α-helix structure was completely lost, they were just shortened by one or more residues.
On the other hand, analysis of hydrogen bonds implied a decrease in the number of intramolecular hydrogen bonds in both variants as the average number of hydrogen bonds was reduced from 376 ± 9 in WT to 365 ± 9 and 363 ± 9 in G73W and F432S variants, respectively (Fig. 7E). It was also observed that the number of hydrogen bonds with occupancy above 70% has decreased from 264 in WT to 237 and 249 in G73W and F432S variants, respectively. The reduction in the number and strength of hydrogen bonds suggested a gain in the overall flexibility of the variants upon substitutions. So, to examine whether these substitutions affect protein overall flexibility, we performed principal component analysis (PCA). The Eigenvectors and eigenvalues were obtained from Table 2. Pathogenicity evaluation of functional CYP1A2 nsSNPs. Pred: Prediction, S: Score, P-value: Probability that a benign variant is misclassified as pathogen, FDR: False discovery rate, P: Pathogenic, N: Neutral/Benign. The nsSNPs classified as pathogenic by all five methods are highlighted in bold. www.nature.com/scientificreports/ diagonalization of the covariance matrices of the Cα atoms, and the principal components were generated by projecting the trajectories on the respective eigenvectors ( Fig. 7F and 7G). The trace of the diagonalized covariance matrix was found to be 530.27 Å 2 , 693.11 Å 2 and 931.91 Å 2 for WT, G73W and F432S variants, respectively, confirming an increase in the overall flexibility of the variants, of which the increase in the F432S variant is more drastic compared to that of G73W. www.nature.com/scientificreports/ In order to provide more insight into the protein structural flexibility, RMSF of the Cα atoms as a function of residue number was calculated over the last 120 ns (Fig. 8A). The RMSF graph has been highlighted with color blocks indicating α-helices and β-strands according to CYP1A2 crystallographic structure. The differences in per-residue RMSF (ΔRMSF) for G73W and F432S Cα atoms with respect to the WT were also measured and visualized in Fig. 8B. Positive values indicate more flexible residues and negative values show less flexible residues compared to those of WT. As seen in Figs. 8B and 8C, a significant increase in flexibility was measured for β-sheet 3′ and its flanking loops of G73W variant. Disruption of β-sheet 3′ due to breaking of two hydrogen bonds between Ala297 and Asn300 confirmed the higher flexibility in this region (Fig. 7C). In case of F432S variant, a sharp increase in fluctuation of the CD loop was particularly significant (Figs. 8B and 8C). Increased flexibility was also observed in F helix, FG loop, G helix and GH loop.

Pred (S) Pred (S) Pred (S) Pred (S) Pred (S, P-value,
Calculation of RMSD for Cα of the CD loop during the entire course of the F432S simulation demonstrated displacement of this loop after about 14 ns of the simulation (Fig. 9A). In addition, monitoring of the CD loop interactions revealed that F432S has lost the hydrogen bonding network in this region of the protein Table 3. Assessing the cancer susceptibility of pathogenic CYP1A2 nsSNPs using CHASM-3.1. P-value: Probability that a passenger variant is misclassified as driver, FDR: False discovery rate.   Table S3). The salt bridge between Asp152 from this loop and Arg281 from GH loop has also disrupted (Fig. 9B). The removal of these interactions thought to be the reason for displacement and higher mobility of the CD loop as well as GH loop in F432S variant. Another notable change was the significant weakening of the conserved hydrogen bond between Arg137 of the C helix and the heme propionate oxygen which occurred shortly after CD loop movement (Fig. 9A). By looking at the results described for CD loop, it may be concluded that displacement of the CD loop together with its increased flexibility have induced breakage of the Arg137-heme hydrogen bond.

Discussion
In this study, we performed a comprehensive in silico evaluation to identify CYP1A2 gene pathogenic nsSNPs using a wide variety of computational tools. To our knowledge only one study has been carried out to evaluate the nsSNPs of human CYP1A2 gene. Wang et al. using two tools SIFT and PolyPhen analyzed the functional impact of thirty-three nsSNPs of CYP1A2 gene and reported eleven nsSNPs as damaging substitutions 62 . We expanded our study to include more nsSNPs and hypothesized that a more reliable and accurate estimate of a substitution consequence could be provided by using a variety of computational methods that follow different approaches to distinguish between pathogenic and neutral variants. Although all predictive methods have been developed to estimate whether a given substitution has functional/pathogenic effect, it does not necessarily mean that they can elucidate the mechanism how the SNPs affect protein function or cause disease. This question could be explored using other experimental or computational techniques including MD simulation.
To test our hypothesis, we initially annotated the nsSNPs using a variety of computational methods to distinguish between functional and neutral variants. Assessing the pathogenicity of functional nsSNPs identified 18 pathogenic nsSNPs. Evolutionary conservation analysis indicated that almost all of the pathogenic nsSNPs occupy conserved amino acid positions. Moreover, the results obtained from CHASM-3.1 revealed a possible association between G73R, G73W, R108Q, R108W, E168K, E346K, R431W, F432S and R456H with risk of developing hepatocellular carcinoma.
Among nsSNPs predicted as cancer drivers, G73W and F432S were still voted by all methods even after applying more stringent thresholds. Accordingly, these variants were subjected to 200 ns MD simulations to explore the effect of substitutions on the protein structure. Findings demonstrated that these substitutions change protein structural features not only in proximity of the substituted residues but also in spatially distant regions. Both variants experienced a reduction in the number and strength of intramolecular hydrogen bonds as well as in β-sheet and α-helical content. Results derived from the principal component analysis (PCA) confirmed an increase in the overall flexibility especially for F432S variant. A drastic increase was also found for the CD loop flexibility which is a long serine-rich stretch (residues 148-158) extended into the solvent. Increased mobility of the CD loop has been recently reported upon simulation of R377Q 27 . In this regard, the experimentally observed loss of the enzymatic activity in R377Q variant has been attributed to the reduced heme stability due to the increased flexibility of the C helix, which is adjacent to the CD loop. The C helix is also adjacent to the heme prosthetic group and interacts with heme propionate oxygen via a conserved arginine residue (Arg137). Hence, any change in flexibility of the C helix could affect the stability of the heme. Moreover, C helix is one of the main regions involved in interaction with redox partners like cytochrome b5 (CYB5). CYPs binding to CYB5 is mediated through a groove on the proximal surface of the protein which includes C helix. There are also growing evidences for the involvement of CD loop in binding of some CYPs to the CYB5, although it appears the CYB5 interactive elements of various CYPs are type specific. For example, the interacting region on CYP3A4 in apo form consists of helices B, C, D, BB' and CD loops, β-bulge and meander region while on CYP2E1 is provided by helices C, J' , L, β-bulge and meander region 63,64 . Taken together, it seems reasonable to expect that the high mobility of the CD loop in F432S may affect heme stability as well as interaction with CYB5.

Conclusion
CYP1A2 is one of the main hepatic CYPs involved in the bioactivation of carcinogens and metabolism of clinically used drugs. Hence, nsSNPs of this enzyme could affect cancer susceptibility and drug efficiency. In current study, using a variety of computational tools, 38 out of 176 nsSNPs of human CYP1A2 gene were predicted as functional variants. The functional nsSNPs were further analyzed to trace possible association with pathogenicity and cancer susceptibility. As a result, 18 nsSNPs predicted as pathogenic, of which G73R, G73W, R108Q, R108W, E168K, E346K, R431W, F432S and R456H variants were also found to be associated with hepatocellular carcinoma. We also performed 200 ns MD simulations to explore how G73W and F432S cancer driver variants affect the protein structure. Simulation results revealed several significant structural alterations, particularly for F432S variant. Among them, increased flexibility of the CD loop and loss of the hydrogen bond between heme and Arg137 from C helix were the most prominent ones, because they could affect the heme stability as well as the protein interaction with cytochrome b5. These findings may be considered in designing experimental studies and provide novel insights into understanding the structure-function relationship in CYP1A2 and other CYPs. The gray and blue blocks indicate α-helices (denoted with letters) and β-strands (denoted with numbers), respectively. WT protein is shown in black, G73W in green and F432S in red. (B) Change in RMSF (ΔRMSF) of the Cα atoms for G73W (green) and F432S (red) with respect to the WT as a function of residue number. Positive and negative values of ΔRMSF indicate more and less flexible residues compared to the WT ones, respectively. (C) Three-dimensional representation of G73W and F432S structures colored based on ΔRMSF values. A Conformation with minimum free energy was extracted as the representative structure for G73W and F432S using FEL analysis and were visualized using UCSF Chimera 1.11 (www.cgl.ucsf.edu/chime ra). The position of β-sheet 3′ in G73W, CD and GH loops in F432S, showing significant differences in ΔRMSF values are marked on corresponding structures.