Pathogenic nsSNPs that increase the risks of cancers among the Orang Asli and Malays

Single-nucleotide polymorphisms (SNPs) are the most common genetic variations associated with various complex human diseases, including cancers. Genome-wide association studies (GWAS) have identified numerous SNPs that increase the risks of cancers such as breast cancer, colorectal cancer, and leukemia. These SNPs were cataloged for scientific use. However, GWAS are often conducted on certain populations, and the Orang Asli and Malays were not included. Therefore, we developed a bioinformatic pipeline to mine the whole-genome sequence databases of the Orang Asli and Malays to determine the presence of pathogenic SNPs that might increase the risks of cancers among them. Five different in silico tools, SIFT, PROVEAN, PolyPhen-2, Condel, and PANTHER, were used to predict and assess the functional impacts of the SNPs. Out of the 80 cancer-related nsSNPs from the GWAS dataset, 52 nsSNPs were found among the Orang Asli and Malays. They were further analyzed using the bioinformatic pipeline to identify the pathogenic variants. Three nsSNPs, rs1126809 (TYR), rs10936600 (LRRC34), and rs757978 (FARP2), were found to be the most damaging cancer pathogenic variants. These mutations alter the protein interface and change the allosteric sites of the respective proteins. As the TYR, LRRC34, and FARP2 genes play important roles in numerous cellular processes such as cell proliferation, differentiation, growth, and cell survival, any impairment of the protein function could be involved in the development of cancer. rs1126809, rs10936600, and rs757978 are the important pathogenic variants that increase the risks of cancers among the Orang Asli and Malays. The roles and impacts of these variants in cancers will require further investigations using in vitro cancer models.

Single nucleotide polymorphisms (SNPs) are the major type of genetic variation in humans (~90%). Thus far, around 500,000 SNPs have been reported in the coding regions of the human genome 1 . Among these, the nonsynonymous SNPs (nsSNPs) change the amino acid residues of the protein sequences and may have damaging or neutral effects on the protein functions or structures 2,3 . Damaging nsSNPs may affect the function or structure of a protein by modifying the protein charge, geometry, hydrophobicity 4 , stability, dynamics, translation, and protein interactions 5,6 . These are probably the significant factors that contribute to the functional diversity of encoded proteins in the human population 7 . Therefore, many human diseases could be due to these damaging nsSNPs.
Previous studies have shown that nsSNPs cause numerous genetic disorders such as inflammatory and autoimmune disorders and cancers [8][9][10] . With massive human genome sequence data now available and the functional effects of many SNPs still unknown, a more cost-effective approach is required to unravel the effects of uncharacterized SNPs. Many studies have used bioinformatics tools to predict the deleterious effects of nsSNPs on protein functions that result in diseases before expensive in vitro or in vivo experiments are conducted. Two nsSNPs in the ABCB1 gene have been associated with breast cancer, and comprehensive bioinformatics analysis predicted that these SNPs are deleterious and change the protein conformation 11 . A similar study using functional and structural bioinformatics tools identified three damaging nsSNPs that alter the function and structure of the RNASEL protein. These nsSNPs are most likely pathogenic and associated with increased prostate cancer susceptibility 12 . nsSNPs in the KRAS gene have been found to be associated with lung cancer due to their damaging effects on the structural features of the protein. The structure and function of the native proteins were found to be altered by the nsSNPs using a pipeline comprising several bioinformatics tools 13 . A recent study identified deleterious nsSNPs in the hOGG1 gene that altered the secondary structure of the expressed protein and destabilized its local conformation, which increases the risk of lung cancer 14 . Furthermore, in silico modeling has been widely used to assess the functional impacts of nsSNPs and their possible roles in cancers 15,16 . In silico modeling has the advantage of being able to make rapid predictions of the mechanisms of action of a wide range of compounds in a high-throughput mode. Another advantage is that predictions can be made based on the structure of a compound before it is synthesized 17 .
Databases of human variants with different scopes and contents have been developed and are used to predict diseases 18 in pursuit of personalized medicine 19 . The genome-wide association study (GWAS) database (https://www.ebi.ac.uk/gwas/) is widely used to associate SNPs with diseases. Although there are other human variant databases such as ClinVar, COSMIC, SwissVar, and Humsavar, GWAS is the only database that catalogs disease-associated variants across different populations. This database also provides information on the statistically significant variants and the associated increase or decrease in risk for each phenotype 20 .
The application of genomics and bioinformatics, together with the availability of data generated from high-throughput technologies, is fundamental to implementing precision medicine, not only for cancers but also for other common and rare diseases 21,22 . Various tools have been used to predict the functional effects of nonsynonymous coding variants using basic sequence homology [23][24][25] ; empirically derived rules 26 ; structural and functional features [27][28][29] ; a weighted average of the normalized scores 30 ; decision trees 31,32 ; support vector machines [33][34][35][36] ; and Bayesian classifiers 27 . A comprehensive systematic evaluation of the performance of these widely used prediction methods in identifying the pathogenicity of SNPs is required 37 . While more new algorithms are being developed, the accuracy of prediction using a combination of different algorithms should be validated. It is recommended that different computational methods be used to determine the impact of different SNPs during the screening step, and that further validation be incorporated when studying the impacts of nsSNPs on specific proteins 38 . In addition, complementary methods could be combined in a meta-server to yield more reliable predictions 39 . Several recent studies have reported the use of a combination of various methods to uncover the potential impact of nsSNPs and to understand the molecular mechanisms of various diseases, including cancers [40][41][42][43][44] . Combining these tools allows more accurate prediction using multiple conservation-based, structure-based, or combined (conservation and structure) methods. Therefore, combined methods and meta-prediction methods (predictors that integrate multi-predictor results) are important for biomedical applications. This is because they can be applied to a much greater number of single nucleotide variants, considering that many human proteins do not currently have an experimentally defined structure or a close homolog from which to construct a model. Thus, combined and meta-prediction methods have a wide range of potential applications using combinations of features yet to be explored 45 .
As GWAS are usually conducted on large populations using high-throughput detection methods and are costly, some world populations have not been studied. Therefore, their disease risks are not available. The Orang Asli still practice traditional healing methods; therefore, records on the incidence of cancers among the Orang Asli are lacking. This has made it challenging for the authorities to plan health programs that ensure the sustainability of the Orang Asli. Given the lack of phenotypic data on cancers, mining the genomes of the Orang Asli to predict their susceptibility to different types of cancers would provide important data that allow scientists to prioritize research focus areas and the authorities to provide relevant funding. In this study, we aimed to develop and validate a bioinformatics pipeline to detect and annotate the cancer-associated nsSNPs in a genome database and predict the structural and functional impacts of these nsSNPs that might increase the risks of cancers among the Orang Asli. Using the same pipeline, we also investigated the cancer risks of the Malays, who constitute the biggest population in Malaysia. The database of the Malay genomes was provided by Wong et al. 46 and lacks information on phenotypic traits; it is therefore of interest to predict the cancer susceptibility risks for this cohort using the established pipeline. The pipeline was developed using multiple bioinformatic tools in order to identify the most deleterious and damaging nsSNPs associated with cancers. It includes the steps used for mining and annotating the genotypes and in silico modeling to predict the structural and functional impacts of genetic variants with unknown functions. New variants with potential impacts will subsequently be investigated in our laboratory using zebrafish models, and genotyping methods targeting the nsSNPs will be developed for population studies. In this study, three-dimensional (3D) protein models of the native proteins and their variants (mutants) were prepared. This is the first report to cover a comprehensive in silico analysis of three (3) nsSNPs, rs1126809, rs10936600, and rs757978, in the TYR, LRRC34, and FARP2 proteins, respectively. This study is part of our initiatives to enhance precision health in our country. The bioinformatics pipeline developed in this study will be used in the future to predict genomic variations associated with different diseases.

Methods
Whole genome sequences. The whole-genome sequences of ninety-eight (98) 46 . Malays are an Austronesian-speaking ethnic group who mainly live in Malaysia, Indonesia, and Singapore in the Southeast Asian region 46 . The mean coverage of the whole-genome sequences of the Singapore Malays across all 96 samples was 47.6×. The depth of coverage for each sample ranged from 35.5× to 81.9×. The genomic DNA of all 96 Malay individuals was collected from the Singapore BioBank. PicoGreen was used to measure fluorescence intensity, and the SpectraMax Gemini EM microplate reader was used to confirm that the DNA concentration was greater than 50 ng/µl using spectrophotometric settings at 480/520 nm (Ex/Em). Subsequently, DNA samples were sent to the Defense Medical and Environmental Research Institute for preparation. Whole-genome sequencing of the 96 Malays was then performed using the Illumina HiSeq 2000 with a target of >30× coverage.
Variant calling was performed using HaplotypeCaller and BaseRecalibrator (GATK v2.5) 98 on the sequence data (BAM file format) of each Orang Asli and Malay sample. HaplotypeCaller was used to detect variants, and BaseRecalibrator was used for base quality score recalibration (BQSR). A VCF file was generated for each sample for quality filtering. Variant filtering was performed using SelectVariants (GATK v2.5) 98 to extract SNPs and exclude variants with a read depth of less than 5 or a Phred-scaled quality score of less than 30.
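The depth and quality cut-offs described above can be illustrated with a minimal Python sketch. It is not the GATK SelectVariants call used in the study; it only applies the same two thresholds to tab-separated VCF records. It assumes that the quality score is the QUAL column and that read depth is stored under a `DP` key in the INFO column, and the function names are hypothetical.

```python
BASES = {"A", "C", "G", "T"}


def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=12;AF=0.5' into a dict."""
    info = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        info[key] = value
    return info


def is_snp(ref, alt):
    """Treat a record as a SNP when REF and every ALT allele are single bases."""
    return ref in BASES and all(a in BASES for a in alt.split(","))


def passes_filter(vcf_line, min_depth=5, min_qual=30.0):
    """Keep SNPs with read depth >= min_depth and QUAL >= min_qual."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt, qual, info = fields[3], fields[4], fields[5], fields[7]
    if qual == "." or not is_snp(ref, alt):
        return False
    depth = parse_info(info).get("DP")  # assumption: depth is reported as INFO/DP
    if depth is None:
        return False
    return int(depth) >= min_depth and float(qual) >= min_qual
```

Records that lack a depth annotation are dropped here; that is a conservative choice made for this sketch and not something stated in the study.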
The study protocol was approved by Universiti
Bioinformatics workflow. High-risk nsSNPs associated with cancer were classified using the GWAS-Catalog as the source of the dataset, and various bioinformatics tools were employed in the workflow (Fig. 1).

Nonsynonymous SNPs datasets for validation.
The sensitivity, specificity, and accuracy of the functional effect prediction were determined using a combination of five different algorithms (SIFT, PolyPhen-2, Condel, PROVEAN, and PANTHER), with and without conservation (ConSurf) and protein stability (I-Mutant) analyses. The standard dataset comprised nsSNPs associated with breast cancer from ClinVar. It includes a total of 100 clinically tested nsSNPs, of which 50 were reported as pathogenic and the other 50 as benign (Table S1). The 100 nsSNPs of this training dataset were randomly chosen out of 1020 clinically tested nsSNPs associated with breast cancer reported in ClinVar, as breast cancer is one of the most commonly studied cancer datasets. Although the dataset is primarily associated with breast cancer, the main purpose of using the training dataset is to test the ability of the pipeline to detect all the deleterious nsSNPs. Additionally, the sample size chosen is sufficient, as Thusberg et al. concluded that the analysis results obtained with a small training dataset (100 SNPs) are comparable to those obtained with a larger one (1000 SNPs) 37 . Datasets of different types of cancer and a larger sample size may also be used to achieve the same objective.
Sensitivity (Se) is the proportion of true-positive results (correct identification of pathogenic variants), according to Eq. (1):
$$\mathrm{Se} = \frac{TP}{TP + FN} \qquad (1)$$
where TP denotes true-positive cases, and FN denotes false-negative cases. Specificity (Sp) is the proportion of true-negative results (correct identification of benign variants), according to Eq. (2):
$$\mathrm{Sp} = \frac{TN}{TN + FP} \qquad (2)$$
where TN denotes true-negative cases, and FP denotes false-positive cases.
Accuracy (Ac) is the ratio of all correct predictions to the total number of predictions, according to Eq. (3):
$$\mathrm{Ac} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (3)$$
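As a worked illustration of Eqs. (1) to (3), the short Python sketch below computes the three metrics from confusion-matrix counts. The function name is hypothetical. The example counts are not reported in the text; they are back-calculated here from the 50 pathogenic and 50 benign nsSNPs of the standard dataset and the percentages reported for Model D (sensitivity 84%, specificity 94%).

```python
def confusion_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # Eq. (1)
    specificity = tn / (tn + fp)                 # Eq. (2)
    accuracy = (tp + tn) / (tp + fn + tn + fp)   # Eq. (3)
    return sensitivity, specificity, accuracy


# Counts inferred for illustration: 42 of 50 pathogenic and 47 of 50 benign
# nsSNPs classified correctly (assumption derived from the reported rates).
se, sp, ac = confusion_metrics(tp=42, fn=8, tn=47, fp=3)
print(f"Se={se:.2f} Sp={sp:.2f} Ac={ac:.2f}")
```

With these counts the accuracy evaluates to 0.89, in agreement with the 89% reported for Model D.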
Datasets. Information on the genetic variants associated with cancers (SNP ID) was retrieved from the GWAS-Catalog database (https://www.ebi.ac.uk/gwas). Residue change, risk allele frequency, phenotype, and protein accession number were retrieved from the NHGRI GWAS Catalog 20 . The dataset was built after 179,365 genetic variants were filtered based on the keywords 'cancer', 'carcinoma', 'glioma', 'leukemia', 'lymphoma', 'melanoma', and 'sarcoma' (Table S2).  52 . SIFT predicts the effects of an amino acid substitution on protein functions. It uses sequence homology and the physicochemical characteristics of amino acids to compute a normalized probability score (SIFT score) for each substitution 25 . PolyPhen-2 predicts the potential effect of an amino acid substitution on both protein structure and function using a combination of multiple homolog sequence alignment-based methods and protein 3D structure. The prediction is given as benign, possibly damaging, or probably damaging according to the difference in position-specific independent count (PSIC) scores between the two variants (native amino acid and mutant amino acid) 27 . Condel predicts the effect of coding variants on protein function based on the ensemble score of multiple prediction tools (SIFT, PolyPhen-2, FATHMM, and Mutation Assessor) 50 . PROVEAN predicts the functional effects of protein sequence variations, including single or multiple amino acid substitutions and in-frame insertions and deletions 104 . PANTHER estimates the likelihood that a particular nsSNP causes a functional effect on the protein using position-specific evolutionary preservation 52 . The description of the tools used is presented in Table 1.
The nsSNPs were considered high-risk if they were predicted to be damaging or deleterious by at least four bioinformatics tools. They were then subjected to further analysis.
Analysis of evolutionary conservation of proteins. ConSurf (consurf.tau.ac.il/) is a bioinformatics tool that was used to predict the evolutionary conservation of amino acids in the CACFD1, RREB1, LRRC34, ETFA, CPVL, INCENP, FARP2, and TYR proteins. It is a web server that builds phylogenetic relationships between homologous sequences to estimate the evolutionary conservation of amino acid positions in a protein or DNA molecule. The conservation analysis of the target proteins was performed to show the significance of each residue position for the protein structure or function. The rate of evolution was determined based on the evolutionary relationship between the protein or DNA and its homologs, and the similarity between amino (nucleic) acids as expressed in the substitution matrix. Furthermore, ConSurf offers an accurate estimation of the evolutionary rate using either an empirical Bayesian approach or a maximum likelihood (ML) method 47 . Protein sequences in FASTA format were used as the input. UniProtKB accession numbers for the sequences are: CACFD1, Q9UGQ2; RREB1, Q92766; LRRC34, Q8IZ02; ETFA, P13804; CPVL, Q9H3G5; INCENP, Q9NQS7; FARP2, O94887; and TYR, P14679. ConSurf produced an output consisting of the protein sequence and the multiple sequence alignment colored by conservation scores. The conservation score ranges from 1 to 9, where 1 to 4 is considered a variable, 5 to 6 an intermediate, and 7 to 9 a conserved amino acid position. For each high-risk nsSNP, we selected the residues with a high conservation score for further analysis.
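The grading scheme above maps directly onto a small lookup. The Python sketch below is only an illustration of the three categories stated in the text; the function name is hypothetical and is not part of ConSurf.

```python
def consurf_category(grade):
    """Map a ConSurf conservation grade (1-9) to the category used in the text."""
    if not 1 <= grade <= 9:
        raise ValueError("ConSurf grades range from 1 to 9")
    if grade <= 4:
        return "variable"
    if grade <= 6:
        return "intermediate"
    return "conserved"


print([consurf_category(g) for g in (1, 5, 9)])
```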
Analysis of protein stability. I-Mutant Suite is a web server (http://gpcr2.biocomp.unibo.it/cgi/predictors/I-Mutant3.0/I-Mutant3.0.cgi) 54 that was used to predict changes in protein stability caused by a single point mutation. This tool is trained on a data set derived from ProTherm, the most extensive database of experimental thermodynamic data on the free energy changes that measure the effect of mutations on protein stability 107 . We submitted the protein sequences of the selected nsSNPs to predict the impact of the damaging nsSNPs on protein stability. UniProtKB accession numbers for the sequences are: CACFD1, Q9UGQ2; RREB1, Q92766; LRRC34, Q8IZ02; ETFA, P13804; CPVL, Q9H3G5; INCENP, Q9NQS7; FARP2, O94887; and TYR, P14679. The output included the predicted direction of the change in protein stability (increase/decrease), the reliability index (RI), and the predicted Gibbs free energy change (ΔΔG or DDG). The DDG value (kcal/mol) is computed as the unfolding Gibbs free energy value of the mutant protein minus the unfolding Gibbs free energy value of the native protein. The RI ranges from 0 to 10, where 10 is the highest reliability 107 . The free energy change values were categorized into three classes: (i) DDG < −0.5 kcal/mol as destabilizing mutations; (ii) DDG > 0.5 kcal/mol as stabilizing mutations; (iii) −0.5 ≤ DDG ≤ 0.5 kcal/mol as neutral mutations 108 .
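The three DDG classes can be expressed as a small Python sketch. The function names are hypothetical, and the code only post-processes values such as those returned by I-Mutant; it does not reproduce the predictor. The reliability cut-off (RI of at least 5) is taken from the Results, where nsSNPs with RI below five were excluded.

```python
def classify_ddg(ddg, threshold=0.5):
    """Classify a predicted free energy change (kcal/mol) into the three classes."""
    if ddg < -threshold:
        return "destabilizing"
    if ddg > threshold:
        return "stabilizing"
    return "neutral"  # -0.5 <= DDG <= 0.5, boundaries included


def is_reliable(ri, min_ri=5):
    """Keep predictions whose reliability index (0-10) is at least min_ri."""
    return ri >= min_ri


# The DDG values below are made up purely to exercise the three branches.
for ddg in (-1.2, 0.1, 0.8):
    print(ddg, classify_ddg(ddg))
```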
Three-dimensional (3D) protein modeling. The 3D structures of native and mutant (due to nsSNPs) proteins were constructed to explore the differences in structural stability between the native and mutant proteins. The iterative threading assembly refinement (I-TASSER) server is an integrated platform that provides automated protein structure and function prediction based on the sequence-to-structure-to-function framework 109 . It was employed to predict 3D models of the native protein structures and of the mutant protein structures carrying high-risk nsSNPs. It is one of the most advanced algorithms for building high-quality 3D protein models from amino acid sequences. I-TASSER generates a full-length model of a protein by excising continuous fragments from threading alignments and then reassembling them using replica-exchange Monte Carlo simulations. SPICKER clusters the low-temperature replicas (decoys) generated during the simulation, and the top five cluster centroids are selected for generating full atomic models. The accuracy of the predicted model is reflected in the confidence score (C-score). The C-score ranges between −5 and 2, and a greater C-score indicates higher confidence in the predicted model 109 . The best model for each query protein was selected according to C-score values. Default parameters were used for each of the protein structures. The amino acid sequences of the proteins to be modeled were prepared in FASTA format as input for the server to predict the native and mutant models. The predicted structures were loaded into PyMOL to visualize their molecular structures as high-quality 3D images.
The quality of all predicted protein structures was then validated using ERRAT (https://servicesn.mbi.ucla.edu/ERRAT/) 110 and the Ramachandran plot (https://zlab.umassmed.edu/bu/rama/) 111 . The ERRAT program analyzes the statistics of noncovalent interactions between three types of atoms, which are carbon (C), nitrogen (N), and oxygen (O). Consequently, six types of interactions are possible (CC, CN, CO, NN, NO, and OO). The Ramachandran plot illustrates the statistical distribution of the combinations of the backbone dihedral angles ϕ and ψ in protein structures. The number of residues in the allowed or disallowed regions of the Ramachandran plot determines the quality of the model. Template modeling align (TM-align) was used to compare the predicted native and mutant protein models. Its algorithm identifies the best structural alignment between a protein pair based on the combination of the template modeling score (TM-score), the root-mean-square deviation (RMSD), and the superposition of the structures 69 . TM-scores range from 0 to 1, where 1 represents a perfect match between two protein structures. In contrast, a higher RMSD value represents a greater difference between the native and mutant structures.
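To make the two comparison measures concrete, the Python sketch below evaluates RMSD and a TM-score for two sets of Cα coordinates that are assumed to be already superimposed and paired residue by residue. It is not TM-align: the real program also searches for the optimal alignment and superposition, whereas this sketch scores one given superposition only. The function names are hypothetical, the distance scale d0 follows the standard TM-score definition, and the lower bound of 0.5 Å on d0 is a guard added for this sketch.

```python
import math


def _distances(coords_a, coords_b):
    """Per-residue Euclidean distances between two paired coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    return [math.dist(a, b) for a, b in zip(coords_a, coords_b)]


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation over the paired atoms (same unit as input)."""
    dists = _distances(coords_a, coords_b)
    return math.sqrt(sum(d * d for d in dists) / len(dists))


def tm_score(coords_a, coords_b, target_length):
    """TM-score of one fixed superposition, normalized by target_length."""
    if target_length <= 15:
        raise ValueError("the d0 formula is only defined for lengths above 15")
    d0 = max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)
    dists = _distances(coords_a, coords_b)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in dists) / target_length


# Toy example: a straight 100-residue trace and a copy shifted by 3 angstroms.
native = [(3.8 * i, 0.0, 0.0) for i in range(100)]
shifted = [(x + 3.0, y, z) for x, y, z in native]
print(rmsd(native, shifted), tm_score(native, shifted, target_length=100))
```

Identical structures give an RMSD of 0 and a TM-score of 1, and the TM-score falls towards 0 as the paired residues move apart, which matches the interpretation given above.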
Identification of functional and structural properties. MutPred v1.2 and HOPE were used to identify the functional and structural properties of the selected nsSNPs. MutPred is a web application that classifies an amino acid substitution as disease-associated or neutral in humans (http://mutpred.mutdb.org/). This tool also helps in predicting the molecular cause of disease for a deleterious amino acid substitution 112 . It considers a wide range of structural and functional properties, including secondary structure, signal peptide and transmembrane topology, catalytic activity, macromolecular binding, PTMs, metal binding, and allostery 106 . Protein sequences (FASTA format) of the identified genetic variants and their amino acid substitutions were submitted. MutPred v1.2 generated output scores indicating the probability that an amino acid substitution is deleterious or disease-associated. The top five features, with the P values of their impact on the functional and structural properties, were recorded. The predicted scores were classified into three hypotheses: (i) g > 0.5 and p < 0.05 as actionable hypotheses; (ii) g > 0.75 and p < 0.05 as confident hypotheses; (iii) g > 0.75 and p < 0.01 as very confident hypotheses.
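Because the three hypothesis classes are nested, a classifier has to test the strictest condition first. The Python sketch below shows one way to apply the stated thresholds to a general score g and a property P value p. The function name is hypothetical, and the example inputs are invented for illustration rather than taken from the study's results.

```python
def mutpred_hypothesis(g, p):
    """Return the strongest hypothesis class met by (g, p), or None if none is met."""
    if g > 0.75 and p < 0.01:
        return "very confident"
    if g > 0.75 and p < 0.05:
        return "confident"
    if g > 0.5 and p < 0.05:
        return "actionable"
    return None


# Invented (g, p) pairs covering each branch.
for g, p in [(0.80, 0.005), (0.80, 0.03), (0.60, 0.03), (0.60, 0.20)]:
    print(g, p, mutpred_hypothesis(g, p))
```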
HOPE is a web service that was used to identify the structural effects of a point mutation on a human protein sequence (www.cmbi.ru.nl/hope) 113 . The protein sequences of the selected nsSNPs were submitted as input. HOPE generated results based on information collected and combined from several web services and databases. Initially, the algorithm ran BLAST against PDB and UniProt to obtain details on the tertiary structure and build a homology model. This was followed by the prediction of the protein features using the Distributed Annotation System 114 .
ModPred (http://www.modpred.org/) 105 is a web server that was used to predict post-translational modification (PTM) sites in proteins based on sequence-based features, physicochemical properties, and evolutionary features. ModPred uses a total of 34 ensembles of logistic regression models for 23 different PTMs to simultaneously predict and analyze multiple types of PTM sites and thus obtain information on the functional and structural impacts of multiple PTM-based protein regulatory mechanisms. The 34 ensembles of logistic regression models were trained independently for the 23 PTMs on a total collection of 126,036 experimentally tested nonredundant protein sites extracted from various public databases such as SwissProt, HPRD, PDB, Phospho.ELM, PhosphoSitePlus, and PHOSIDA, and from the literature 105 . The PTM sites were predicted with low, medium, or high confidence. Sites with low confidence have scores of at least 0.5. In contrast, the score cut-offs for medium and high confidence differ between modifications, as they are based on the sensitivity and specificity estimates for each modification model as given by ModPred.
Prediction of protein-protein interactions. STRING is a database and web resource dedicated to protein-protein interaction networks, including direct (physical) and indirect (functional) interactions 115 . The database contains data from genomic context, experimental repositories, co-expression, and collections of public text 116 . The information available in the database allows us to identify and further understand the experimental and/or theoretical interactions of TYR, FARP2, and LRRC34 in this study.  117 were used as receptors for LRRC34, FARP2, and TYR, respectively. The peptide sequences from the native and mutant FARP2, LRRC34, and TYR protein structures were used as the ligands for the docking procedure. Peptide sequences of at least nine amino acid residues from each of the native and mutant FARP2, LRRC34, and TYR proteins were converted into Simplified Molecular-Input Line-Entry System (SMILES) strings using the online tool PepSMI (https://www.novoprolabs.com/tools/convert-peptide-to-smiles-string). The peptide sequences used for the analysis were SGIQQLCDAL, FQGTTKINT, and FEQWLRRHR from the native LRRC34, FARP2, and TYR proteins and SGIQQICDAL, FQGTNKINT, and FEQWLQRHR from the mutant LRRC34, FARP2, and TYR proteins, respectively. The three-dimensional structure of each ligand was then generated with the Build Structure tool in UCSF Chimera 1.15 using the SMILES string as input. Target proteins and ligands were optimized using the Dock Prep tool of UCSF Chimera 1.15 118 with default parameters before the docking analysis. These steps include removing solvents, adding hydrogens, and determining the charge. We maximized the grid box size along the X, Y, and Z axes to define the binding sites for docking.
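A quick consistency check on the native and mutant peptide pairs listed above can be sketched in Python as follows. This is not part of the docking workflow; it simply confirms that each pair differs by a single residue. The function name is hypothetical, positions are counted within the peptide rather than within the full protein, and the FARP2 native peptide is written here without the line-break hyphen.

```python
def substitutions(native, mutant):
    """List (position, native_residue, mutant_residue) for each mismatch, 1-based."""
    if len(native) != len(mutant):
        raise ValueError("peptides must have the same length")
    return [
        (pos, n, m)
        for pos, (n, m) in enumerate(zip(native, mutant), start=1)
        if n != m
    ]


PEPTIDES = {
    "LRRC34": ("SGIQQLCDAL", "SGIQQICDAL"),
    "FARP2": ("FQGTTKINT", "FQGTNKINT"),
    "TYR": ("FEQWLRRHR", "FEQWLQRHR"),
}

for protein, (native, mutant) in PEPTIDES.items():
    print(protein, substitutions(native, mutant))
```

Each pair differs at exactly one position (L to I, T to N, and R to Q), consistent with the L286I, T260N, and R402Q substitutions discussed in the Results.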

Results
Standard dataset. The dataset contains a total of 100 nsSNPs, of which 50 were reported as pathogenic and 50 as benign (Table S1). The parameters investigated were compared and are presented in Table 2. The sensitivity, specificity, and accuracy of the prediction of the clinical significance of the nsSNPs were calculated for four (4) models (Models A, B, C, and D). Model A represents at least one tool that predicted nsSNPs as deleterious or benign, and it showed the highest sensitivity (100%), followed by Model B (92%), Model C (90%), and Model D (84%). For specificity and accuracy, Model D showed the highest percentages (specificity 94% and accuracy 89%), followed by Model C (specificity 80% and accuracy 85%), Model B (specificity 64% and accuracy 78%), and Model A (specificity 50% and accuracy 75%). Further analyses were conducted by combining the five functional effect tools with the conservation and stability analyses (Models A3, B3, C3, and D3). These models showed lower sensitivity than Models A, B, C, and D. Interestingly, Model D3 showed the highest specificity (96%) of all the models (Models A, B, C, D, A3, B3, and C3). Models A3 and B3 showed an accuracy of 88%, compared with 89% for Model D and 85% for Model C.
Cancer-related nsSNPs for whole-genome sequences of Orang Asli and Malays. All of the identified SNPs were searched against the SNP dataset retrieved from GWAS. Out of 80 nsSNPs associated with cancers in the dataset, a total of 52 nsSNPs were found among the Orang Asli and Malays (43 in the Orang Asli and 43 in the Malays), as presented in Table 3. Thus, we selected all 52 identified nsSNPs associated with cancer risks among the Orang Asli and Malays for further investigation.
In this study, we shortlisted from these 52 nsSNPs those with significant scores in at least four of the five algorithmic tools used: a score of < 0.05 in SIFT, > 0.9 in PolyPhen-2, < −2.5 in PROVEAN, 1.0 in Condel, and > 450 million years in PANTHER. Therefore, only the most deleterious nsSNPs were studied. Based on the scores, 6 out of 43 nsSNPs in the Orang Asli and 6 out of 43 nsSNPs in the Malays were shortlisted. Interestingly, four nsSNPs were found in both populations (Table 3). As a result, the analysis identified eight deleterious amino acid substitutions corresponding to the high-risk nsSNPs associated with cancers (Table 3). The nsSNPs classified as high risk are rs3124765, rs9379084, rs10936600, rs1801591, rs117744081, rs2277283, rs757978, and rs1126809. They are located in different genes, which are CACFD1, RREB1, LRRC34, ETFA, CPVL, INCENP, FARP2, and TYR, respectively. According to the GWAS database, the eight (8) nsSNPs were associated with the risk of specific cancers, as shown in Table 3. Thus, these eight (8) nsSNPs were further investigated.
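The shortlisting rule described above (at least four of the five tools calling a variant damaging, using the stated cut-offs) can be sketched in Python as follows. The dictionary and function names are hypothetical, the example scores are invented, and treating a Condel score of 1.0 as "greater than or equal to 1.0" is an assumption made to avoid an exact floating-point comparison.

```python
# Cut-offs as stated in the text; PANTHER is expressed in millions of years.
CUTOFFS = {
    "SIFT": lambda s: s < 0.05,
    "PolyPhen-2": lambda s: s > 0.9,
    "PROVEAN": lambda s: s < -2.5,
    "Condel": lambda s: s >= 1.0,  # assumption: 1.0 is the maximum score
    "PANTHER": lambda s: s > 450,
}


def damaging_votes(scores):
    """Count the tools whose score passes its damaging cut-off."""
    return sum(1 for tool, rule in CUTOFFS.items() if tool in scores and rule(scores[tool]))


def is_high_risk(scores, min_votes=4):
    """Apply the 'at least four of five tools' shortlisting rule."""
    return damaging_votes(scores) >= min_votes


# Invented scores for illustration only.
example = {"SIFT": 0.01, "PolyPhen-2": 0.99, "PROVEAN": -3.0, "Condel": 1.0, "PANTHER": 300}
print(damaging_votes(example), is_high_risk(example))
```

A tool that returns no score simply contributes no vote in this sketch; how missing predictions were handled in the study is not stated.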
Conservation profile of high-risk nsSNPs. ConSurf was further used to investigate the potential impact of the most deleterious nsSNPs. It was used to measure the degree of evolutionary conservation of each amino acid residue of the protein. It identifies amino acid positions likely to have functional and structural importance by combining evolutionary conservation data with solvent accessibility predictions 47 . In this study, all residues of each protein analyzed with ConSurf were assigned conservation grades ranging from 1 to 9. However, we concentrated only on the residues that mapped to the locations of the eight (8) high-risk nsSNPs that we had identified. The server predicted D1171N, I58M, L286I, T171I, Y168H, M506T, R402Q, and T260N to lie at highly conserved positions (Table 4), indicating their functional and structural importance. The findings further suggest that these eight (8) high-risk nsSNPs are likely deleterious to the protein functions and structures.
Predicted stability modification. We predicted the stability modifications due to the nsSNPs in the CPVL, FARP2, CACFD1, RREB1, LRRC34, ETFA, TYR, and INCENP proteins using I-Mutant. The eight (8) nsSNPs found to be associated with cancers were submitted to the I-Mutant 3.0 server to predict the changes in stability in terms of their free energy change value (ΔΔG) and reliability index (RI). Based on the ΔΔG values, all of these nsSNPs decreased the stability of the respective proteins (Table 5). However, we excluded two of them, rs1801591 (RI = 0) and rs117744081 (RI = 4), from the analysis as their RI was below five (< 5). A higher RI value indicates a more reliable stability prediction 48 . Thus, the other six nsSNPs (rs3124765, rs9379084, rs10936600, rs2277283, rs757978, and rs1126809) were further analyzed.
Homology modeling of protein. The three-dimensional (3D) structures of six native and mutant proteins were predicted by I-TASSER. To generate the mutant models, all six sequences were submitted to I-TASSER, with each nsSNP substituted into the native sequence. UniProtKB accession numbers for the native sequences used are LRRC34, Q8IZ02; FARP2, O94887; and TYR, P14679. The top 10 template structures available in the PDB that are structurally closest to the query protein sequence were used to model the native and mutant proteins of LRRC34, FARP2, and TYR using I-TASSER. Among the six predicted models for each query protein (LRRC34, TYR, FARP2), the best model was selected based on the highest confidence score (C-score), as shown in Table S3 (Fig. 2). Compared to the native structures of the LRRC34, FARP2, and TYR proteins, their mutant structures have more helices, as presented in Table 6. The numbers of beta-sheets also differed between the native and mutant proteins. The native protein structures of LRRC34 and TYR have more beta-sheets than their mutants. In contrast, the native protein structure of FARP2 has three fewer beta-sheets than its mutant. There are three and two more buried residues in the native LRRC34 (432) and FARP2 (1007) proteins than in their mutants, respectively. However, the native TYR (509) has fewer buried residues than its mutant protein.
The TM-score and RMSD value of each mutant model were calculated using TM-align. The TM-score measures the topological similarity between the native and mutant models, whereas the RMSD measures the average distance between the α-carbon backbones of the native and mutant models. The mutant model with the highest TM-score is T171I (0.975), followed by R402Q (0.938), L286I (0.934), T260N (0.929), and Y168H (0.909). These high TM-scores indicate that the mutant models retain the same fold as the native models without being identical to them. In addition, the RMSD values shown in Table 5 confirm that these mutant models differ from the native models. The I58M, D1171N, and M506T models have very low TM-scores of 0.346, 0.319, and 0.262, respectively, which correspond to randomly chosen unrelated proteins 49 (Table S3). Ramachandran plots for the native and mutant LRRC34, FARP2, and TYR protein models showed that 87.74%, 71.75%, 85.00%, 87.50%, 69.16%, and 85.00% of the residues were located in the allowed regions, and only a few amino acids deviated (Table S3).
The three selected mutant protein models were then superimposed on the native protein models to show the locations of the observed mutations (Fig. 2). The details of the selected native and mutant protein models, including the protein templates used to predict the structures and the C-scores, are provided in Table S3.
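The TM-score screening described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the TM-scores are the values quoted in the text, the 0.5 cutoff is the same-fold threshold used in the Discussion, and the function names are hypothetical. The RMSD criterion is not applied here because the RMSD values (Table 5) are not reproduced in this section.

```python
# TM-scores of each mutant model against its native model, as quoted in the text.
TM_SCORES = {
    "T171I": 0.975,
    "R402Q": 0.938,
    "L286I": 0.934,
    "T260N": 0.929,
    "Y168H": 0.909,
    "I58M": 0.346,
    "D1171N": 0.319,
    "M506T": 0.262,
}


def same_fold_candidates(tm_scores, cutoff=0.5):
    """Return mutants whose TM-score exceeds the cutoff, highest score first."""
    kept = [mutant for mutant, score in tm_scores.items() if score > cutoff]
    return sorted(kept, key=lambda mutant: tm_scores[mutant], reverse=True)


def low_similarity_models(tm_scores, cutoff=0.5):
    """Return mutants at or below the cutoff, in alphabetical order."""
    return sorted(mutant for mutant, score in tm_scores.items() if score <= cutoff)


if __name__ == "__main__":
    print("TM-score > 0.5:", same_fold_candidates(TM_SCORES))
    print("TM-score <= 0.5:", low_similarity_models(TM_SCORES))
```

Note that T171I and Y168H pass this TM-score cutoff but belong to the two nsSNPs already excluded at the I-Mutant step (RI < 5), which leaves L286I, T260N, and R402Q.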

Functional and structural modifications of genetic variants. Three (3) nsSNPs were shortlisted and submitted to the MutPred2 server. MutPred2 predicts structural and functional modifications of proteins, including altered ordered or disordered interfaces, altered transmembrane regions, altered metal binding, altered DNA binding, loss of an allosteric site, and gain of an allosteric site. Based on Table 7, the R402Q mutation showed the highest probability score (0.78), followed by the T260N mutation (0.73) and the L286I mutation (0.55). An amino acid substitution is predicted to be pathogenic if its probability score is 0.50 or above.
HOPE was further used to explore the structural effects of these three amino acid substitutions. It showed that the residues L286, T260, and R402 are highly conserved. Based on Fig. 3, the L286I mutation is buried in the core domain, whereas R402Q replaces the native residue with a smaller amino acid and T260N replaces the native residue with a bigger amino acid. In addition, the R402Q and T260N substitutions changed the net charge of the TYR protein and the hydrophobicity of the FARP2 protein, respectively.
ModPred was used to predict possible post-translational modification (PTM) sites and to investigate the effects of PTMs on the three amino acid substitutions L286I, T260N, and R402Q in the LRRC34, FARP2, and TYR proteins, respectively. PTMs play a crucial role in regulating many biological processes, such as protein-protein interaction networks, protein stability, and enzymatic activity. ModPred predicted proteolytic cleavage sites at the substituted amino acids L286I, T260N, and R402Q in the LRRC34, FARP2, and TYR proteins, respectively (Table 7). Proteolytic cleavage is a PTM that can activate or inactivate a protein, completely change its structure, or release new N or C termini with growth factor activity from a parent extracellular matrix molecule, and it regulates a vast range of biological processes. These include DNA replication, cell proliferation, cell cycle progression, and cell death, as well as disease-related processes such as arthritis, cancer, cardiovascular disease, and inflammation. This makes the ModPred prediction (Table 7) notable.
Table 6. The top 10 templates used for homology modeling, and the alpha helix, beta sheet and exposed/buried residues used by I-TASSER.
Molecular docking analysis. AutoDock Vina and UCSF Chimera 1.15 were used to predict and evaluate a total of 10 protein binding sites, together with their hydrogen bond interactions and binding affinities. The interactions of the native and mutant LRRC34, FARP2, and TYR proteins were compared using docking results calculated at the same protein binding sites with grid boxes of identical dimensions. Thus, a binding site was predicted for each receptor-ligand docking. Molecular docking of SRC, DCT, and MYNN with the native and mutant FARP2, TYR, and LRRC34 modeled structures showed differences in the binding affinities (Table 8). The binding affinity of SRC with native FARP2 was − 8.2 kcal/mol, while that with the mutant was − 7.8 kcal/mol. The binding affinity of DCT with native TYR was − 8.1 kcal/mol, while that with the mutant was − 8.0 kcal/mol. The binding affinity of MYNN with native LRRC34 was − 5.4 kcal/mol, while that with the L286I mutant was − 5.2 kcal/mol. In addition, SRC, DCT, and MYNN were bound to the same binding pockets in the native and mutant FARP2, TYR, and LRRC34 proteins, respectively. Analysis of the binding poses showed that these three proteins (SRC, DCT, and MYNN) deviated significantly between the native and mutant protein complexes (Fig. 5). Moreover, interaction analysis of SRC, DCT, and MYNN with the native and mutant FARP2, TYR, and LRRC34 proteins showed a reduction in the number of hydrogen bonds with residues in the mutant proteins (Table 8). Five residues (Lys68, Tyr65, Leu5, Ser164, and Gln167) interact with SRC in native FARP2, but these interactions were absent in the mutant protein. Three residues, Lys152, Ser134, and Lys152, interact with DCT in native TYR, but these interactions were absent in the mutant protein. Two residues, Asn39 and Ala42, interact with MYNN in native LRRC34, but the Asn39 interaction was absent in the mutant protein.
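As a worked illustration of the affinity comparison above, the change in predicted binding affinity between each native and mutant complex can be computed directly from the reported values. This is a minimal sketch, not part of the original analysis: the dictionary and function names are hypothetical, and the LRRC34 mutant value is taken as − 5.2 kcal/mol on the assumption that its sign matches the other reported affinities.

```python
# (native, mutant) binding affinities in kcal/mol, keyed by (receptor, ligand).
# Values are those reported in the text; the LRRC34 mutant value is assumed
# to be negative, like every other affinity reported here.
AFFINITIES = {
    ("FARP2", "SRC"): (-8.2, -7.8),
    ("TYR", "DCT"): (-8.1, -8.0),
    ("LRRC34", "MYNN"): (-5.4, -5.2),
}


def affinity_shift(native, mutant):
    """Mutant minus native affinity; a positive value means weaker predicted binding."""
    return round(mutant - native, 1)


SHIFTS = {pair: affinity_shift(*values) for pair, values in AFFINITIES.items()}

if __name__ == "__main__":
    for (receptor, ligand), shift in SHIFTS.items():
        print(f"{ligand} with {receptor}: shift = {shift:+.1f} kcal/mol")
```

All three shifts are positive, meaning the mutant complexes are predicted to bind slightly more weakly. Because the shifts are small, they are best read together with the hydrogen-bond changes in Table 8 rather than on their own.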

Discussion
The exponential increase in the number of nsSNPs detected makes it impractical to investigate the biological significance of each nsSNP by wet laboratory experiments. Alternatively, in silico programs may be used to predict the effects of mutations and explain the underlying biological mechanisms. nsSNPs in the coding regions can lead to amino acid changes and alterations in protein function, and can account for susceptibility to disease. Distinguishing deleterious nsSNPs from tolerated nsSNPs is important in analyzing individual susceptibility to disease and understanding disease pathogenesis.
In this study, we developed a pipeline (Fig. 1) to identify the pathogenic nsSNPs associated with cancers. Although various computational tools are available to predict the deleterious or damaging effects of nsSNPs on protein structure and function, we used five different tools (SIFT, PolyPhen-2, Condel, PROVEAN, and PANTHER) to determine the functional effects of the nsSNPs, while Consurf was used to estimate the evolutionary conservation of the amino acid positions in each protein. I-Mutant 3.0 was used to predict the impact of the nsSNPs on the stability of the affected proteins. Among them, the SIFT algorithm is the most commonly used tool for SNP characterization to determine deleterious nsSNPs. This method computes a conservation score that provides insight into the impact of nsSNPs on the functional properties of proteins 25 . PolyPhen-2 is considered one of the most reliable tools for predicting the functional impact of nsSNPs based on protein sequence, phylogenetic information, and structural information 27 . Condel, on the other hand, integrates the scores from different methods (SIFT, PolyPhen-2, Mutation Assessor, FATHMM) to classify the nsSNPs. It provides insight into the impact of the mutation on the biological activities of the affected proteins 50 . The PROVEAN algorithm predicts the functional impact of an amino acid substitution on a protein sequence with comparable performance and accuracy. It uses alignment-based scores to measure the change in sequence variation correlated with the biological function of a protein 51 . Additionally, SIFT, PolyPhen-2, Condel, and PROVEAN are quick and easy to use and allow direct batch queries. Other tools include PANTHER, a powerful and unique method with a curated database of protein families, trees, subfamilies, functions, and evolutionary relationships.
It uses phylogenetic trees, multiple sequence alignments, and statistical techniques to evaluate the deleterious effects of nsSNPs, making it a viable platform for SNP characterization 52,53 . Consurf is another widely used tool that can pinpoint critically important sites (nsSNPs) within the functional regions. It is a statistically robust approach that estimates the evolutionary rates of amino acid substitutions and maps them onto the homologous sequences and/or structures 47 . The I-Mutant 3.0 tool measures the change in protein free energy caused by a specific mutation 54 . It helps to detect changes in the stability of the protein 3D conformation. The tools used in this study cover a wide range of prediction techniques (Table 1); combining the findings from each tool in the pipeline helps to identify the most deleterious nsSNPs more accurately. Specific targeted genotyping assays could be developed to detect the nsSNPs identified as impactful, which could then be investigated further in a local cohort of cancer patients. The predictions can also help scientists to focus their studies on understanding the impact of these nsSNPs by prioritizing the most deleterious ones.
The bioinformatics workflow developed was validated using the breast cancer dataset from ClinVar, which served as a standard dataset. The standard dataset has been annotated, and we believe it is the most appropriate dataset for functional effect prediction. It contained a total of 100 nsSNPs that were clinically associated with breast cancer (Table S1). The sensitivity, specificity, and accuracy of four models (Models A, B, C, and D) in predicting the clinical significance were determined.
Model D, in which at least four tools agreed in classifying an nsSNP as deleterious or benign, showed the highest specificity (94%) and accuracy (89%), followed by Model C (specificity 80%, accuracy 85%), Model B (specificity 64%, accuracy 78%), and Model A (specificity 50%, accuracy 75%). In contrast, Model A had the highest sensitivity (100%), followed by Model B (92%), Model C (90%), and Model D (84%). A higher sensitivity means that fewer potentially deleterious nsSNPs were missed, whereas a higher specificity means that fewer benign nsSNPs were wrongly flagged as deleterious. Thus, we concluded that Model D, which requires agreement of at least four out of five tools, had the best overall performance in predicting the most deleterious nsSNPs.
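The reported sensitivity, specificity, and accuracy follow the standard confusion-matrix definitions, which can be checked with a short Python sketch. The confusion counts below are hypothetical: they assume that the 100 ClinVar nsSNPs were split evenly into 50 pathogenic and 50 benign variants, a split that is not stated in this section but is consistent with every percentage reported for Models A to D.

```python
def percent_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) as whole percentages."""
    sensitivity = round(100 * tp / (tp + fn))   # pathogenic variants correctly flagged
    specificity = round(100 * tn / (tn + fp))   # benign variants correctly cleared
    accuracy = round(100 * (tp + tn) / (tp + fn + tn + fp))
    return sensitivity, specificity, accuracy


# Hypothetical counts, assuming 50 pathogenic and 50 benign nsSNPs (not stated
# in the text); chosen so that they reproduce the reported percentages.
MODEL_COUNTS = {
    "A": dict(tp=50, fn=0, tn=25, fp=25),
    "B": dict(tp=46, fn=4, tn=32, fp=18),
    "C": dict(tp=45, fn=5, tn=40, fp=10),
    "D": dict(tp=42, fn=8, tn=47, fp=3),
}

if __name__ == "__main__":
    for name, counts in MODEL_COUNTS.items():
        sens, spec, acc = percent_metrics(**counts)
        print(f"Model {name}: sensitivity {sens}%, specificity {spec}%, accuracy {acc}%")
```

Under this assumption, moving from Model A to Model D trades eight missed pathogenic variants for 22 fewer false positives, which is why accuracy rises even though sensitivity falls.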
Further analyses using the combination of five functional effect tools with conservation and stability tools showed that Model D3 had the highest specificity (96%), but the lowest sensitivity (76%) in identifying deleterious and benign nsSNPs. Despite not having the highest accuracy, Model D3 was able to classify both pathogenic and benign SNVs accurately (86%). The validated workflow is adequate with good sensitivity, specificity, and accuracy to classify the deleterious and neutral nsSNPs in ClinVar using a combination of SIFT, PolyPhen-2, Condel, PROVEAN, PANTHER, Consurf, and I-Mutant.
The GWAS database was used to identify nsSNPs associated with cancer risks as it is the most extensive SNP database 20 . We focused only on nsSNPs as they are capable of altering protein function, structure, conformation, and interaction, which can increase the risk of cancer 8–10,56–58 . Out of the 80 nsSNPs associated with cancer risks in the GWAS dataset, a total of 52 nsSNPs were identified among the Orang Asli and Malays (43 in Orang Asli and 43 in Malays). They were subjected to further analysis.
We then conducted the concordance analysis with the SIFT, PolyPhen-2, Condel, PROVEAN, PANTHER, Consurf, I-Mutant, ModPred, and MutPred tools to predict the most deleterious nsSNPs among the Orang Asli and Malays (Table 3). From the functional effect prediction analysis, a total of 8 out of the 52 cancer-associated nsSNPs from both populations were identified as the most deleterious nsSNPs by SIFT, PolyPhen-2, Condel, PROVEAN, and PANTHER (Table 3). The most deleterious nsSNPs were identified based on the criterion that at least four of the five tools gave a significant score, defined as < 0.05 in SIFT, > 0.9 in PolyPhen-2, < − 2.5 in PROVEAN, 1.0 in Condel, and > 450 million years in PANTHER. The identified nsSNPs were rs3124765 (CACFD1), rs9379084 (RREB1), rs10936600 (LRRC34), rs1801591 (ETFA), rs117744081 (CPVL), rs2277283 (INCENP), rs757978 (FARP2), and rs1126809 (TYR). Together, these five tools cover different algorithmic approaches: evolutionary conservation, protein function or structure, alignment, and measurement of similarity between variant sequences and protein sequence homologs. Hasan et al. 59 reported that combining the best individual tools, FATHMM, iFish, and Mutation Assessor, in one classifier called Meta (Combined Scores through J48 "CSTJ48") enhances the predictive power of these tools. However, no single classifier outperforms the others across all datasets in predicting pathogenicity. Additionally, the tools we used have proven performance in identifying deleterious nsSNPs 60,61 , which makes them useful for our study. Thus, the eight (8) nsSNPs identified were investigated further.
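The "at least four out of five tools" criterion can be expressed as a simple voting rule. The sketch below is illustrative only: the class and function names are hypothetical, the example scores are invented for demonstration and are not taken from Table 3, and the Condel criterion of 1.0 is implemented as a score of at least 1.0 on the assumption that Condel scores do not exceed 1.

```python
from dataclasses import dataclass


@dataclass
class ToolScores:
    """Scores for one nsSNP from the five functional-effect tools."""
    sift: float               # significant if < 0.05
    polyphen2: float          # significant if > 0.9
    provean: float            # significant if < -2.5
    condel: float             # significant if 1.0 (implemented as >= 1.0)
    panther_myr: float        # preservation time in million years; significant if > 450


def deleterious_votes(scores: ToolScores) -> int:
    """Count how many of the five tools give a significant (deleterious) score."""
    return sum([
        scores.sift < 0.05,
        scores.polyphen2 > 0.9,
        scores.provean < -2.5,
        scores.condel >= 1.0,
        scores.panther_myr > 450,
    ])


def is_most_deleterious(scores: ToolScores, min_votes: int = 4) -> bool:
    """Apply the concordance rule: at least `min_votes` of five tools must agree."""
    return deleterious_votes(scores) >= min_votes


if __name__ == "__main__":
    # Hypothetical scores for illustration only (not values from the study).
    example = ToolScores(sift=0.01, polyphen2=0.99, provean=-3.0,
                         condel=1.0, panther_myr=100)
    print(deleterious_votes(example), is_most_deleterious(example))
```

Keeping the vote count separate from the final decision makes it easy to rerun the same data under a stricter or looser threshold by changing `min_votes`.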
The Consurf server predicted that the positions of the eight (8) variations, D1171N, I58M, L286I, T171I, Y168H, M506T, R402Q, and T260N, were highly conserved (Table 4), which emphasizes their functional and structural importance. Evolutionary information is essential for understanding how mutations may affect human health 26 . The evolution of amino acids influences their properties, such as size, shape, hydrophobicity, and charge, at the molecular level 62 . For example, 53 missense mutations that cause cystic fibrosis were found within highly conserved positions. These regions were important for conserving the structural and functional integrity of the CFTR protein 63 . In addition, functional sites of proteins, such as DNA interaction sites, protein-protein interaction sites, and enzymatic sites, are essential for biological functions 64,65 . This suggests that nsSNPs found in these conserved regions have stronger deleterious effects than nsSNPs at non-conserved positions and may significantly affect biological functions 66 . The findings further indicated that these eight (8) high-risk nsSNPs are likely to be deleterious to the protein functions and structures.
I-Mutant predicts the protein stability of mutants based on the free energy change value (ΔΔG) and reliability index (RI). I-Mutant reliably predicted (RI ≥ 5) that 6 out of the 8 variants (rs3124765, rs9379084, rs10936600, rs2277283, rs757978, and rs1126809) decrease stability. Protein stability is important for the structural and functional behavior of proteins 67 . Loss of stability affects the conformational structure of a protein through misfolding, aggregation, and degradation, and thus determines its function 67,68 . From the results, we believe that the six variants might affect protein function by altering protein stability.
For structural analysis, the six native and mutant protein structures (CACFD1, RREB1, LRRC34, INCENP, FARP2, and TYR) were generated using I-TASSER as no close homologous templates were available. I-TASSER generates full-length models by an iterative structural fragment reassembly method, which consistently drives the threading alignment toward the native state. The models were then verified by ERRAT and the Ramachandran Plot Server, which supported the stability, reliability, and consistency of the tertiary structures of the proteins. The three-dimensional structures of the native and mutant proteins predicted by I-TASSER revealed the structural changes resulting from the amino acid substitutions (Fig. 2). Furthermore, the changes predicted by homology modeling between the native and mutant LRRC34, FARP2, and TYR proteins support the predicted pathogenicity of the deleterious substitutions.
TM-align was used to compare the predicted native and mutant protein structures based on the TM-score and RMSD value. In most cases, common protein structure modeling tools can construct realistic full-length models with an RMSD value of less than 6.5 Å if the alignment has a TM-score of more than 0.5 69 . Following the criteria of RMSD < 6.5 Å and TM-score > 0.5, the I58M, D1171N, and M506T models were excluded because their TM-scores correspond to those of randomly chosen unrelated proteins, meaning that those models had a different fold from the native protein 49 . Hence, we finally selected only three mutants, L286I (LRRC34), T260N (FARP2), and R402Q (TYR), which had a TM-score higher than 0.5 and generally assumed the same fold in SCOP/CATH (Table 5). Several studies have shown the importance of using various bioinformatics tools to determine the phenotypic changes and protein function associated with the structure-function relationship of various genes and proteins 70,71 . These studies may provide novel therapeutic markers for a variety of diseases. The three shortlisted nsSNPs were submitted to the MutPred2, HOPE, and ModPred tools to predict structural and functional modifications of the proteins. MutPred2 predicts such modifications, including altered ordered or disordered interfaces, altered transmembrane regions, altered metal binding, altered DNA binding, loss of an allosteric site, and gain of an allosteric site. HOPE was used to further explore the structural effects of these three amino acid substitutions. It showed that the residues L286, T260, and R402 are highly conserved, so substitutions at these positions are likely to damage the structures. Based on Fig. 3, the substitutions at L286, T260, and R402 caused changes to the LRRC34, FARP2, and TYR protein structures. Modifications of protein charge, mass, and hydrophobicity are known to affect the networks of protein-protein interactions 72,73 .
Thus, those modifications can alter the ability of proteins to interact with other proteins. Based on these predictions, we believe that these nsSNPs might cause functional and structural alterations of these proteins and be responsible for the increased risks of cancer. ModPred was then used to predict possible post-translational modification (PTM) sites to investigate the effects of PTMs further. ModPred predicted proteolytic cleavage sites at the substituted amino acids L286I, T260N, and R402Q in the LRRC34, FARP2, and TYR proteins, respectively (Table 7). Proteolytic cleavage is a PTM that can activate or inactivate a protein, completely change its structure, or release new N or C termini with growth factor activity from a parent extracellular matrix molecule, and it regulates a vast range of biological processes. These include DNA replication, cell proliferation, cell cycle progression, and cell death, as well as disease-related processes such as arthritis, cancer, cardiovascular disease, and inflammation. This makes the ModPred prediction (Table 7) notable. Functional or structural changes in the TYR protein (rs1126809) have been associated with basal cell carcinoma or squamous cell carcinoma. The TYR gene encodes the enzyme tyrosinase, which catalyzes the conversion of tyrosine to dopachrome in melanin biosynthesis 74 . We believe that changes at the PTM site caused by the rs1126809 variant of tyrosinase might lead to dysregulation of melanin synthesis within the melanosomes. This could result in variation in skin pigmentation, which may lead to basal cell carcinoma or squamous cell carcinoma. As for the LRRC34 and FARP2 proteins, the ModPred scores for proteolytic cleavage were very low (Table 7). LRRC34 is a nucleolar protein that plays a role in the ribosome biogenesis of pluripotent stem cells.
Mutations in some of the related proteins or modifications of ribosome biogenesis may have severe implications for the organism, depending on the degree of the modification and the tissue involved 75 . Changes at the PTM site might alter the structure of the LRRC34 protein, which may lead to multiple myeloma. For example, impaired or modified ribosome synthesis due to mutations of ribosomal proteins has been reported in many cancers, such as chronic lymphocytic leukemia, colorectal cancers, and glioma 76 . FARP2 has been reported as a potential regulator of chronic lymphocytic leukemia pathogenesis that influences the activity of the protein encoded by the MYC gene. MYC is a proto-oncogene that produces a nuclear phosphoprotein with roles in cell cycle progression, apoptosis, and cell transformation. The mutation may disrupt MYC protein activity. Although the effect of modification at proteolytic cleavage sites on these proteins has not yet been published, numerous studies have shown that this alteration can significantly change protein function by modifying its position, stability, or inter-protein interactions 77 . Proteolytic cleavage at the modified residues may be necessary for some of the essential functions of the protein. In addition, by disrupting the proteins themselves, those nsSNPs could increase the damage caused by PTM impairment.
Protein-protein interaction network analysis showed the interactions of LRRC34, FARP2, and TYR with ten different proteins. This analysis is important in predicting the functionality of interacting genes or proteins and in understanding the functional relationships and evolutionary conservation of the interactions among the genes. In addition, our literature search showed that LRRC34, FARP2, and TYR interact with other proteins. LRRC34 interacts with two major nucleolar proteins, Nucleophosmin (NPM1) and Nucleolin (NCL), in the ribosome biogenesis of pluripotent stem cells 78 . The mutation in LRRC34 might affect ribosome biogenesis and lead to tumorigenesis. FARP2 interacts with PLXNA4, SEMA3A, and NRP1 in the Sema3A-Nrp1/PlxnA4 signaling pathway that controls dendritic morphogenesis 79 . The mutation in FARP2 might disrupt the formation of axonal and dendritic morphologies during neurodevelopment, which could ultimately increase the risks of cancers. TYR interacts with TH, MITF, and PAH in the melanogenesis pathway 80 . The nonsynonymous mutation in TYR might disrupt melanin synthesis, leading to tumorigenesis. Therefore, any changes in the function or structure of these proteins could have an impact on many disease pathways.
The structural analysis was performed using molecular docking. Docking aims to identify the correct poses of ligands in the binding pocket of a protein and to predict the affinity between the ligand and the protein, which may enhance or inhibit its biological function 81 .
The molecular docking analysis of SRC, DCT, and MYNN with the native and mutant FARP2, TYR, and LRRC34 modeled structures, respectively, showed a difference in binding affinity, a reduction in the number of hydrogen bonds with residues in the mutant proteins (Table 8), and a significant deviation between the native and mutant protein complexes (Fig. 5). The SRC proto-oncogene plays an essential role in the development, growth, progression, and metastasis of some human cancers, including those of the colon, breast, pancreas, and brain 82–85 . FARP2 has been identified as a guanine nucleotide exchange factor (GEF) for Rho GTPases that play regulatory roles in neuronal development, and several studies have revealed genetic alterations in Ras homologous RhoGEFs in several human cancers 86–88 . Thus, the deviation observed in the SRC molecule bound to the mutant FARP2 protein might disrupt the protein interaction, leading to cancers. A previous study reported that mutations of the melanogenic enzyme tyrosinase (TYR) result in hypopigmentation of the hair, skin, and eyes 74 . DCT is one of the related enzymes that catalyze different post-TYR reactions in melanin biosynthesis. TYR and DCT have also been proposed to interact with and stabilize each other in multi-enzyme complexes 80 . Thus, the deviation observed in the bound DCT molecule could reduce the catalytic efficiency of TYR. LRRC34 is a member of the leucine-rich repeat-containing protein family that has been implicated in the maintenance and regulation of pluripotency. The MYNN protein is a member of the BTB/POZ and zinc finger containing family involved in transcriptional regulation. It has also been shown to interact with a few other proteins, including LRRC34, which are part of the transcription factors that participate in DNA repair 89 . A study showed that disruption of LRRC34 protein function could result in reduced expression of some pluripotency genes.
Its altered expression impacts the pluripotency-regulating genes, and it interacts with other proteins known to be involved in ribosome biogenesis 78 . The molecular docking analysis further tested our hypothesis that the T260N, R402Q, and L286I mutants have deleterious effects on the FARP2, TYR, and LRRC34 proteins, respectively. The most prominent change in T260N, R402Q, and L286I was a significant loss of H-bond interactions with the binding pocket residues compared to the native protein. These H-bonds were disrupted when the native amino acid was replaced in the mutants, which altered the binding affinity. The change in the number of hydrogen bonds indicates a deleterious effect of the amino acid substitution, since an increase or decrease in hydrogen bonds relative to the native form could destabilize the protein and affect its functions 90–93 . As a result, a genetic mutation that alters the protein structure influences how the protein interacts with its ligands, potentially leading to a disease condition. This method has previously been used to discover functionally significant variants that may play a role in disease mechanisms 70,94,95 . The molecular docking analysis conducted in this study indicated that the T260N, R402Q, and L286I mutants could significantly affect the functional activity of the FARP2, TYR, and LRRC34 proteins, respectively.

Conclusion
With the advancement of genomics, predicting and preventing diseases that are preventable will bring a new facet to medical practice. We have illustrated that, with the availability of a local genome database, we can predict disease risks in our population using a validated bioinformatics pipeline and the established GWAS and ClinVar databases. The pipeline will help to strategize experimental research by prioritizing studies on the SNPs with predicted functional impact, as thousands to millions of SNPs with unknown functions are detected using whole-genome sequencing technologies. In this study, a bioinformatics pipeline was developed and validated to predict the effects of the nsSNPs rs1126809, rs757978, and rs10936600 on the functions and structures of the TYR, FARP2, and LRRC34 proteins, respectively. The analysis also provides insight into the deleterious effects of these nsSNPs on the protein structures.
These three (3) nsSNPs were predicted to confer high risks of multiple myeloma, chronic lymphocytic leukemia, and basal cell carcinoma or squamous cell carcinoma in the Orang Asli and Malay populations. The prediction pipeline developed in this study helps to reduce the number of extensive investigations and wet lab experiments required to explain the impacts of these nsSNPs on the structures and functions of these proteins. We intend to further analyze the risks conferred by these SNPs in cancer patients in the local population.
We believe that a similar approach could be used to develop and validate bioinformatics pipelines for annotating and predicting the functional effects of SNPs related to other diseases. This study also allows us to establish a database of predicted phenotypes based on the new SNPs identified in our population.