Machine learning prediction of antiviral-HPV protein interactions for anti-HPV pharmacotherapy

Persistent infection with high-risk types Human Papillomavirus could cause diseases including cervical cancers and oropharyngeal cancers. Nonetheless, so far there is no effective pharmacotherapy for treating the infection from high-risk HPV types, and hence it remains to be a severe threat to the health of female. Based on drug repositioning strategy, we trained and benchmarked multiple machine learning models so as to predict potential effective antiviral drugs for HPV infection in this work. Through optimizing models, measuring models’ predictive performance using 182 pairs of antiviral-target interaction dataset which were all approved by the United States Food and Drug Administration, and benchmarking different models’ predictive performance, we identified the optimized Support Vector Machine and K-Nearest Neighbor classifier with high precision score were the best two predictors (0.80 and 0.85 respectively) amongst classifiers of Support Vector Machine, Random forest, Adaboost, Naïve Bayes, K-Nearest Neighbors, and Logistic regression classifier. We applied these two predictors together and successfully predicted 57 pairs of antiviral-HPV protein interactions from 864 pairs of antiviral-HPV protein associations. Our work provided good drug candidates for anti-HPV drug discovery. So far as we know, we are the first one to conduct such HPV-oriented computational drug repositioning study.


Scientific Reports
| (2021) 11:24367 | https://doi.org/10.1038/s41598-021-03000-9 www.nature.com/scientificreports/ 3D structural conformations have been revealed through approaches of in silico simulation 9 and structural biology 10 . Other studies have tested, discussed, and reviewed the in vitro effects of existed drug, i.e., the Human Immunodeficiency Virus (HIV) protease inhibitor, on HPV proteins and cells infected with HPVs [11][12][13][14][15] . These reports targeting existed drugs for HPV treatments showed that, compared with de novo drug discovery, repositioning exited drugs is indeed the better and quicker strategy. Nonetheless, drug efficacies from above evidences were moderate and no further progress is seen in later stages, e.g., in clinical contexts. And hence, above research progresses are yet far from being able to identify drug candidates with good therapeutic and anti-HPV potential. Limitation of them could be due to two reasons. One is that inappropriate compound or drug candidates have been chosen for testing. The other reason could be the number of drug candidates to be tested is too small. Testing only limited numbers of compound or drug candidates surely restricts the probability of identifying those appropriate ones.
In order to meet the urgent needs for effective anti-HPV drug discovery, based on target-oriented drug repositioning strategy, we collected and analyzed 96 antiviral drugs to do the relatively large-scale in silico screening for 9 HPV-16 proteins, so as to computationally and effectively identify effectively antivirals with good potential for targeting HPV proteins. Briefly, in this work, we constructed, benchmarked, and selected machine learning predictive models (also known as predictors) to predict antivirals that could have potential interactions with HPV proteins. This is because drug-target interactions are vital prerequisite of molecular therapeutic mechanisms. Through benchmarking, we selected the high-precision K-Nearest Neighbor (KNN) 16 and Support vector machine (SVM) 17 predictors to detect those confidence interaction pairs of antiviral-HPV protein.
To the best of our knowledge, no prior study similar to our work has been done. Lots of researchers predicted targets of drugs, compound-protein interactions, or protein-protein interactions using machine learning or other computational methods [18][19][20][21] . However, so far as we know, no study has focused on studying relationships between antiviral drugs and HPV proteins.

Methods
Research question formulation. Theoretically, a therapeutic target and its drug molecule have interactive binding relation to each other. Therefore, trying to identify potential HPV protein targets of antivirals could be considered as a binary classification task, i.e., to predictively classify proteome of HPV into two classes of proteins. One class is HPV proteins which have potential interaction with drug molecules, and the other class is HPV proteins do not have potential interactions with drugs. Machine learning is state-of-art method to solve such binary classifications (Fig. 1). Considered that known antiviral drug-target interaction pairs were available, which could serve as the known-label validation dataset, we thus chose supervised (machine) learning methods for this study. Research framework of this study. Predicting antiviral drug-HPV protein interaction could be considered a binary classification task, and machine learning is a good method for such task. In this work, antiviral drug-target pairs' features were transformed into vectors for constructing machine learning predictors. Through benchmarking, the best predictors were selected to predict antiviral-HPV protein interactions.  25 and Therapeutic Targets 26 databases (As of 19th July 2020). Drug-target interaction pairs which contained the United States Food and Drug Administration (FDA)approved antiviral drugs were treated as the validation dataset for machine learning, because FDA-approved antivirals as the validation set can better reflect the real-world application value of our models. And the rest drug-target interaction pairs were treated as the training dataset for machine learning. In this work's machine learning classification task, an interaction pair of an antiviral drug and a protein was defined to be a positive instance, while negative instance indicated a non-interactive pair of antiviral and protein. In order to balance data ratio for binary machine learning classification task. We randomly generated non-interactive drug-target pairs so as to assure the 1:1 ratio of positive instances to negative instances for machine learning. In more details, we initially constructed a full graph of bipartite drug-target network, in which each antiviral was connected to all the target proteins in the network. Upon removing those known antiviral-target interaction pairs, we had those non-interactive drug-target pairs. And then, we randomly drew such non-interactive drug-target pair out without replacement (treated it as the negative instances for machine learning) until the ratio of positive to negative instance reached 1:1. Next, we integrated the proteome (9 proteins in total) of high-risk HPV-16 subtype and all the antiviral drugs to form drug-protein interaction prediction dataset. See Supplementary Table S1 for machine learning training dataset of antiviral drug-target interaction pairs, Supplementary Table S2 for drugtarget interaction pair dataset used in machine learning validation process, and Supplementary Table S3 for Uniprot's HPV-16 proteome, i.e., 9 proteins. Next, all antivirals' molecular structures were analyzed using ChemmineR 27 and 1024-dimension chemical fingerprint datasets were generated through R scripting 28 . All proteins were analyzed using ProtR 29 and 10,784 high dimension protein descriptor feature datasets were generated. As seen in Table 1. Descriptors used were protein structural and physicochemical properties. These descriptors have been widely used in studying protein-protein interactions and protein-ligand interactions in silico, and they worked well 29 .
All datasets were integrated, scaled and normalized using R computing environment 28 .
Machine learning and prediction. Briefly, the machine learning processes of this research work followed such order and general steps. Initially, the training dataset was loaded to different machine learning algorithms, and fivefold cross validation and grid searching were applied to training processes, so as to identify the best parameters of machine learning models with the best predictive power. Later, predictors with good performances were further applied to classify the validation dataset with known labels. Lastly, the verified best predictor was used to predict antiviral-HPV protein interaction pairs. Diverse sorts of supervised learning algorithms with different purposes exist. Amongst, the Support Vector Machine, Random Forest 30 , Logistic Regression 31 , etc., are classic algorithms for tackling the binary classification questions. 6 types of machine learning classifiers friendly for binary classification were chosen for building predictive models. The chosen predictors were Support Vector Machine, Random Forest, AdaBoost 32 , Logistic Regression, Naïve Bayes 33 and K-Nearest Neighbor classifier. Amongst, K-Nearest Neighbor classifier and Adaboost displayed good prediction performances on predicting miRNA-disease associations 34,35 . And Chen et al. developed a Random Forest-based model RFMDA which had good predictive power on multiple kinds of human complex diseases 36 . These studies support us to choose aforementioned predictors for this work.
With default parameters, 6 predictors went over simple checking through quick training and performance measurement. At this early stage, as expected, all predictors did not perform well. Subsequently, in order to identify better parameters for predictors, grid search fivefold cross validations and performance benchmarking were conducted. The predictive performance of 6 different predictors with better parameters were tested using known-label validation dataset. Upon checking performance of different predictors, we selected the optimized K-Nearest Neighbors classifier and SVM, which had the highest precision scores and were the most appropriate www.nature.com/scientificreports/ predictor to identify high confidence drug-protein interaction pairs from 864 pairs of antiviral-HPV protein associations.
Aforementioned data processing and machine learning computations were done via in-house scripts of Python 37 and R 28 . Libraries and modules used were Sci-kit learn 38 , Pandas 39 , Numpy 40,41 Scipy 42 , and also Bioconductor 43,44 and Biomart 45 .We acknowledge the authors and developers of these computational tools.
Specifically, parameter set of KNN that finally used for predicting antiviral-HPV protein interaction pair was that, the number of neighbors was set to 65, "weights" was set to "distance", and "leaf_size" was set to 60. And for SVM, gamma was set to 0.001, C (the regularization parameter) was set to 0.0002, and polynomial kernel with degree = 3 was used to predict antiviral-HPV protein interaction pair. The rest parameters remained default ones of the function of Python library Sci-kit learn 38 .

Results
Dataset overview. The antiviral drugs and their associated targets were retrieved and analyzed as described in method section. Table 2 provides a summary of our dataset. We had totally 61 antiviral drugs, which formed totally 284 antiviral-target interaction pairs with their targets. For the purpose of measuring machine learning predictors' performance, antiviral-target interactions were split into two classes where 102 pairs were used as the dataset for training or fitting machine learning predictors, and the rest 182 pairs were treated as dataset for validating the predictive performances of machine learning predictors. And we also compiled 9 proteins of HPV (its complete proteome) with 96 antiviral drugs to form 864 pairs of antiviral-HPV protein association pairs ( Table 2).
Performances of machine learning models. Initially, we chose 6 types of machine learning models and applied fivefold cross validation strategy to fit the antiviral-target interaction training dataset. A primary benchmarking of the predictive performance of 6 chosen predictors was as seen in Table 3.
Briefly, all predictors' predictive performances were less satisfying, as expected. SVM with default parameter (RBF kernel) performed the worst in all sorts of metrics among 6 predictors. AdaBoost classifier scored the best in terms of precision score but had the lowest recall score. F1-measure is the harmonic average of precision and recall. The highest F1-measure was found from the Random forest classifier, which was 0.63. While we also found other metrics of Random forest were not high. All its metrics were around 0.65 though the values were close to each other. The highest accuracy score and AUC (Area Under Curve of Receiver-Operating Characteristic Curve) of 6 predictors' were 0.66 and 0.68, respectively. And both of them were also found in Random forest's performance. Metrics of default parameters' KNN were all around 0.6, indicating its unsatisfying performances in fivefold cross validation, too. Similar to KNN, Naïve Bayes classifier did not perform well, and one common point of KNN and Naïve Bayes classifier was that, the value of their recall score was higher than those of other metrics (Table 3).
Next, we tuned parameters of predictors through grid searching fivefold cross validation, and tested how combination set of parameters affected predictors' predictive performances on known-label validation dataset. At the beginning, we focused on optimizing predictors for obtaining better values of comprehensive metrics, Table 2. Summary of antiviral-target and antiviral-HPV protein interaction dataset used in machine learning processing of this study. a Validation set consisted of U.S.FDA-approved antiviral drugs and these drugs' binding target proteins. b 9 proteins of HPV-16. c Ratio of positive instance to negative instance was 1:1. d Number of validation set was greater than that of training set because (1) more FDA-approved antivirals were desired for validating the real-world application value of our machine learning models; (2) generalization performance of machine learning models could be reflected using smaller training set but larger validation set.  www.nature.com/scientificreports/ such as the F1-measure, accuracy or AUC value. Despite a great number of times' trying, no high sores of aforementioned F1-measure, accuracy or AUC metric value was seen. Given that high precision score indicates the low number of predictive false positive instances, and high recall score indicates the low number of predictive false negative instances, we changed our strategy and decided to do high precision-oriented optimization. This was because the purpose of this work was to identify antivirals that interact with HPV proteins. To this end, using high-precision predictor, predictive positive instances could have lower false positive instances mixed inside. Therefore, in this work, we preferred precision metric over recall metric for selecting appropriate predictors to predict antiviral-HPV protein interactions (positive instances). Through benchmarking the performances of predictors, we found optimized SVM and KNN predictors had better precision scores than others. SVM's was 0.8 and the KNN classifier's was 0.85 (Table 4). We hence used them for prediction task and we chose the intersection of their prediction results as the final results.
Predicted antiviral-HPV protein interaction pairs. Upon selection of high-precision predictors, we applied them to predict the antiviral-HPV protein interactions. We selected two predictors' result intersection as the final prediction result, i.e., we only consider an antiviral-HPV protein association pair has potential interaction if both predictors predicted this pair to be position (interactive). As a result, within 864 antiviral-HPV protein association pairs, most antiviral-HPV protein pairs were predicted to be negative, i.e., the antiviral drug does not interact with the HPV protein. Only a small portion, i.e., 57 of antiviral-HPV protein pairs were predicted to have interaction. Prediction results were summarized in Table 5 in HPV protein-oriented form. Full prediction results could be found in Supplementary Table S3. Here we took the Docosanol as an example for analysis. The drug Docosanol was predicted to interact with HPV-16's protein E7 using our high-precision machine learning predictors. Docosanol is a U.S. FDA-approved antiviral drug targeting Envelope glycoprotein GP350 and GP340 of Epstein-Barr Virus (EBV, also known as Human Herpesvirus or HHV-4) and it is used to treat fever blisters, etc. Interestingly, through literature survey, a recently published clinical case report was found to claim that, the mixture usage of Docosanol, curcumin, and other drugs together treated HPV infection and vaginal warts of a patient well 46 . This could be evidence supporting our predictive result about Docosanol and HPV protein. HPV protein-oriented antiviral prediction results were summarized in Table 5 and brief description of the example antiviral drugs, protein targets of the antiviral drug and relevant therapeutic indications were also listed in Table 5.

Discussion
While our results are to be validated by in vitro assays, in this work, we constructed machine learning models, and predicted antiviral-HPV protein interactions so as to identify potential drug candidates targeting HPV proteins. The high-risk types of HPV are not limited to HPV-16. There are other types such as HPV-18. Indeed, we are not only able to apply the research framework of this study to predict the potential drug candidates for the proteome of other HPV subtypes, but also to other types of pathogenic and infectious microbes, as well. www.nature.com/scientificreports/ Reviewing this current study, we found several significant points that could help us do better preparation for further works. Initially, in this work, though we tried our best to collect more antiviral drugs, due to the availability of antiviral drugs, we had limited size of dataset for machine learning. This could be one of factors why we did not obtain predictors with high scores of F1-measure, accuracy, or AUC. Compared with antivirals, the amount of other types of drugs, e.g., cancer drugs or antibiotics, is higher. Thus, in future studies, we would consider using other types of drugs for repositioning purpose.
Also, the final predictors selected did not have high F1-measure, accuracy, or AUC. Because current machine learning processes are black box which is difficult to interpret. Alternatively, in this study, considered the tradeoff between precision and recall, we chose to select the intersected prediction results from two high-precision predictors in order to get higher confidence antiviral-HPV protein interactions. For future study, we would learn and try to apply the state-of-art explainable machine learning methods which may be interpretable. In such case, we may be able to find out reasons causing low performances and obtaining guidance for model optimizations and obtaining more powerful machine learning predictors. One more interesting idea for extending current work is to predict synergistic antiviral drug combinations for HPV infection pharmacotherapy. Similar to "cocktail" treatment for HIV infections and synergistic treatment for fungal infections, it is likely that synergistic drug combinations work for treating HPV infections, too. A good example to get insights from is NLLSS 47 , which is a well-performed algorithm for predicting antifungal synergistic drug combinations. Similarly, it is a computational and machine learning-based research work, and hence multiple points, such as its research ideas and methodology, could be referred to.

Conclusions
Inspired by the needs of anti-HPV drug discovery, drug repositioning and computational analytics, we designed this research project and constructed machine learning models to predict possible antiviral-HPV protein interactions so as to identify potential pharmacotherapy for HPV infection. As a result, we optimized the predictors and identified 57 antiviral-HPV protein interaction pairs.
To the best of our knowledge, we are the first pioneer to conduct this HPV-oriented computational antiviral repositioning study. No similar study has been found so far. Therefore, our work provides good insights to virologists, medicinal chemists, gynecologists, clinical microbiologists, etc., those who are interested in the treatment and therapy of HPV infections. Also, drug candidates pre-selected via computational analytic screening could have lower probability of ineffectiveness than those that did not go through computational analyses. It thus could save resources, and antivirals identified by us could be good candidates for further in vitro and in vivo tests. In such way, this work contributes to drug development for HPV infections. What is more, our predicted antiviral-HPV protein interaction pairs also offer insights for fundamental biomedical research on drug-protein interactions or molecular interaction mechanisms. The last but not the least, the research framework of this study, i.e., the machine learning-based compound-protein interaction prediction, could also be applied to primary drug repositioning or drug discovery for those diseases or infectious microbial pathogens lacking effective pharmacotherapy. E.g., the Noroviurs and COVID-19.

Data availability
Data of this study were included in the supplementary materials.