Exploration of the effect of sequence variations located inside the binding pocket of HIV-1 and HIV-2 proteases

HIV-2 protease (PR2) is naturally resistant to most FDA (Food and Drug Administration)-approved HIV-1 protease inhibitors (PIs), a major antiretroviral class. In this study, we compared the HIV-1 protease (PR1) and PR2 binding pockets extracted from structures complexed with 12 ligands. The comparison of PR1 and PR2 pocket properties showed that bound PR2 pockets were more hydrophobic, with more oxygen atoms and fewer nitrogen atoms, than PR1 pockets. The structural comparison of PR1 and PR2 pockets highlighted structural changes that were induced by their sequence variations and were consistent with these property changes. Specifically, substitutions at residues 31, 46, and 82 induced structural changes in their main-chain atoms that could affect PI binding in PR2. In addition, the modelling of PR1 mutant structures containing V32I and L76M substitutions revealed a cooperative mechanism leading to structural deformation of flap-residue 45 that could modify PR2 flexibility. Our results suggest that substitutions in the PR1 and PR2 pockets can modify PI binding and flap flexibility, which could underlie PR2 resistance against PIs. These results provide new insights into the structural changes induced by sequence variations between the PR1 and PR2 pockets, improving the understanding of the atomic mechanism of PR2 resistance to PIs.

In this study, PR1 and PR2 pockets were characterized using a set of 51 physicochemical and geometric descriptors [1].
Fourteen descriptors characterized the geometry of the pockets, and thirty-seven descriptors characterized their physicochemical properties.
Training of the RF model. An RF model was computed using the bound PR pockets in order to predict the pocket type, either PR1 or PR2. This was accomplished using their 42 physicochemical and geometric descriptor values and the randomForest package within R [3,4]. The ntree (i.e., the number of trees in the forest) and mtry (i.e., the number of variables randomly sampled as candidates at each split) values were fixed at 200 and 12, respectively, after an optimization step using a grid approach. Thus, each tree was grown by testing 12 randomly sampled pocket descriptors at each split and creating the nodes that best classified the training sample. The process was repeated to generate 200 trees, which made up the forest, named RF PR1-PR2.
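The two random steps involved in growing each tree can be illustrated with a minimal Python sketch. It is not the randomForest implementation used in this study; it only draws, for each of the 200 trees, a bootstrap sample of the pockets and a random subset of 12 descriptors as split candidates. The function name draw_tree_inputs, the fixed seed, bootstrap sampling with replacement, and the use of 24 pockets (the number of bound pockets reported in the Results) are illustrative assumptions. In an actual forest, the descriptor subset is redrawn at every split rather than once per tree.

```python
import random

def draw_tree_inputs(n_samples, n_descriptors, mtry, rng):
    """Draw the random inputs for one tree (hypothetical helper).

    Returns (bootstrap indices, out-of-bag indices, candidate descriptors
    for a single split).
    """
    # Random step 1: bootstrap sample of the pockets, drawn with replacement.
    bootstrap = [rng.randrange(n_samples) for _ in range(n_samples)]
    # Pockets never drawn form the out-of-bag (OOB) sample of this tree.
    oob = sorted(set(range(n_samples)) - set(bootstrap))
    # Random step 2: mtry descriptors sampled without replacement as
    # candidates for one split (a real forest redraws them at every split).
    candidates = sorted(rng.sample(range(n_descriptors), mtry))
    return bootstrap, oob, candidates

rng = random.Random(0)  # fixed seed, for reproducibility of the sketch only
N_POCKETS, N_DESCRIPTORS, NTREE, MTRY = 24, 42, 200, 12
forest_inputs = [
    draw_tree_inputs(N_POCKETS, N_DESCRIPTORS, MTRY, rng) for _ in range(NTREE)
]
```

The pockets left out of a given bootstrap draw are the ones on which that tree can be evaluated without having seen them during training.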
RF PR1-PR2 model performance. The performance of the RF model was quantified using the out-of-bag (OOB) error rate, which reports the proportion of samples that are classified incorrectly, on average across the trees, using the OOB samples. As the RF model is a machine-learning approach based on a double-random process, we tested the robustness of the RF PR1-PR2 model's performance. We computed the OOB errors of 500 RF models trained using the same data.
To test the robustness of the important descriptor selection, we analysed the most important descriptors in these 500 RF models. We selected the ten descriptors with the highest importance scores in each of the 500 models. Then, for each descriptor, we computed the fraction of models in which it was selected as one of the ten most important descriptors, named fMVI. The important descriptor selection is robust if most models yield the same important descriptors.
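The fMVI computation described above can be sketched as follows. This is a minimal illustration, assuming that the importance scores of each model are available as a dictionary mapping descriptor names to scores; the function name fmvi, the toy descriptor names d1 to d3, and the toy scores are hypothetical and are not values from this study.

```python
from collections import Counter

def fmvi(importances_per_model, top_k=10):
    """Fraction of models in which each descriptor ranks among the top_k."""
    counts = Counter()
    for importances in importances_per_model:
        # Rank descriptors of this model by decreasing importance score.
        ranked = sorted(importances, key=importances.get, reverse=True)
        counts.update(ranked[:top_k])
    n_models = len(importances_per_model)
    # Union of all descriptor names seen across the models.
    all_names = set().union(*importances_per_model)
    return {name: counts[name] / n_models for name in all_names}

# Toy example: three models, three hypothetical descriptors, top 2 kept.
toy_models = [
    {"d1": 0.9, "d2": 0.5, "d3": 0.1},
    {"d1": 0.8, "d2": 0.2, "d3": 0.4},
    {"d1": 0.7, "d2": 0.6, "d3": 0.3},
]
toy_fmvi = fmvi(toy_models, top_k=2)
```

In the toy example, d1 is retained in every model, so its fMVI is 1, whereas d2 and d3 are retained in two and one of the three models, respectively.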
In the final step, we analysed the significance of the selected important descriptors by computing the importance p-value for each descriptor [5]. We computed the importance score of each descriptor in the 500 RF permuted models. This led to a vector of 500 importance measures for every descriptor, which we called the null importances. We then computed a non-parametric estimate of the importance p-value, named the pvalue IMP, by determining the fraction of null importances that were more extreme than the importance observed in the RF PR1-PR2 model. A descriptor was selected as important for separating PR1 and PR2 pockets if it had an importance score in the RF PR1-PR2 model higher than 0.5, an fMVI value higher than 90%, and a pvalue IMP smaller than 0.05.
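The p-value estimate and the three selection criteria can be written compactly. The sketch below is a minimal illustration; it assumes that "more extreme" means greater than or equal to the observed importance (a one-sided test), which the text does not state explicitly. The function names and the toy values are hypothetical.

```python
def importance_p_value(observed, null_importances):
    """Fraction of null importances at least as large as the observed one.

    Assumption: "more extreme" is read as >= (one-sided).
    """
    extreme = sum(1 for value in null_importances if value >= observed)
    return extreme / len(null_importances)

def is_important(importance, fmvi_value, p_value,
                 min_importance=0.5, min_fmvi=0.90, alpha=0.05):
    """Apply the three thresholds stated in the text."""
    return (importance > min_importance
            and fmvi_value > min_fmvi
            and p_value < alpha)

# Toy null distribution with four values (the study used 500 per descriptor).
toy_null = [0.1, 0.2, 0.3, 0.6]
toy_p = importance_p_value(0.55, toy_null)  # one of four values is >= 0.55
```

With 500 null importances per descriptor, the smallest non-zero p-value that this estimate can produce is 1/500 = 0.002.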

Results
Training of RF model. To identify the most important descriptors for separating PR1 and PR2 bound pockets, we trained an RF model on the 24 bound pockets characterized by the 42 descriptors. The obtained RF model, named RF PR1-PR2, exhibited an OOB error rate of 0.04. We tested the robustness and the validity of the RF PR1-PR2 model by comparing its performance with that of the 500 RF models trained on the same data and of the RF permuted models (Table S2). We observed that the 500 RF models trained on the same data also performed very well, with OOB error rates ranging from 0 to 0.08 and an average error of 0.003 ± 0.011. In contrast, the RF permuted models performed poorly.
Among the 42 pocket descriptors involved in the RF PR1-PR2 model, eight had an importance score higher than 0.5: p_hydrophobic_residues, p_tiny residues, p_O_atom, p_oxygen_atom, hydrophobibity_kyte, p_nitrogen_atom, p_Nlys_atom, and p_hyd_atom (Figure RF-desc). We noted that these eight descriptors were selected among the ten most important descriptors in more than 95% of the 500 RF models trained on the same data (fMVI > 95%). This result indicates that the selection step of the important descriptors was robust. In addition, six of them had a significant p-value (pvalue IMP < 0.05) (Figure RF-desc). These results collectively show that these six descriptors are able to separate PR1 and PR2 pockets. They characterize the pocket hydrophobicity (p_hydrophobic_residues and hydrophobibity_kyte), the composition of certain atoms (oxygen [p_oxygen_atom] and nitrogen [p_nitrogen_atom and p_Nlys_atom]), and tiny residues (p_tiny residues).