VHLdb: A database of von Hippel-Lindau protein interactors and mutations

Mutations in the von Hippel-Lindau tumor suppressor protein (pVHL) predispose carriers to tumors affecting specific target organs, such as the retina, epididymis, adrenal glands, pancreas and kidneys. Currently, more than 400 pVHL interacting proteins are either described in the literature or predicted in public databases. These data are scattered among several different sources, which slows down the understanding of pVHL’s biological role. Here we present VHLdb, a novel database collecting available interaction and mutation data on pVHL to provide integrated annotations. In VHLdb, pVHL interactors are organized according to two annotation levels, manual and automatic. Mutation data are easily accessible and a novel visualization tool has been implemented. A user-friendly feedback function to improve database content through community-driven curation is also provided. VHLdb presently contains 478 interactors, of which 117 have been manually curated, and 1,074 mutations. This makes it the largest available database for pVHL-related information. VHLdb is available from URL: http://vhldb.bio.unipd.it/.

Scientific Reports | 6:31128 | DOI: 10.1038/srep31128
in p53 tumor suppressor regulation 27,28. Kidney-specific pVHL inactivation causes the development of kidney cysts in a mouse model 29, while reintroduction of a wild-type gene interrupts malignant progression 30. A large body of experimental and in silico data on proteins involved in pVHL tumorigenesis has been reported 9,13 and is contained in large databases such as IntAct 18, STRING 31 and BioGRID 32. pVHL is thought to have at least four different protein-protein interaction interfaces (A to D) 13. Several specific interactors have been found for each interface, and their association with functions other than oxygen sensing, such as DNA-damage repair 33, microtubule dynamics 34 and oxidative metabolism, reinforces the pivotal role of pVHL. As the amount of detail known about pVHL function is rapidly increasing, the multiple roles of pVHL may confound our understanding of this complex protein.
Knowledge on pVHL is usually derived from freely accessible protein sequence and function databases. Although valuable, these universal resources are generalist by design, so the large amount of pVHL data ends up strongly fragmented. For a non-bioinformatician, scattered information is one of the biggest hurdles, slowing down a holistic understanding of the biological role of pVHL. Here we present VHLdb, a novel resource providing expert curation for the pVHL tumor suppressor. The database was primarily designed to be effective for non-experts, making information retrieval easier. Overall, VHLdb accounts for 478 unique interactors at two curation levels (manual and automatic), with data retrieved from different sources. Detailed information on the pVHL interaction interfaces and post-translational modifications is also included. A feedback function allows experts in the field to contribute novel annotations on interactors or mutations. Finally, a download tool is provided for data sharing.

Database Description
Mutation data. Germline and somatic mutations have been collected from refs 35-38, integrated (refs 39-41) and annotated with predictions of protein stability. The final dataset is made up of 1,074 mutations and, to the best of our knowledge, represents the largest publicly available repository of pathogenic pVHL variants. An example of mutation details is given in Table 1 and Fig. 1. Where possible, a pVHL interacting surface has been defined for each mutation. For example, frameshift mutations cannot be assigned to any surface, since they alter the whole sequence downstream of the mutated position. Solvent accessibility has been computed for each mutation using DSSP 42, and mutated residues are defined as exposed when at least 20% of their surface is accessible to solvent. Bluues 43 and NeEMO 44 have been run on all possible mutations, using the pVHL 3D structure with PDB code 1LM8 as reference. Current pVHL 3D structures cover only the structured part of the protein (i.e. the alpha- and beta-domains), lacking the first 60 residues, which form an intrinsically disordered tail. Pathogenicity assessment for mutations in this segment was not included in VHLdb to avoid the risk of erroneous interpretation from low-confidence predictions. Bluues 43 calculates the electrostatic properties of a protein and is able to predict the electrostatic properties of mutated solvent-exposed residues. NeEMO 44 evaluates stability changes caused by amino acid substitutions using a structure-based machine learning approach. It has been run on all point mutations of the crystallized protein, i.e. again excluding only the N-terminus.
pVHL interactome. The pVHL interactome has been defined starting from searches in publicly available databases. VHLdb contains two levels of annotation for interactors, automatic and manual (Fig. 2). Automatic annotations are denoted by an empty silver star and make up the overall pVHL interactome, albeit at a lower confidence level. Manually curated pVHL interactors, represented with a gold star, have been annotated with the exact molecular details and their functional meaning.
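The 20% solvent-accessibility criterion used for the mutation data can be illustrated with a minimal sketch. It assumes that the DSSP accessibility of a residue is divided by a maximum accessibility value for its residue type; the text does not state which normalization scale was used, so the function names, the `max_asa` argument and the numbers below are hypothetical.

```python
def relative_accessibility(asa: float, max_asa: float) -> float:
    """Return the fraction of a residue's surface that is solvent accessible.

    asa     -- absolute accessibility of the residue (e.g. as reported by DSSP)
    max_asa -- assumed maximum accessibility for that residue type
    """
    if max_asa <= 0:
        raise ValueError("max_asa must be positive")
    return asa / max_asa


def is_exposed(asa: float, max_asa: float, threshold: float = 0.20) -> bool:
    """Apply the cutoff quoted in the text: exposed if at least 20% accessible."""
    return relative_accessibility(asa, max_asa) >= threshold


# Illustrative numbers only, not taken from VHLdb or from a DSSP run.
print(is_exposed(asa=45.0, max_asa=200.0))  # 45/200 = 0.225, above the cutoff
print(is_exposed(asa=10.0, max_asa=200.0))  # 10/200 = 0.05, below the cutoff
```

Keeping the threshold as a parameter makes it easy to check how sensitive the exposed/buried assignment is to the chosen cutoff.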
The automatic pVHL interaction network has been generated with queries to the STRING 31, BioGRID 45, iHOP 46, MIPS 47 and IMEx 48 databases. STRING and BioGRID are two of the most popular protein-protein interaction databases. The IMEx Consortium is a long-term coordination project which currently comprises twelve interaction databases. MIPS is a database of mammalian interacting proteins, while iHOP is a text-mining based resource parsing the PubMed database for possible statements on interactions of a target protein. Both MIPS and iHOP present their data in a human-readable format, without an associated confidence score. All interactions from IMEx and STRING are annotated with such a score, while BioGRID interactions are poorly annotated. When available, the score is reported in the interactor page so that the user can easily assess the interaction quality. The five resources have been queried through their standard user interfaces using the most general terms, i.e. "VHL" or "pVHL". In all cases, only human interaction data were considered. The results from the different sources have been merged and processed to remove duplicates. Annotation from UniProt 49, PDB 50, Gene Ontology 51, Pfam 52 and MobiDB 21 has been added. Searches in interaction databases allowed us to build the full network, currently containing 478 proteins.
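The merging and de-duplication step described above can be sketched as follows. This is a simplified illustration, not the pipeline actually used for VHLdb: it assumes that every hit has already been mapped to a common protein accession, and the record fields (`accession`, `source`, `score`) and the example values are hypothetical.

```python
from collections import defaultdict


def merge_interactors(hits):
    """Collapse per-source hits into one record per protein accession.

    Each hit is a dict with 'accession', 'source' and an optional 'score'
    (sources without a confidence score simply omit it or set it to None).
    """
    merged = defaultdict(lambda: {"sources": set(), "scores": {}})
    for hit in hits:
        entry = merged[hit["accession"]]
        entry["sources"].add(hit["source"])
        if hit.get("score") is not None:
            entry["scores"][hit["source"]] = hit["score"]
    return dict(merged)


# Toy input: accessions and scores are invented for illustration.
hits = [
    {"accession": "ACC0001", "source": "STRING", "score": 0.9},
    {"accession": "ACC0001", "source": "BioGRID", "score": None},
    {"accession": "ACC0002", "source": "iHOP"},
]
merged = merge_interactors(hits)
print(len(merged))                          # number of unique interactors
print(sorted(merged["ACC0001"]["sources"]))  # databases supporting ACC0001
```

Storing the score per source, rather than a single merged value, preserves the fact that only some databases provide a confidence measure.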

Figure 1. Example of a mutation as displayed in VHLdb. For each mutation all available details are listed (i.e. coding variant, effect on protein, type of mutation, pVHL surface involved, solvent accessibility, phenotype, thermodynamic predictions and reference) and the mutation is visualized as a red sphere on the surface-colored pVHL structure.
Manually curated pVHL interactions. The manually curated high quality pVHL interaction network is currently composed of 117 proteins. Of these, 35 come from a previous publication 13, while the others have been annotated and are presented in this work (see Table 2). Data curation was performed by each expert following an in-house standardized protocol to guarantee reproducibility and correctness. In detail, the manual curation workflow starts with a preliminary search in PubMed 53 and UniProt 49 using pVHL-related keywords (e.g. "VHL syndrome", "pVHL AND ccRCC") adapted to the interactor under investigation. Keywords were manually selected by curators from those most common in the VHL syndrome literature, e.g. angiogenesis, proteasome degradation, oxygen sensing. For proteins with several synonymous names (e.g. the EGLN protein family, also known as PHD), multiple searches were performed. The final nomenclature for each VHLdb entry was chosen using the official HGNC consortium name. Interaction details have been manually extracted from the literature. PubMed has been searched for papers describing both structural details of the interaction (e.g. pVHL and target protein residues, sequence motifs and domains) and their functional implications. An example of structural details of the interaction is given in Fig. 3. Upon identification, each interactor has been analyzed with ConSurf 54 to assess sequence conservation, as well as with PRISM 55 and Crescendo 56 to predict the spatial localization of the interaction at the residue level.
The presence of linear sequence motifs known to be relevant in protein-protein interactions, post-translational modification or enzymatic cleavage was assessed with ELM 57. The interaction surface was assigned following our classification 13, as summarized in Table 3.
Table 3. Distribution of VHLdb interactors and mutations by pVHL interacting surface. For each surface, start and end residues as well as the number of interactors and mutations are reported. The "upon modification" row indicates the number of proteins which bind the pVHL protein after it has been phosphorylated at some residue.
Implementation. VHLdb uses separate modules for data management, processing and presentation. Figure 4 shows a schematic representation of the whole application. All modules exchange data in the JSON (JavaScript Object Notation) format, which eliminates the need for data conversion and simplifies development and maintenance. The MongoDB database engine is used for storage and Node.js as middleware between data and presentation. VHLdb exposes its resources through a RESTful interface, using the Restify library for Node.js. At the time of writing, VHLdb supports a custom REST API, the search-route, as detailed in the Help page. The user interface is implemented using the Angular.js framework and the Bootstrap library. These libraries provide a mobile-ready interface, allowing VHLdb to be natively accessed from any kind of device. Structural annotations are displayed with the WebGL-based molecular viewer PV 58. Custom molecular views have been developed. An "interaction viewer" has been implemented in the entry page to display interaction data and a "mutation viewer" has been implemented in the mutations page. The former allows the user to visualize the pVHL residues interacting with a manually curated interacting protein by highlighting the interacting region on the pVHL structure. The latter displays the location of any mutation on the pVHL structure as a sphere, allowing the user to visually assess the structural location of a mutation. VHLdb allows direct download of all pVHL interactions, as well as mutations. The database offers both a graphical web interface and RESTful web services from the URL: http://VHLdb.bio.unipd.it/.

Results
Using VHLdb. VHLdb offers simple yet powerful ways to access its data. First, the navigation bar at the top of the home page allows the user to access the mutation or interaction page. The home page features a clickable map, redirecting the user to interface-specific pVHL interaction lists (Fig. 2). The mutation page lists all coding variants (sorted by codon) in a user-friendly searchable, filterable and downloadable table, together with the previously described mutation viewer. The interaction page features a graphical representation of the manually curated pVHL interaction network organized by interacting surface, and a sortable, searchable and filterable table, similar to the one for mutations, listing all protein-protein interactions. The third element of this page is a table showing Gene Ontology (GO) enrichment analysis results for each surface and GO tree. The interaction page also allows download of the complete pVHL interaction set in four different formats (JSON, XML, CSV and TAB separated). Details of any protein can be accessed from the interaction page. The protein entry page shows all available annotations for a particular pVHL interacting protein, including general annotations from UniProt, manually curated interaction details (if available), sequence annotation from Pfam and MobiDB, structure annotations from PDB, functional annotation from GO and references from PubMed. All these data can be downloaded for the specific protein in the formats listed above. A feedback form is accessible from the entry page and can be used to report inconsistencies or suggest annotations for a specific pVHL interacting protein. Another way to give feedback and request data submission is the contact page, accessible from the navigation bar, which features two distinct submission forms, one for general feedback and one for specific data submission requests. These messages are manually reviewed by our curators and, after validation (i.e. confirmation of the user-suggested literature), the proposed data will be added to VHLdb.
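To illustrate how a single set of records maps onto the four download formats mentioned above, the following sketch serializes a toy interactor list with the Python standard library. The field names and values are hypothetical and do not reflect the actual VHLdb schema or file layout.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Toy records; field names and values are invented for illustration.
records = [
    {"accession": "ACC0001", "surface": "A", "curated": True},
    {"accession": "ACC0002", "surface": "B", "curated": False},
]


def to_json(recs):
    """Serialize the records as a JSON array."""
    return json.dumps(recs, indent=2)


def to_delimited(recs, delimiter):
    """Serialize as CSV (delimiter=',') or TAB-separated (delimiter='\t')."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(recs[0]), delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(recs)
    return buf.getvalue()


def to_xml(recs):
    """Serialize as XML with one <interactor> element per record."""
    root = ET.Element("interactors")
    for rec in recs:
        node = ET.SubElement(root, "interactor")
        for key, value in rec.items():
            ET.SubElement(node, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


print(to_json(records))
print(to_delimited(records, ","))
print(to_delimited(records, "\t"))
print(to_xml(records))
```

Because every format is generated from the same in-memory records, the four outputs stay consistent with each other by construction.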
VHLdb statistics. VHLdb collects data on 478 pVHL interacting proteins and 1,074 pathogenic somatic or germline pVHL mutations. In total, 117 of the 478 pVHL interacting proteins were manually reviewed and constitute the core curated pVHL interaction network. The remaining proteins constitute the automated, low-confidence pVHL interaction network. For 62 proteins of the core set it was possible to identify the interacting surface (see Table 3). For 55 proteins it was possible to identify the pVHL residues involved in the interaction, and for 10 the residues of the interaction partner as well. For 51 proteins we also defined whether the interaction with pVHL is direct or not. Table 2 shows a more detailed listing of the manually curated VHLdb protein set. Statistical analysis shows that the interactor distribution differs among the four pVHL interfaces. Interface A presents 9 exclusive interactors, distributed between sub-interfaces A1 and A2, and is known to bind elongins B and C and cullin 2 to form the VCB complex 6. Interacting proteins in this region compete with elongins B and C, highlighting pVHL functions beyond the well-known HIF-1α degradation. We also found that 190 mutations affect this area, yielding three different VHL phenotypes. As an example of interactors, guanine nucleotide-binding protein subunit beta-2-like 1 and E2F transcription factor 1 (UniProt codes: P63244 and Q01094, respectively) are both known to promote cell cycle progression under different stimuli. A simple database search shows that the two proteins rely on the same interaction interface, suggesting a correlated role, at least for pVHL binding. Their interaction with the same pVHL surface suggests a pivotal pVHL role in controlling cell cycle progression under different stimuli and oxygen concentrations. Similar results were found for the remaining interaction interfaces.
In detail, 39 interactors were found for interface B, 6 for interface C and one interactor for interface D, for a total of 827 different mutations distributed among the interaction interfaces. Interface B is the HIF-1α binding region and is characterized by the largest number of interactors. As a further example, we found that proteins such as tubulin beta, collagen alpha-1(IV) and kinesin bind sub-interface B2, showing that molecular details of functions related to endothelial matrix regulation 15 should correspond to this specific interaction area.
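The proportions implied by the counts above can be made explicit with a short calculation. The sketch uses only numbers reported in the text and reads the 827 mutations as the total mapped across all interaction interfaces, which is an interpretation rather than a statement from the database.

```python
# Counts as reported in the text.
total_interactors = 478
curated_interactors = 117
surface_assigned = 62      # curated interactors with an identified surface
total_mutations = 1074
mapped_mutations = 827     # assumed to be the total across all interfaces

automatic_interactors = total_interactors - curated_interactors
curated_fraction = curated_interactors / total_interactors
surface_fraction = surface_assigned / curated_interactors
mapped_fraction = mapped_mutations / total_mutations

print(automatic_interactors)      # interactors with automatic annotation only
print(f"{curated_fraction:.1%}")  # share of interactors manually curated
print(f"{surface_fraction:.1%}")  # share of curated interactors with a surface
print(f"{mapped_fraction:.1%}")   # share of mutations mapped to an interface
```

Roughly a quarter of the interactors are therefore manually curated, about half of those have an assigned surface, and around three quarters of the mutations fall on an interaction interface.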

Conclusions
We have presented VHLdb, a novel database collecting curated information on pVHL interactors and mutation effects. It provides comprehensive information on pVHL interactors derived from different sources as a single structured resource. Detailed information about VHL disease is rapidly increasing, but it is scattered across different generalist resources and is not readily accessible to a non-expert user. We expect VHLdb to be useful both for experimentalists seeking to study pVHL biology in greater detail and for clinicians aiming to understand the effects of novel pVHL variants. An intuitive pVHL-oriented user interface was designed and four different output formats are provided to facilitate data retrieval. VHLdb is also effective for the qualitative study of pVHL pathogenic mutations and interacting proteins. From a total of 478 different interactors, 62 were mapped to the corresponding interaction interface. Moreover, 1,074 somatic and germline pathogenic mutations are reported, extending the previous set of pathogenic pVHL mutations 35. This can be particularly helpful for future mutation-correlation studies. Information in VHLdb may help the scientific community to decipher data derived from tumor genome sequencing projects 59, as well as provide high quality data to be included in predictive genomics studies 60. Updates such as error reports and submissions of new data to VHLdb are highly encouraged from the community through the implemented feedback function. In the future, we envisage that VHLdb will include more annotations, such as distinct causal relationships between mutations and affected pathways.