Capturing variation impact on molecular interactions: the IMEx Consortium mutations data set

The current wealth of genomic variation data identified at the nucleotide level has provided us with the challenge of understanding by which mechanisms amino acid variation affects cellular processes. These effects may manifest as distinct phenotypic differences between individuals or result in the development of disease. Physical interactions between molecules are the linking steps underlying most, if not all, cellular processes. Understanding the effects that amino acid variation of a molecule’s sequence has on its molecular interactions is a key step towards connecting a full mechanistic characterization of nonsynonymous variation to cellular phenotype. Here we present an open access resource created by IMEx database curators over 14 years, featuring 28,000 annotations fully describing the effect of individual point sequence changes on physical protein interactions. We describe how this resource was built, the formats in which the data content is provided and offer a descriptive analysis of the data set. The data set is publicly available through the IntAct website at www.ebi.ac.uk/intact/resources/datasets#mutationDs and is being enhanced with every monthly release.

Many diseases are caused by specific mutations, and prognosis or response to treatment is frequently mutation-specific. The study of how mutations affect molecular interactions is thus of extreme interest since it can help ascertain the role of specific protein residues on the universal function of molecular binding. Several studies [1][2][3][4] have explored the impact of disease-related variation in molecular interaction networks, using structural studies and computational predictions to attempt to both identify variation-affected interfaces and predict the effect of specific variants on interactions. These studies suggest that interaction interfaces contain a significantly higher rate of disease-related variants than the rest of the molecule and that variant location in these interfaces can determine disease specificity.
Despite available high-throughput interaction screening platforms, the experimental validation of these variation effect predictions at a systems scale remains a major challenge. Such data are, however, reported in the literature, although they are difficult to search and collate. Researchers have for many years been examining the effect of single, or multiple, induced point mutations on both binary and n-ary interactions in small-scale experiments. Targeted changes to the amino acid sequence of a protein have been engineered, largely by site-directed mutagenesis, with the aim of mimicking known variants 5,6, removing known, or predicted, post-translational modifications 7,8, disrupting regions required for protein stability or altering the properties of protein binding domains 9,10, and their effects on the interaction of interest monitored. It has been the work of the IMEx Consortium 11 to capture such information into a single data set and thus make it available for researchers to re-use and reanalyse. IMEx Consortium annotators follow a detailed curation model, capturing not only full details of the experiment (including interaction detection method, participant identification method and the host organism) but also a description of the constructs used. This may include the co-ordinates of deletion mutants used to derive a minimum binding domain and also the effect of point mutations. Databases in the Consortium perform detailed, archival curation of published literature and also receive pre-publication data through direct submissions. This close collaboration with data producers often entails access to unpublished details in the data, such as experiments reporting mutations that have no effect on interactions, which enables the capture of added value for the scientific community.
Here we describe what is, to our knowledge, the largest literature-derived data set capturing the effect of sequence changes on interaction outcome. We discuss how the data set was generated and how it is maintained by the EMBL-EBI IntAct team. We also provide an initial analysis of the data set, highlighting its overlap with genomic variation data, discussing possible biases and exploring its potential as a benchmark for variant effect prediction tools.

The IMEx mutations data set: data curation and quality control
The IMEx Consortium databases have been collecting point mutation data for over 14 years, which has resulted in a sizeable data set of almost 28,000 fully annotated events (www.ebi.ac.uk/intact/resources/datasets#mutationDs). The IMEx resources curate interaction data into structured database fields, and from there into community standard interchange formats, and each observation is described using controlled vocabulary terms. Mutations are mapped to the underlying protein sequence in UniProtKB and updated in line with changes to that sequence, to ensure that they stay mapped to the correct amino acid residue with every proteome release.
In order to make the mutant data set more accessible to the biomedical scientist, the Consortium has released the mutation data set in a tab-delimited format (Box 1), which includes details of the position and the amino acid change of the mutation, the molecules in the interaction and the effect of the mutation on the interaction, as well as additional fields containing contextual information.
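As a minimal sketch of how a tab-delimited export of this kind might be consumed, the Python snippet below parses a small in-memory sample and groups mutations by their reported effect. The column names (`protein`, `position`, `original`, `replacement`, `effect`) and the sample rows are hypothetical placeholders chosen for illustration; the actual column headers are those described in Box 1 and should be substituted when working with the released file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample mimicking a tab-delimited export. Real column names
# and values must be taken from the released file (see Box 1).
SAMPLE = (
    "protein\tposition\toriginal\treplacement\teffect\n"
    "PROT_A\t42\tR\tA\tdisrupting\n"
    "PROT_A\t57\tS\tA\tno effect\n"
    "PROT_B\t10\tL\tP\tdecreasing\n"
)


def load_mutations(handle):
    """Read tab-delimited rows into a list of dicts, casting position to int."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["position"] = int(row["position"])
        rows.append(row)
    return rows


def group_by_effect(rows):
    """Map each reported effect to a list of 'R42A'-style mutation labels."""
    grouped = defaultdict(list)
    for row in rows:
        label = f"{row['original']}{row['position']}{row['replacement']}"
        grouped[row["effect"]].append(label)
    return dict(grouped)


mutations = load_mutations(io.StringIO(SAMPLE))
by_effect = group_by_effect(mutations)
print(by_effect)
```

To read the real file, the `io.StringIO` wrapper can be replaced by an opened file handle, with the field names adjusted to match the published header.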
Additionally, a data-update pipeline has been specifically developed to ensure the accuracy of the annotation of mutation events as interaction participant features (suppl. figure 1). The construction of this pipeline has been made possible by the creation of specific fields capturing sequence changes in our recently developed standard format PSI-MI XML3.0 12 . It is run in coordination with the IntAct database monthly protein update procedure, which ensures synchronization with UniProtKB 13 and automatically shifts feature positions if there are changes in referenced protein sequences. The pipeline has been applied to the entire data in the IntAct database (www.ebi.ac.uk/intact), in which all IMEx data, and also legacy data generated by the IntAct, MINT, DIP and UniProt curation groups is housed (see Supplementary Methods for details on re-annotation and data update procedures). The mutation data update pipeline will continue to be run in quality control mode with every release of IntAct to ensure the mutation data set is kept entirely up to date with UniProtKB.
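The idea of shifting a feature position when a referenced sequence changes can be illustrated with a simplified sketch. The function below is not the IntAct pipeline, which is part of the Java production code; it is an assumed, minimal strategy that re-locates a residue in an updated sequence by searching for the residue together with its flanking context, and gives up when that context is missing or ambiguous. The function name, the flank width and the toy sequences are all hypothetical.

```python
def remap_position(old_seq, new_seq, position, flank=3):
    """Return the 1-based position in new_seq corresponding to `position`
    in old_seq, or None if the flanking context is absent or ambiguous."""
    idx = position - 1  # convert to a 0-based index
    if not 0 <= idx < len(old_seq):
        raise ValueError("position outside the old sequence")

    # Take the residue plus up to `flank` residues on either side.
    start = max(0, idx - flank)
    end = min(len(old_seq), idx + flank + 1)
    context = old_seq[start:end]
    offset_in_context = idx - start

    first = new_seq.find(context)
    if first == -1:
        return None  # context lost in the new sequence: needs manual review
    if new_seq.find(context, first + 1) != -1:
        return None  # context occurs more than once: ambiguous
    return first + offset_in_context + 1


# Toy example: three residues are added at the N-terminus of the sequence.
old = "MKTAYIAKQR"
new = "MSG" + old
print(remap_position(old, new, 5))  # residue Y moves from position 5 to 8
```

Returning `None` rather than guessing reflects the quality-control use case: positions that cannot be re-located unambiguously are better flagged for a curator than shifted silently.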

Data set statistics
The full IMEx mutations data set contains 27,868 fully annotated events in which a sequence change has been experimentally tested in an interaction experiment. All this information has been manually curated, representing over 33,000 person-hours' worth of biocurators' work, and it is continuously growing with on-going IMEx curation activities. The 4,353 proteins annotated come from 297 different species, with over 60% of the events annotated in human proteins and roughly 90% annotated in seven main model organisms (see table I).
In total, 13,926 interaction evidences are annotated with differentially reported effects, using the PSI-MI controlled vocabulary. Most of the effects reported are of a 'deleterious' nature, either disrupting (10,976 annotations, 39.3%) or decreasing the interaction (8,553 annotations, 30.7%), but there is a significant number of interactions that are either strengthened (2,256 annotations, 8.1%) or caused (188 annotations, 0.7%) by the mutation when compared with the wild type sequence (figure 1a). The data set also includes those mutations that were experimentally tested but found to have no effect over the interaction (3,057 annotations, 11%) and 'undefined' mutations that were present in constructs used in the experiment but where the comparison with the "wild type" reference is either absent or not possible (2,838 annotations, 10.2%).
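The category breakdown above can be checked with a few lines of Python. The counts are those quoted in the text; the dictionary keys are shorthand labels chosen for this sketch rather than the exact PSI-MI term names. Percentages computed this way can differ from the quoted ones in the last digit because of rounding.

```python
# Annotation counts per effect category, as quoted in the text.
counts = {
    "disrupting": 10976,
    "decreasing": 8553,
    "strengthening": 2256,
    "causing": 188,
    "no effect": 3057,
    "undefined": 2838,
}

total = sum(counts.values())
shares = {effect: 100 * n / total for effect, n in counts.items()}

for effect, pct in shares.items():
    print(f"{effect:<14}{counts[effect]:>7}{pct:7.1f}%")
print(f"{'total':<14}{total:>7}")
```

The counts add up to the 27,868 annotated events reported for the full data set.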
It is important to note that the 'causing' and 'no effect' mutation effect categories have been only recently adopted into the controlled vocabulary and captured by the biocurators, so they have a much lower number of annotations and are not directly comparable with the other categories. Currently, the only resources that represent the impact of amino acid substitutions on binding events are the SKEMPI database 14 , UniProtKB and IMEx Consortium member databases through IntAct (see table II for a detailed comparison). Of these resources, IMEx is the biggest and the only one that can provide easily accessible, systematically described, up-to-date annotations. UniProtKB mutagenesis annotations record whether a change in sequence affects an interaction, but the experimental context is not captured and the effects are described in a semi free-text field that is difficult to parse. SKEMPI offers a detailed overview of sequence change effects on binding derived from in vitro experiments, recording changes in affinity and other kinetic parameters. Only very specific interaction detection methods, using purified proteins, are considered, which limits its scope.
The IMEx Consortium is currently formed by 11 groups, each one with their own area of interest, that have agreed to use the same curation standards and data representation download formats. All members of the consortium [15][16][17][18][19][20][21][22] use the curation platform provided by the IntAct team at EMBL-EBI. Figure 1e shows the number of events annotated by each data resource. Large databases such as IntAct, DIP and MINT, with an exclusive focus on interaction data curation, have produced the majority of the annotations, but a sizeable part of the data set has been entered by other, domain-specific, members of the Consortium.
According to the IMEx schema and curation policy, interaction evidence, rather than interacting pairs of molecules, is the focus of the data representation. This results in the curation of multiple distinct pieces of evidence describing the same interacting pairs and offers a way to weight how well characterized a given interacting group of [...]
[...] One reason for these contradictory results may be differences in the experimental methodologies used to measure the effect, since IMEx databases recognize a large variety of experimental approaches that provide molecular interaction evidence.
The vast majority of the data set refers to amino acid substitutions, with a marginal amount of insertions and deletions reported (only 65 deletion and 83 insertion annotations). Figure 2a shows that arginine, leucine and serine are the most frequently replaced residues, while histidine and methionine residues are mutated less often (see suppl. figure 3a for a more detailed view on specific replacements). Alanine is by far the most frequently used residue for replacement (figure 2b), which is probably reflective of the widespread use of alanine scanning 23 to identify residues critical for binding to other molecules, either because they are found on the interacting interface or at an allosteric binding site. When we checked the relative proportion of the different mutation effects per replacing residue (figure 2c, suppl. figure 3b), alanine replacements mostly associate with deleterious effect on interactions. The dominance of deleterious effects most probably reflects the authors of the original study using alanine scanning to locate binding-related residues.

Genomic variation and the IMEx mutations data set
In this era of deep-sequencing genomics, there is a wealth of data concerning nonsynonymous genomic variants. As discussed before, the motivation behind the design of these experiments varies, and only a fraction were specifically designed to systematically test known variants vs reference ("wild type") versions of the participant proteins 5,24 . Hence, we decided to explore how much of currently available information for natural or disease related variation can be linked to the data set. Because of the strong predominance of human data both in IMEx mutations and in variation data sets, we decided to focus on human proteins only.
We used the EMBL-EBI Proteins API 25 [...]
We also checked how many of the mapped variant annotations have been linked to disease according to UniProtKB. Disease associations were complemented with data from the DisGeNET database 29. There were disease-associated variants for 42% (840) of the proteins queried, with a median value of 4 disease variants mapped per protein. As seen in Table III, [...]

Variant effect annotation: computational predictions and literature curation
There is currently a variety of computational tools used to annotate variation data sets 35 . These tools can report the effect of variation on protein function, folding or binding, usually based almost exclusively on sequence or structural data, or can also report genome-derived parameters such as allele frequencies or conservation scores.
We wanted to study how variation annotations provided by these tools align with experimental effect over interaction as reported in the literature.
For this purpose, we used mutfunc (www.mutfunc.com) 36 [...]
We first examined the predicted destabilization effect of mutations on structural models of protein-protein interfaces, dividing them by the literature-reported effect. As can be seen in figure 4a, mutations with a 'decreasing' and especially a 'disrupting' effect on interactions had a significantly higher predicted destabilization effect than those with no effect, a difference that was not seen in mutations that would strengthen or even cause an interaction. These 'deleterious' groups also contained a significantly higher proportion of mutations predicted to be very destabilizing for interfaces (figure 4b).
We next studied genome-derived parameters that are useful to study variation, such as residue conservation or natural allele frequencies. The experimentally observed impact on binding stability that we report in our data set may also be reflected in these parameters. This assumption was partially confirmed using two independent measurements. First, we used the 'sorting intolerant from tolerant' (SIFT) method 37, observing that the proportion of variants with low tolerance scores was significantly higher in all groups where an effect was reported than in the 'no effect' reference (figure 4c). We also checked allele frequencies as derived from ExAC data. Again, mutations with a reported effect had significantly lower allele frequencies (figure 4d) and a higher proportion of alleles with extremely low frequencies (figure 4e) than those reported to have no effect on interaction.
The interaction-perturbing effects reported in the IMEx data set can be caused by modifying overall protein structure or by alteration of binding interfaces. We can determine if the mutations reported fall within sequence regions associated with binding using both computational predictions and literature-reported experimental data. We obtained predicted interfaces, based on available structural data, from [...]

Literature bias in the IMEx mutations data set
IMEx databases have a wide scope when selecting publications for curation and it is reasonable to assume that the proteins in this data set are representative of the interaction data that has been explored in the literature. Socially driven literature bias is a well-known phenomenon previously reported for literature-curated data sets 24,41, so we decided to explore to what extent it affects the data set.

Discussion
Here we present a unique resource containing experimental, publicly available information about the impact of sequence changes on specific protein-protein interaction outcomes. This is a direct result of the IMEx Consortium full-detail curation policies and represents an example of how expert curation, resulting in structured and standardized representations, is required in order to make the most of published experimental results. In comparison to similar, pre-existing data sets recording variation influence over interactions, this resource represents a leap forward in depth, size and scope (table II). A previous, relatively small study 43 reported a curated list of about 100 mutations influencing interactions. This was used as benchmark in a study investigating the link between disease-related variation and interaction interfaces 1 , showing an application of this type of data, despite obvious limitations due to its size.
The curation infrastructure and practices of the IMEx Consortium will enable the capture of data from a growing number of deep-mutagenesis interaction studies, where hundreds if not thousands of single amino acid changes over the whole length of a protein sequence are explored for their influence on interactions 44.
We have also acknowledged the social biases inherent to any literature-based resource in our data set, although it is difficult to ascertain their extent. Alanine scanning features prominently as a commonly used technique (figure 2b) and may represent amino acid changes that will never be seen in nature due to evolutionary constraints or simply because they would require extensive sequence alteration at the DNA level. [...]
However, in order to unravel the mechanisms behind detected effects, it is key to explore how molecular interactions are affected 46. Currently, most of the mechanistic insight into variation effects is generated by computational annotation and predictions, using tools that are based on relatively small reference sets, generally based on structural data. As an example, the widely used FoldX algorithm is generated from protein complex structures and has been tested against a library of 1,008 mutants 39.
Our current data set already provides interaction effects for over 10 times more individual variants and is not limited to structural data. The wide scope of experimental setups represented (figure 1c) allows the capture of effects on proteins and protein regions that might be intrinsically non-structured 47. We show that the data set gives a currently unparalleled and representative overview of which residues are key for protein interactions, with the results being in good accordance with commonly used variant annotators (figure 4). IMEx curation practices originally did not enforce capturing sequence changes that had no effect on interaction outcome, but as a result of consultations with tool developers and data users this policy has been amended and the data set now features a growing number of mutations with no effect that can be used as a negative training set for the development of computational annotation tools.
The IMEx mutations data set represents both a reference source for direct, literature-based variant characterization and a unique benchmark that can be used to further refine computational variant effect annotators. We will continue to expand the data set and improve its accessibility for users, as part of the IMEx global mission of ensuring data representation and re-use.

Software and packages used
The quality control pipeline for mutation annotations was developed and integrated within the production code used in the IntAct database. The code is written in Java and makes use of the Hibernate and Spring frameworks for interaction with the core SQL database and application implementation. Specific implementation details are available upon request. Statistical analysis, plots, mutation re-annotation checks and mappings were performed using the R programming language 48 through the RStudio programming suite 49. The following R packages were used in the study: data. [...]
[...] UniProtKB. Otherwise, they were considered 'partial' mappings. Disease annotations were extracted from the API's output, along with rsIDs. These rsIDs were then used in DisGeNET to search for additional disease annotations that were brought in as well.

Predicting impact on protein interaction interfaces
Experimental and homology modelled structures for protein interactions were obtained from the Interactome3D database 38 [...]
Figure legend (fragment): 'mild conflict' cases are those in which the mutation is sometimes reported as having some effect and in other cases as having no detectable effect.
Figure legend (fragment): [...], the set of proteins annotated with no effect and the remaining proteins in IMEx (non-mutated). In the "no effect" word enrichment analysis, the words "pathway" and "action" have been removed to make the remaining words more visible (original word cloud available as supplementary figure 6a), while in the "common mutated" word cloud the words "pathway" and "signalling" have been removed (original word cloud available as supplementary figure 6b). The analysis in this figure was performed taking into account human proteins only.