Extensive disruption of protein interactions by genetic variants across the allele frequency spectrum in human populations

Each human genome carries tens of thousands of coding variants. The extent to which this variation is functional and the mechanisms by which it exerts its influence remain largely unexplored. To address this gap, we leverage the ExAC database of 60,706 human exomes to investigate experimentally the impact of 2009 missense single nucleotide variants (SNVs) across 2185 protein-protein interactions, generating interaction profiles for 4797 SNV-interaction pairs, of which 421 SNVs segregate at > 1% allele frequency in human populations. We find that interaction-disruptive SNVs are prevalent at both rare and common allele frequencies. Furthermore, these results suggest that 10.5% of missense variants carried per individual are disruptive, a higher proportion than previously reported; this indicates that each individual’s genetic makeup may be significantly more complex than expected. Finally, we demonstrate that candidate disease-associated mutations can be identified through shared interaction perturbations between variants of interest and known disease mutations.

Thick black bars are the interquartile range, white dots display the median, and extended thin black lines represent 95% confidence intervals. P values by one-tailed U-test. (f) Co-expression of protein abundance levels for protein interaction pairs used in this study. Interacting protein pairs were significantly more likely to be co-expressed than random protein pairs in tissue and cell data from the Human Proteome Map. P value by two-sided KS test. See also Supplementary Note 1.

Figure 8. Uncropped Western blots for SEPT12-SEPT1 co-IPs. (a) Westerns for SEPT12 co-IP of SEPT1 using α-FLAG beads. SEPT1 is tagged with 3×HA and detected using α-HA. (b) Westerns for SEPT12 co-IP of SEPT1 using α-FLAG beads. SEPT12 is tagged with 3×FLAG and detected using α-FLAG. eGFP tagged with 3×FLAG is included as a negative co-IP control. (c) α-GAPDH control for SEPT12-SEPT1 co-IP run on a stripped membrane. In (a-c), black boxes indicate where figures were cropped for Western blot in Fig. 5e. * indicates 50 kDa marker.

Supplementary Note 1. Selection of Y2H protein-protein interaction pairs from a reference interactome.
To select interaction partners for each mutant protein tested in our SNV-perturbation screen, we first leveraged a Y2H reference interactome comprising over 14,000 known wild-type protein-protein interactions reported in four manuscripts 1-4. Since these published interactions are retestable by our version of Y2H, we only tested SNVs that corresponded with protein-protein interactions from this reference interactome. This requirement for retestable wild-type interactions found in the literature-reported reference interactome dramatically reduces the search space in which we probe for disruptive SNVs and removes the need for an all-by-all Y2H interaction screen.
On average, each protein in the reference interactome has 2 to 3 interaction partners. We note that we tested 847 unique genes against 2,185 corresponding interaction partners (~2.5 interaction partners per gene-encoded protein). For each wild-type protein-protein interaction, we then tested whether the SNVs corresponding to that interaction can perturb it. This consisted of 2,009 SNVs found on 847 unique genes. Since each of these genes has ~2.5 interaction partners, this results in a total of 4,797 SNV-interaction pairs tested (Fig. 1c).
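The averages quoted above can be checked directly from the reported counts. The short sketch below uses only the four numbers stated in this note; the variable names are illustrative and not taken from the study's code.

```python
# Counts as reported in Supplementary Note 1.
n_genes = 847         # unique genes carrying tested SNVs
n_partners = 2185     # wild-type interaction partners tested
n_snvs = 2009         # missense SNVs tested
n_pairs = 4797        # SNV-interaction pairs tested

# Average number of interaction partners per gene-encoded protein.
partners_per_gene = n_partners / n_genes

# Average number of interactions each SNV was tested against.
pairs_per_snv = n_pairs / n_snvs

print(f"partners per gene: {partners_per_gene:.2f}")
print(f"interactions tested per SNV: {pairs_per_snv:.2f}")
```

Both ratios fall between 2 and 3, in line with the average of 2 to 3 partners per protein in the reference interactome.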
We further note that Y2H has been extensively demonstrated to detect biologically meaningful interactions across many organisms in many studies 1,3-8. To further confirm the biological significance of the interactions used in this study, we examined the co-expression of protein abundance levels corresponding to interactions used in our study. Using protein expression levels for 30 adult and fetal tissues and cell types from the Human Proteome Map 9, we found that proteins corresponding to interactions used in our study were significantly more likely to be co-expressed than random protein pairs, confirming the in vivo biological significance of the interactions used in the study (Supplementary Figure 1f).

Supplementary Note 2. Calculating the fraction of disruptive missense variants per individual.
We note that allele counts in ExAC correspond predominantly to common alleles (MAF > 1%). Therefore the fraction of disruptive alleles that are common will have the greatest influence on the average number of interaction-disruptive variants per individual. Disruptive and total allele counts corresponding to Fig. 2b are presented below in Supplementary Table 4 as reported in Fig. 2c, where the error is calculated by the delta method.

Supplementary Note 3. Categorizing stable, moderately stable, and unstable mutant proteins.
Plate reader raw data from each 96-well plate consists of two fluorescence readings corresponding to GFP and mCherry expression in each well for proteins expressed in pDEST-DUAL vector. Wildtype/mutant groups are segregated to be on the same plate so that they can be processed together. Each plate is allocated eight wells for background controls: four wells transfected with empty pDEST-DUAL vector such that only mCherry expression is expected, used as a GFP baseline, and four wells transfected with empty pcDNA-DEST47 vector where no GFP or mCherry expression is expected, used as a mCherry baseline. All expression values are normalized as a z-score representing the number of standard deviations away from the mean background expression.
Supplementary Figure 10: All fluorescence readings are represented as a z-score away from the controls in that plate's 12th column. A12-D12 serve as GFP background using empty pDEST-DUAL vector, E12-H12 serve as mCherry background using empty pcDNA-DEST47 vector.
Next, we apply basic quality control filters. A fluorescence reading is considered significant if the P value associated with its z-score on the background normal distribution is less than 0.05. We only perform analysis on experiments with significant wildtype expression for both GFP and mCherry channels. Further, we filter out any mutants that do not present significant mCherry expression.
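The normalization and quality control filters described above can be sketched as follows. This is a minimal illustration, not the study's analysis code: it assumes the sample standard deviation of the four background wells and a one-sided (upper-tail) P value on a standard normal distribution, neither of which is specified in the text. Function names and the example readings are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def z_score(reading, background):
    """Number of background standard deviations a reading lies from the
    mean of the plate's background wells (sample standard deviation assumed)."""
    return (reading - mean(background)) / stdev(background)

def is_significant(z, alpha=0.05):
    """Assumed one-sided test: P value is the upper-tail probability of z
    under a standard normal distribution."""
    p_value = 1.0 - NormalDist().cdf(z)
    return p_value < alpha

def passes_qc(wt_gfp_z, wt_mcherry_z, mut_mcherry_z):
    """Keep an experiment only if the wildtype is significant in both
    channels and the mutant shows significant mCherry expression."""
    return (is_significant(wt_gfp_z)
            and is_significant(wt_mcherry_z)
            and is_significant(mut_mcherry_z))

# Hypothetical raw readings for the four GFP background wells (A12-D12).
gfp_background = [100.0, 110.0, 90.0, 100.0]
example_z = z_score(120.0, gfp_background)
print(round(example_z, 3), is_significant(example_z))
```

Under these assumptions, a z-score must exceed roughly 1.645 to count as significant at the 0.05 level.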
We calculate wildtype activation and fold change to determine whether a mutant well is underexpressing GFP relative to its corresponding wildtype. Wildtype activation is the ratio between the GFP z-score and the mCherry z-score in the wildtype well for that ORF and is reported as "Wildtype stability score" in Fig. 3b. Similarly, mutant activation is the ratio between the GFP z-score and mCherry z-score in the mutant well for an ORF and is reported as "Mutant stability score" in Fig. 3b. We then calculate fold change as the ratio of mutant activation over wildtype activation, reported in Fig. 3d as the "stability score ratio." All values are reported in Supplementary Data 5.
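$$\text{WT Activation} = \frac{Z(\text{GFP}_{\text{WT}})}{Z(\text{mCherry}_{\text{WT}})} \tag{1}$$

$$\text{Fold Change} = \frac{Z(\text{GFP}_{\text{Mut}}) \,/\, Z(\text{mCherry}_{\text{Mut}})}{Z(\text{GFP}_{\text{WT}}) \,/\, Z(\text{mCherry}_{\text{WT}})} \tag{2}$$

As an added quality control step, experiments with WT Activation less than 1.0 are removed. All other experiments are then classified into three groups: stable if the fold change is above 0.5, moderately stable if the fold change is between 0.5 and 0.0, and unstable if the fold change is less than 0.0.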
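The classification rule can be written as a short function. This is a sketch under stated assumptions: the text does not say how the boundary values 0.5 and 0.0 are handled, so both are assigned to the moderately stable group here, and the function names and example z-scores are hypothetical. Division by the mCherry z-score is safe only because earlier filters require significant (positive) mCherry expression.

```python
def activation(gfp_z, mcherry_z):
    """Ratio of the GFP z-score to the mCherry z-score for one well."""
    return gfp_z / mcherry_z

def classify_stability(wt_gfp_z, wt_mcherry_z, mut_gfp_z, mut_mcherry_z):
    """Return 'stable', 'moderately stable', or 'unstable', or None when
    the experiment is removed because WT activation is below 1.0."""
    wt_activation = activation(wt_gfp_z, wt_mcherry_z)
    if wt_activation < 1.0:
        return None  # removed by the added quality control step

    # Fold change: mutant activation over wildtype activation.
    fold_change = activation(mut_gfp_z, mut_mcherry_z) / wt_activation

    if fold_change > 0.5:
        return "stable"
    if fold_change >= 0.0:  # assumption: boundaries 0.5 and 0.0 land here
        return "moderately stable"
    return "unstable"

# Hypothetical z-scores: wildtype activation is 2.0 in each case.
print(classify_stability(10.0, 5.0, 8.0, 5.0))   # fold change 0.8
print(classify_stability(10.0, 5.0, 2.0, 5.0))   # fold change 0.2
print(classify_stability(10.0, 5.0, -1.0, 5.0))  # fold change -0.1
```

A negative fold change arises when the mutant GFP reading falls below the mean GFP background, which gives a negative GFP z-score.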

Supplementary Note 4. Dissecting the impact of variants in disease.
Studying complex disease is difficult since very often no single variant alone is fully penetrant. Nonetheless, the simplest case of complex disease, digenic inheritance in which two genes both contribute to a single phenotype, is actually quite prevalent. Searching HGMD 10 yields a total of 365 mutations that contribute to digenic inheritance. Moreover, a search through PubMed for "digenic mutations" yields 378 papers, although this strictly refers to cases in which two heterozygous mutations in different genes must occur together for the disease phenotype to manifest. Cases in which the impact of a disease-causing mutation in one gene is influenced by a polymorphic variant in another gene are far more common and are extensively documented in HGMD. Such variants are often only partially penetrant, resulting in disease in only particular genetic backgrounds 11. While dissecting how these variants modulate each other's impact is not straightforward, individually assessing the impact of these variants in isolation is a crucial first step towards understanding how these variants function epistatically. In this context, our study represents an important resource for examining what fraction of population variants are functional and could conceivably play a role in disease risk as a result.

Supplementary Note 5. Examining the potential drug-relevance of disruptive SNVs
The results of our study may have important implications in related fields such as pharmacogenomics and toxicogenomics. Disruptive SNVs on enzymes may alter the metabolic kinetics of impacted enzymes, while SNVs on transporters and targets of drugs may lead to changes in the pharmacokinetic and pharmacodynamic properties of their corresponding proteins. For example, the D816H/V mutations on the receptor tyrosine kinase KIT confer resistance to imatinib and sunitinib by shifting the conformational equilibrium of KIT 12. To explore the potential relevance of our SNV disruption data, we generated a dataset of disruptive SNVs potentially relevant to pharmacogenomics and toxicogenomics by intersecting our SNV disruption dataset with four sets of genes: all human enzymes, drug-metabolizing enzymes, drug targets, and drug transporters. The list of all human enzyme genes was obtained from HumanCyc version 21.5 13, while the lists of drug-related genes were obtained from DrugBank version 5.1.2 14. Among the SNVs that we tested, 350 were on enzymes, and 84 of them disrupted at least one interaction. A table consisting of all disruptive SNVs that may be relevant to drug action is provided in Supplementary Data 7.
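The intersection step amounts to labeling each disruptive SNV with the gene sets its host gene belongs to. The sketch below illustrates the idea; all SNV and gene identifiers are hypothetical placeholders, not entries from HumanCyc, DrugBank, or Supplementary Data 7, and the data layout is an assumption.

```python
# Hypothetical placeholder identifiers, for illustration only.
disruptive_snvs = {
    "SNV_1": "GENE_A",
    "SNV_2": "GENE_B",
    "SNV_3": "GENE_C",
    "SNV_4": "GENE_D",
}

# The four gene sets named in the text, filled with placeholder genes.
gene_sets = {
    "enzyme": {"GENE_A", "GENE_B"},
    "drug_metabolizing_enzyme": {"GENE_B"},
    "drug_target": {"GENE_C"},
    "drug_transporter": set(),
}

def annotate(snv_to_gene, sets_by_name):
    """Map each SNV to the sorted list of gene sets containing its gene,
    dropping SNVs whose gene is in none of the sets."""
    annotated = {}
    for snv, gene in snv_to_gene.items():
        categories = sorted(
            name for name, genes in sets_by_name.items() if gene in genes
        )
        if categories:
            annotated[snv] = categories
    return annotated

drug_relevant = annotate(disruptive_snvs, gene_sets)
print(drug_relevant)
```

For the enzyme set, the counts reported above correspond to 84 of 350 tested SNVs, or 24%, disrupting at least one interaction.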

Supplementary Note 6. Background and motivation for Protein Complementation Assay.
Protein Complementation Assay (PCA) is a protein-protein interaction assay performed in HEK 293T cells in which a bait and prey protein are fused to two complementary fragments of a fluorescent protein, YFP.
If the bait and prey protein successfully interact, the two YFP fragments will stably bind and fluoresce as a result. PCA is a particularly valuable assay in protein-protein interaction screens because it is high-throughput, and it provides an independent assay for validating the quality of protein-protein interactions detected through Y2H screens. For this reason, PCA is commonly used in many interactome screens, including in Arabidopsis 7, yeast 3,8, and human 4,15. Notably, PCA can also be used to validate that two proteins do not interact, which is important when testing the impact of disruptive variants. To do this, loss of fluorescence signal in PCA for mutant interaction pairs relative to wild-type pairs is measured to validate that Y2H-tested mutations are indeed disruptive 16,17.