Genoppi is an open-source software for robust and standardized integration of proteomic and genetic data

Combining genetic and cell-type-specific proteomic datasets can generate biological insights and therapeutic hypotheses, but a technical and statistical framework for such analyses is lacking. Here, we present an open-source computational tool called Genoppi (lagelab.org/genoppi) that enables robust, standardized, and intuitive integration of quantitative proteomic results with genetic data. We use Genoppi to analyze 16 cell-type-specific protein interaction datasets of four proteins (BCL2, TDP-43, MDM2, PTEN) involved in cancer and neurological disease. Through systematic quality control of the data and integration with published protein interactions, we show a general pattern of both cell-type-independent and cell-type-specific interactions across three cancer cell types and one human iPSC-derived neuronal cell type. Furthermore, through the integration of proteomic and genetic datasets in Genoppi, our results suggest that the neuron-specific interactions of these proteins are mediating their genetic involvement in neurodegenerative diseases. Importantly, our analyses suggest that human iPSC-derived neurons are a relevant model system for studying the involvement of BCL2 and TDP-43 in amyotrophic lateral sclerosis.

Large-scale genetic datasets, such as those obtained from genome-wide association studies (GWAS) or exome sequencing, are becoming increasingly available. Simultaneously, advanced proteomic technologies can generate high-quality cell- and tissue-specific quantitative proteomic data (e.g., from immunoprecipitations followed by tandem mass spectrometry [IP-MS/MS] 1,2 or whole-proteome analyses of cells or tissues 3 ). Integration of genomic and proteomic datasets has revealed that genetic variation implicated in rare and common diseases often manifests at the proteome level, for example, by impacting protein complexes or cellular networks 4 . Data from genetics and cell-type-specific quantitative proteomics, therefore, have the potential to inform each other and lead to key molecular and biological insights. However, even for experts in these fields, the relevant data types are often not interoperable. This creates a bottleneck for functionally interpreting genetic data and dissecting the molecular biology of human diseases. In the longer term, difficulties in reconciling these data types will hamper efforts towards gaining mechanistic insights from genetic data and designing therapeutic interventions. To enable the robust, standardized, and intuitive integration of data from genetics and cell-type-specific quantitative proteomics, we have developed an open-source computational tool named Genoppi 5 (web application: lagelab.org/genoppi, R package source code: github.com/lagelab/Genoppi), which we apply to 16 cell-type-specific protein-protein interaction datasets to illustrate the functionalities of the tool.

Results
Genoppi enables integration of proteomic and genetic data. Given log2 fold change (FC) values between studied conditions (e.g., bait versus control IPs) for multiple experimental replicates, Genoppi identifies proteins with statistically significant log2 FC by applying user-defined thresholds, displays the data in scatter and volcano plots, and provides options for integrating the results with other forms of data (Fig. 1a, "Methods," and Supplementary Note 1). Genoppi can quality control (QC) a protein interaction dataset 6 by testing whether it is enriched for previously known interaction partners compiled from InWeb_InBioMap (InWeb; includes data from >40,000 scientific articles 7,8 ; Fig. 1b), iRefIndex 9 , or BioPlex 10,11 . Genoppi further allows the user to subset the protein interactions in these databases based on the provided confidence metrics (Supplementary Note 1). The ability to automatically integrate these databases with experimental data in real-time makes it easy to distinguish between published versus newly identified interaction partners of a protein of interest, thus eliminating the need to extensively interrogate the literature on a case-by-case basis.
Genoppi can also intersect and co-visualize proteomic data with various types of data derived from population genetic studies. One such data type is a list of single-nucleotide polymorphisms (SNPs) derived from GWAS of a disease of interest. Genoppi automatically uses linkage disequilibrium (LD) information from the 1000 Genomes Project 12 to identify proteins in a proteomic dataset that are encoded by genes in LD loci of NHGRI-EBI GWAS catalog 13 or user-defined SNPs 14 (Fig. 1c and "Methods"). In addition, exome and whole-genome sequencing have been increasingly used to identify genes that have a significant burden of rare mutations in individuals with a particular disease compared to healthy controls. Genoppi is designed to enable the identification of sets of proteins in a proteomic dataset based on such user-defined gene lists 15 (Fig. 1c). Genoppi can also incorporate gene constraint data from gnomAD 16 to label proteins intolerant to loss-of-function mutations (Fig. 1d). When these external datasets are integrated with proteomic results, overlaps are displayed in a number of user-defined ways (as Venn diagrams or superimposed on volcano plots), and statistically tested when appropriate ("Methods" and Supplementary Note 1).
Another feature of Genoppi is the annotation of proteomic data with gene sets from several databases, including HGNC 17 , GO 18,19 , and MSigDB 20,21 , to visually identify groups of proteins that may be overrepresented in a particular dataset (Fig. 1e). Genoppi can also perform tissue enrichment analysis using sets of tissue-specific genes derived from RNA or protein expression data in GTEx 22,23 or the Human Protein Atlas (HPA) 24 (Supplementary Fig. 1). This feature can also be applied to sets of tissue- or cell-type-specific genes uploaded by the user (e.g., cell-type-specific genes identified from single-cell RNA sequencing). Finally, it is possible to perform head-to-head comparisons of proteomic experiments performed under different conditions using Genoppi. For example, interaction experiments executed with and without drug treatment 6 , or with the wild-type and mutated versions of a protein, can be compared to elucidate the cellular effects of either pharmaceutical or genetic perturbations in a particular cell type (Fig. 1f). Overall, Genoppi provides various ways to explore proteomic datasets, to guide hypothesis generation, or to inform targeted follow-up experiments.
Genoppi is an open-source software that is easily accessible and flexible to meet data-driven custom needs in the research community and is available both as an R package and as a Shiny application with extensive documentation and exemplar datasets. In particular, the Genoppi web application provides a simple interactive interface with customizable options and the ability to work with a wide variety of quantitative proteomic datasets (e.g., IP-MS/MS analyses, or whole-proteome analyses of cells or tissues). For example, it is possible to dynamically explore a dataset by changing various technical thresholds and visualizing the results in real-time; furthermore, a search function enables the quick identification of proteins of interest in various plots (Supplementary Note 1). Users can also locally download the generated data and plots to share with collaborators. For users interested in building custom analytical pipelines, the R package has extensive documentation and can be leveraged for these purposes. For instance, the users may choose to analyze their data with an alternative statistical method before performing downstream analyses using Genoppi. In summary, Genoppi is a highly flexible platform for analyzing cell-type-specific proteomic datasets and facilitating data sharing in cross-disciplinary collaborations that are now common in both academia and industry. Further details about options, workflows, and analyses can be found in Supplementary Note 1.
Applying Genoppi to analyze cell-type-specific IP-MS/MS data.
To exemplify how analyzing cell-type-specific proteomic data using Genoppi can uncover convergent and divergent disease-relevant biology of the same protein in different cell types, we generated IP-MS/MS data for four proteins of interest (BCL2, TDP-43, MDM2, PTEN; baits, hereafter) in four distinct cell lines ("Methods" and Supplementary Data 1). We chose these four proteins because they play important but not fully elucidated roles in cancer, neurological disease, and psychiatric conditions.
We executed bait IPs in a human induced pluripotent stem cell (iPSC)-derived neuronal cell line (glutamatergic patterned induced neurons [GPiNs] 25 ) and three cancer cell lines (G401, T47D, and A375; Fig. 2a). The IPs were conducted in triplicate and quantitated by label-free liquid chromatography-tandem mass spectrometry (LC-MS/MS; "Methods"). Genoppi was used to: (i) QC, analyze, and visualize all IP-MS/MS results; (ii) identify significant interaction partners of each bait in each cell type; (iii) compare significant interaction partners between cell types; and (iv) integrate these data with published genetic datasets for biological discovery and hypothesis generation (Fig. 2, Supplementary Fig. 2, Supplementary Data 2 and 3, and "Methods").
BCL2 interactomes in cancer and neuronal cells. We first explored BCL2, which is a well-studied oncoprotein functioning as an apoptosis suppressor in a variety of cell types 26 ; it was also recently shown to be an important regulator of plasticity and cellular resilience during neuronal development 27 . Intriguingly, data from both human brain tissue and animal models of neuropathological conditions suggest a role for BCL2 in cell death regulation in the mature nervous system 28 and in amyotrophic lateral sclerosis (ALS) 29,30 . However, little is known about the specific role of BCL2 in different human cell types, and insights into its overlapping and differential interaction partners in neurons compared to cancer cell lines could generate actionable biological hypotheses of therapeutic relevance. Across the four tested cell lines, BCL2 had a total of 177 nonredundant statistically significant interaction partners. Interaction partners of any bait protein will hereafter be defined as proteins with a log2 FC > 0 and false discovery rate (FDR) ≤ 0.1 in the experiment compared to the cell-type-matched IgG control ("Methods"), while non-interactors are proteins that fail at least one of these criteria. Among BCL2 interactors, only five were previously reported in InWeb (Fig. 2b and Supplementary Data 3). However, when comparing the BCL2 interaction partners in GPiNs versus a cancer cell line (G401), we found both a large set of shared interactors and interactors unique to each cell line (Fig. 2b). Specifically, 23 out of 33 (69.7%) interactors found in GPiNs were also interactors in G401 cells; conversely, 23 out of 108 (21.3%) interactors found in G401 cells were also interactors in GPiNs (P = 7.8e-11, using a hypergeometric distribution).
This indicates that, despite little overlap with InWeb interactors and the identification of many cell-type-specific interactors (10 in GPiNs and 85 in G401 cells), a significant subset of new BCL2 interaction partners (n = 23) were common to two different cellular backgrounds, and thus are likely to be true interaction partners that have not been reported before (Supplementary Data 3).
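The shared and cell-type-specific counts quoted above follow from simple set operations on the two interactor lists. The following is a minimal Python sketch of that bookkeeping, not Genoppi code: the function name `overlap_summary` and the protein identifiers are hypothetical placeholders, and the synthetic sets are constructed only to reproduce the reported sizes (33 GPiN interactors, 108 G401 interactors, 23 shared).

```python
def overlap_summary(interactors_a, interactors_b):
    """Summarize shared and unique interactors between two cell types."""
    a, b = set(interactors_a), set(interactors_b)
    if not a or not b:
        raise ValueError("both interactor lists must be non-empty")
    shared = a & b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        "pct_a_shared": round(100 * len(shared) / len(a), 1),
        "pct_b_shared": round(100 * len(shared) / len(b), 1),
    }


# Synthetic identifiers (hypothetical) chosen to match the reported set sizes.
shared_ids = {f"shared_{i}" for i in range(23)}
gpin = shared_ids | {f"gpin_only_{i}" for i in range(10)}
g401 = shared_ids | {f"g401_only_{i}" for i in range(85)}

summary = overlap_summary(gpin, g401)
print(summary)
```

With these inputs the summary reports 23 shared, 10 GPiN-specific, and 85 G401-specific interactors, corresponding to 69.7% and 21.3% of the GPiN and G401 interactors, respectively.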
We next determined the correlation between replicate samples (Fig. 2c) and visualized the differential interactions of BCL2 in G401 cells versus GPiNs (Fig. 2d-f). Superimposing known interaction partners from InWeb with proteins detected in our IPs (Fig. 2d, e) shows that the majority of the interaction partners we identified in both cell types were new (104 of 108 in G401 cells and 33 of 33 in GPiNs; Supplementary Data 3).
To test whether the BCL2 interactome was enriched for known cancer driver genes in G401 cells, and for genes involved in neurological disorders in GPiNs, we integrated genetic and proteomic data in Genoppi. We used datasets of cancer driver genes 31 , genes involved in neurodevelopmental delay, autism spectrum disorders (ASDs) 32,33 , or schizophrenia (SCZ) 34 , as well as a curated list of genes involved in ALS 35,36 (Supplementary Data 4). The statistical analyses compare the enrichment of disease-related proteins among the interactors of the bait protein versus the non-interactors ("Methods"). Thus, the statistical enrichment is always conditional on the proteome being expressed in the tested cell line.
We found that known cancer driver genes were evenly distributed between proteins significantly interacting with BCL2 and other proteins in the G401 immunoprecipitate (i.e., the non-interactors; Supplementary Fig. 3a). In contrast, the BCL2 interactome in neurons was enriched for ALS-associated proteins compared to the overall GPiN immunoprecipitate (P = 0.041, using a hypergeometric distribution; Fig. 2f). We further show that ALS-implicated proteins were not enriched among BCL2 interactors found in any of the three cancer cell lines (Supplementary Fig. 3b-d), confirming that a potential connection to ALS is a feature of the neuron-specific interaction partners of BCL2. Genoppi thus points to targeted follow-up experiments to explore the biological implications of newly identified neuron-specific interactors of BCL2.
TDP-43 interactomes in cancer and neuronal cells. We further set out to investigate a well-established ALS-associated risk factor, TDP-43. TDP-43 is a ubiquitously expressed RNA-binding protein encoded by the gene TARDBP. Rare mutations in TARDBP have been identified as a cause of familial ALS and frontotemporal dementia (FTD). TDP-43 aggregation is also a common pathological hallmark of both neurodegenerative disorders 37 . To gain insights into neuronal functions of TDP-43, we analyzed the IP-MS/MS data of TDP-43 in GPiNs (Fig. 2c) and found many of its known InWeb interaction partners (P = 0.085, using a hypergeometric distribution; Fig. 2g). We also integrated known ALS-associated risk genes with the TDP-43 interactome in GPiNs and show that the experiments in GPiNs recapitulate known interactions between TDP-43 and VCP, MATR3, FUS, ATXN2, HNRNPA2B1, and EWSR1 (P = 0.046, using a hypergeometric distribution; Fig. 2h).
In the literature, it is highly debated whether the pathologies related to TDP-43 aggregation are due to altered transcriptional regulation or cell toxicity. Interestingly, a large portion of TDP-43 interactors in GPiNs (including two out of six ALS-associated interactors, FUS and HNRNPA2B1) are involved in RNA metabolism, suggesting that the role of TDP-43 in human neurons is mostly related to transcriptional and posttranscriptional regulation, as highlighted using Genoppi's gene set annotation feature (Fig. 2i). Importantly, none of the ALS-associated interactors involved in transcriptional or posttranscriptional regulation were identified among TDP-43 interactors in any of the cancer lines (Supplementary Data 5), suggesting that in a non-neuronal context, TDP-43 may function through a different molecular mechanism involving a discrete set of cell-type-specific interactors.
TDP-43 has previously been studied in human brain homogenates, which are an aggregate of many different cell types 38 , or in HEK cells 39 that are less relevant to its role in ALS or FTD. Our Genoppi analyses illustrate that GPiNs recapitulate the known biology of this protein and its physical interactions with a number of known ALS-related proteins. To confirm the interactors identified in GPiNs, we made biological replicates of the TDP-43 IP, quality-controlled 23 interactor-specific antibodies (Supplementary Data 1), and tested 23 interactors identified by IP-MS/MS through western blots (Supplementary Fig. 4 and Supplementary Data 6). We observed a validation rate of 21/23 (or 91.3%) among interactors that span a wide range of log2 FC values in our IP-MS/MS data and note that this validation rate is concordant with the FDR cutoff of 0.1 used to separate significant interactors from non-interactors in Genoppi. Importantly, we were able to validate both known TDP-43 interactors reported in InWeb (validation rate of 10/11, or 90.9%; including all five of the tested ALS-relevant proteins) and newly identified interactors (validation rate of 11/12, or 91.7%) with the same degree of success. Among the newly identified interactors, we validated eight out of nine (or 88.9%) interactors found in multiple cell lines and three out of three (or 100%) GPiN-specific interactors. Our validation experiments also span 19 interactors with two or more values imputed prior to log2 FC calculation (see "Methods"); 17 out of 19 (or 89.5%) could be confirmed, strongly supporting the procedure we used to impute missing values in the mass spectrometry data.
We further executed reciprocal IPs of five of the TDP-43 interactors validated by western blots, including three ALS-relevant proteins (MATR3, ATXN2, and FUS) and two non-ALS proteins (RBMX and PARP1; Supplementary Fig. 5a). We tested for the presence of TDP-43 in these IPs and were able to detect it in 80% of the experiments (Supplementary Note 2 and Supplementary Fig. 5b). The reciprocal IPs further validate our IP-MS/MS data and indicate that human iPSC-derived neurons (GPiNs) can be used as a cell model for studying TDP-43 interactions with proteins involved in neurodegenerative diseases. In the future, it will be of interest to test the functional significance of the convergence of ALS risk genes in the TDP-43 pathway and dissect the role of individual interactions between TDP-43 and ALS-related proteins in the context of transcriptional regulation.
Cell-type-specific and cell-type-independent interactions. We performed analogous IP-MS/MS experiments and Genoppi analyses of two more proteins (MDM2 and PTEN) that are also hypothesized to have divergent functions in cancer and neurodevelopment. Similar to the observations for BCL2 and TDP-43, we observed a large set of new interaction partners that can be replicated across multiple cell types, as well as a set of cell-type-specific interaction partners that can inform targeted hypotheses and follow-up experiments (Supplementary Note 3, Supplementary Fig. 2b-h, and Supplementary Data 3).
Most published protein-protein interactions to date were derived from large-scale screens using systems that lack human cell-type-specific information (e.g., highly proliferative cell lines such as HEK293 cells, or yeast two-hybrid screens). This means that high-quality interaction experiments executed in specific human cell types can lead to the discovery of many novel interactions. Indeed, when we combined the IP-MS/MS results of BCL2, TDP-43, MDM2, and PTEN across four different cell lines, we found that only 16.6% (144/870) of the interaction partners were reported in InWeb, meaning that up to 83.4% (726/870) of these interactions were new, offering potentially exciting insights into the biology of these proteins. Stratified by protein, 97.2%, 65.8%, 72.0%, and 93.0% of the interaction partners were new for BCL2, TDP-43, MDM2, and PTEN, respectively. Across cell lines, 73.4%, 80.6%, 79.7%, and 78.8% of the interaction partners identified for the four proteins were new in GPiN, G401, T47D, and A375 cells, respectively. We note that, while the statistics here were calculated based on known interactions curated in InWeb, most of the interaction partners were also new according to the iRefIndex or BioPlex database (Supplementary Data 3).
In Fig. 2b, we show that a sizable subset (~70%) of the newly identified interaction partners of BCL2 in GPiNs can be replicated in G401 cells, supporting the biological validity of these interactions even if they have not been previously reported in the literature. Here, we extended the same analysis to all four baits in all possible pairs of cell lines (Supplementary Fig. 6 and Supplementary Data 3), showing that although we identified a large set of new interaction partners for each bait, 54.7% (397/726) of them can be reproduced in multiple cell types. In other words, 37.8% (329/870) of the interactors are neither curated in InWeb nor recapitulated in multiple cell types. However, when we performed western blots to validate TDP-43 interactors identified in GPiNs, we observed comparable validation rates between known InWeb interactors (10/11, or 90.9%), non-InWeb interactors found in multiple cell lines (8/9, or 88.9%), and non-InWeb interactors found only in GPiNs (3/3, or 100%; Supplementary Note 2 and Supplementary Data 6). Overall, these observations support the robustness of the new interaction partners we report in this study and illustrate the remarkable opportunity for biological discovery through cell-type-specific proteomic experiments.
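The proportions reported in this and the preceding paragraph can be retraced from four counts stated in the text (870 interactions in total, 144 reported in InWeb, and 397 new interactions reproduced in multiple cell types). The short Python sketch below only redoes that arithmetic; the variable names are illustrative and no additional data are assumed.

```python
# Counts taken directly from the text.
total_interactions = 870
known_in_inweb = 144
new_reproduced_in_multiple_cell_types = 397

# Derived counts.
new_interactions = total_interactions - known_in_inweb
new_in_single_cell_type = new_interactions - new_reproduced_in_multiple_cell_types


def pct(part, whole):
    """Percentage rounded to one decimal place, as reported in the text."""
    return round(100 * part / whole, 1)


print(new_interactions, pct(new_interactions, total_interactions))
print(pct(new_reproduced_in_multiple_cell_types, new_interactions))
print(new_in_single_cell_type, pct(new_in_single_cell_type, total_interactions))
```

The derived values (726 new interactions, 329 of which are neither in InWeb nor reproduced across cell types) agree with the percentages quoted in the text.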
Finally, we clustered the four cell lines based on the overlap of interaction partners for each bait between pairs of cell lines, and observed a clear clustering of cancer cells (G401, T47D, and A375) versus GPiNs for three out of four bait proteins (BCL2, TDP-43, and PTEN; Supplementary Fig. 6). This indicates that, as expected, for these proteins the cancer cell interactomes are more similar to each other than to the neuronal interactomes. We pooled the interaction partners of all four bait proteins in the three cancer cell lines (cancer cell interactors, hereafter) and in the neurons (GPiN interactors, hereafter), and tested for enrichment of disease genes in the pooled interactors. While the cancer cell interactors were not enriched for cancer driver genes, GPiN interactors were nominally enriched for the ALS genes (P = 0.040, using a hypergeometric distribution).
Together, all four tested proteins exhibited a pattern of both unique interaction partners in each cell type and a statistically significant set of shared interaction partners across cell types. New interaction partners of TDP-43 in GPiNs were validated in ~90% of the cases, illustrating the reproducibility of the generated IP-MS/MS data. Our results further suggest that the neuron-specific interactions of BCL2 and TDP-43 link them to genes implicated genetically in ALS and its related biology. Overall, our data indicate that proteins have different groups of interaction partners, some that are cell-type-specific and some that are conserved across many cell types (i.e., cell-type-independent). In the examples we show in this paper, the cell-type-specific protein-protein interactions in a model of human neurons link the tested proteins more strongly to neurodevelopmental diseases than the interactions identified in cancer cell lines.

Discussion
Several programs are available to the community to analyze raw MS/MS data 40,41 , while other tools, such as ProHits-viz 42 , provide visualization capabilities to summarize protein interaction data as well as communicate quantitative differences between a protein of interest and its potential interaction partners. However, none of these tools focuses specifically on creating a systematic and unified workflow for integrating cell-type-specific quantitative proteomic datasets and genetic information. Genoppi is designed so it can be easily incorporated into any functional genomics pipeline by allowing users to integrate datasets, download the results, and modify the code as needed to extend the software and meet different usability requirements.
Beyond providing Genoppi as an accessible tool, we also apply it to analyze a large set of cell-type-specific protein interaction experiments. Together, these datasets provide insights into the interactome landscape of BCL2, TDP-43, MDM2, and PTEN, and open potential avenues to explore their links to cancers and neurological disorders based on the newly found cell-type-specific interactions. We use the generated data to showcase Genoppi as a resource that can be employed to combine original and published datasets in a simple and clear format, allowing systematic analysis, visualization, and exploration of otherwise heterogeneous proteomic and genetic datasets. Genoppi is available as both an R package and as a Shiny application with documentation and test datasets to get the users started (Supplementary Note 1). We believe that as more genetic and proteomic datasets become available, Genoppi will become an increasingly valuable resource for the scientific community.

Methods
Genoppi documentation. A user-friendly documentation of analytical and visualization features implemented in the Genoppi application (v1.0.0) is provided in Supplementary Note 1; documentation for the accompanying R package is available on GitHub (https://github.com/lagelab/Genoppi). This section provides additional technical details for analyses performed by Genoppi.
Moderated t test for identifying significant interaction partners. Given protein log2 FC values from ≥ 2 replicates, Genoppi performs a one-sample moderated t test from the limma 43 R package to calculate a two-tailed P value and Benjamini-Hochberg FDR for each protein. Limma was originally developed to robustly identify differentially expressed genes in microarray experiments and has since been used on a variety of data types, including proteomic results 44,45 . The empirical Bayes moderated t test is used in Genoppi, as it is less sensitive to underestimated sample variances and performs best on small sample sizes compared to the classical t test 44 . Throughout the paper, we define significant interaction partners of a bait protein as proteins with log2 FC > 0 and FDR ≤ 0.1 in the bait versus IgG IP-MS/MS data, but we note that Genoppi allows the user to adjust these thresholds according to their needs.
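To illustrate the thresholding step, the following Python sketch applies the Benjamini-Hochberg adjustment to a list of P values and then flags proteins with log2 FC > 0 and FDR ≤ 0.1. It is a simplified illustration under stated assumptions rather than the Genoppi implementation: the moderated t test itself (performed by limma in R) is not reproduced, the P values are taken as given inputs, and the function names, protein labels, and numbers are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P value so that the adjusted
    # values are monotone (each is the minimum over all larger ranks).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def call_interactors(proteins, log2_fc, p_values, fc_cutoff=0.0, fdr_cutoff=0.1):
    """Flag proteins with log2 FC above fc_cutoff and FDR at or below fdr_cutoff."""
    fdr = benjamini_hochberg(p_values)
    return [
        name
        for name, fc, q in zip(proteins, log2_fc, fdr)
        if fc > fc_cutoff and q <= fdr_cutoff
    ]


# Hypothetical example: four proteins with made-up statistics.
proteins = ["A", "B", "C", "D"]
log2_fc = [2.1, -1.5, 0.8, 3.0]
p_values = [0.001, 0.02, 0.03, 0.5]

fdr_values = benjamini_hochberg(p_values)
interactors = call_interactors(proteins, log2_fc, p_values)
print(fdr_values, interactors)
```

In this toy example, protein B is excluded because its log2 FC is negative and protein D because its FDR exceeds 0.1, leaving A and C as interactors. The two cutoffs are exposed as parameters to mirror the user-adjustable thresholds described above.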
SNP-to-gene mapping. To generate the precalculated data Genoppi uses for SNP-to-gene mapping, we filtered the 1000 Genomes Project 12 (phase 3) dataset to obtain genotype data for unrelated individuals residing in Utah with Northern and Western European ancestry and SNPs with minor allele frequency ≥ 0.05 and missing rate ≤ 0.1. Pairwise LD between SNPs was calculated using a sliding window of 200 kb, which is the default haplotype block estimation distance used in PLINK 46 (v1.07). Next, for each SNP, the LD genomic locus was defined as the region covered by other SNPs that have r2 > 0.6 with the SNP, ± 50 kb on either end. These parameters were chosen to comply with established community standards 34,47-49 . Using the precalculated LD locus boundaries, Genoppi can then identify all Ensembl 50 protein-coding genes whose coordinates overlap with LD loci given a SNP list of interest. If multiple genes are present in the locus defined by a SNP of interest, all genes are mapped to that SNP.
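The locus definition and gene overlap step described above can be sketched in a few lines of Python. This is an illustration under explicit assumptions, not the Genoppi code: the names `Gene`, `ld_locus`, and `genes_in_locus` are hypothetical, the index SNP itself is assumed to be included in the span, coordinates are treated as closed intervals on a single chromosome, the lower boundary is clamped at zero, and all positions and gene names are invented.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int


def ld_locus(snp_pos, ld_partners, r2_cutoff=0.6, flank=50_000):
    """Return (start, end) of the LD locus for one SNP.

    ld_partners is an iterable of (position, r2) pairs for SNPs on the same
    chromosome. The locus spans the index SNP (an assumption) and all
    partners with r2 above the cutoff, extended by `flank` bp on either end.
    """
    positions = [snp_pos] + [pos for pos, r2 in ld_partners if r2 > r2_cutoff]
    return max(min(positions) - flank, 0), max(positions) + flank


def genes_in_locus(chrom, locus, genes):
    """Return names of all genes whose coordinates overlap the locus."""
    start, end = locus
    return [
        g.name
        for g in genes
        if g.chrom == chrom and g.start <= end and g.end >= start
    ]


# Hypothetical SNP on chr1 with three LD partners; one falls below r2 = 0.6.
locus = ld_locus(1_000_000, [(950_000, 0.8), (1_120_000, 0.65), (1_300_000, 0.3)])

genes = [
    Gene("GENE_A", "chr1", 880_000, 905_000),      # overlaps the left flank
    Gene("GENE_B", "chr1", 1_180_000, 1_200_000),  # just outside the locus
    Gene("GENE_C", "chr2", 950_000, 1_000_000),    # different chromosome
    Gene("GENE_D", "chr1", 1_000_500, 1_050_000),  # inside the locus
]
mapped = genes_in_locus("chr1", locus, genes)
print(locus, mapped)
```

As in the description above, every gene overlapping the locus is returned, so a single SNP can map to several genes.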
To verify that SNPs are robustly mapped to genes using the mapping method in Genoppi, we mapped 20 random SNPs to genes using both Genoppi and Disease Association Protein-Protein Link Evaluator 51 (DAPPLE; v0.18 on https://gpbroad.boardinstitute.org), which is a standard tool for SNP-to-gene mapping. DAPPLE uses the following definition for LD locus of a SNP: "the region containing SNPs with r2 > 0.5… extended to the nearest recombination hotspot." Nonetheless, in our comparison test, 100% of genes mapped from SNPs using DAPPLE were analogously mapped using the Genoppi algorithm, illustrating the robustness of our approach.
Hypergeometric test for assessing overlap enrichment between datasets. One-tailed P values are calculated using a hypergeometric distribution to assess the enrichment of overlap between experimental proteomic results and other gene lists, known protein interactors from InWeb 7,8 , iRefIndex 9 , or BioPlex 10,11 , genes intolerant of loss-of-function (LoF) mutations derived from gnomAD 16 , and tissue-specific genes derived from GTEx 22,23 or HPA 24 . To test overlap with a gene list (e.g., known causal genes for a disease), the "population" (N) is defined as all genes encoding proteins identified in the experimental data and "success in population" (k) is defined as the subset of N that pass the user-defined significance threshold (i.e., genes encoding significant proteins). The "sample" (n) contains genes from the gene list that are found in N and "success in sample" (x) is the overlap between k and n. Similar definitions apply to test overlap with InWeb, iRefIndex, BioPlex, gnomAD, GTEx, or HPA data, except in this case the "population" (N) is the intersection of all genes in the experimental data and in the respective database, while the "sample" (n) is the subset of N consisting of known interactors for a chosen bait in InWeb, iRefIndex, or BioPlex, LoF-intolerant genes defined using a gnomAD pLI score cutoff, or tissue-specific genes in GTEx or HPA (Supplementary Note 1). The hypergeometric test 52 is performed to calculate the probability of observing x or more successes in the sample by chance. This procedure tests the statistical significance of the overlap between the proteomic and external datasets, while taking into consideration that only a subset of all proteins (and their corresponding genes) are identified in the proteomic data, meaning that the statistical test is conditional on the proteome of the cell type being tested.
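Using the definitions of N, k, n, and x given above, the one-tailed P value is the upper tail of the hypergeometric distribution, P(X ≥ x). The Python sketch below computes it with exact integer arithmetic from the standard library. It is an illustration of the test as described, not the Genoppi implementation; the function names and the toy gene identifiers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(N, k, n, x):
    """One-tailed P(X >= x) for a hypergeometric distribution.

    N: population size, k: successes in population,
    n: sample size, x: observed successes in sample.
    """
    if not (0 <= k <= N and 0 <= n <= N):
        raise ValueError("k and n must lie between 0 and N")
    upper = min(k, n)
    tail = sum(
        comb(k, i) * comb(N - k, n - i) for i in range(max(x, 0), upper + 1)
    )
    return tail / comb(N, n)


def overlap_enrichment(detected, significant, gene_list):
    """Test enrichment of a gene list among significant proteins.

    The test is conditional on the detected proteome: genes outside
    `detected` are dropped before counting.
    """
    population = set(detected)
    successes = set(significant) & population
    sample = set(gene_list) & population
    overlap = successes & sample
    p = hypergeom_enrichment_p(
        len(population), len(successes), len(sample), len(overlap)
    )
    return len(overlap), p


# Toy example: 10 detected genes, 4 significant, 3 listed genes detected.
detected = [f"g{i}" for i in range(10)]
significant = ["g0", "g1", "g2", "g3"]
gene_list = ["g0", "g1", "g7", "not_detected"]

n_overlap, p_value = overlap_enrichment(detected, significant, gene_list)
print(n_overlap, p_value)
```

In the toy example, two of the three detected list genes are significant, giving P = (C(4,2)·C(6,1) + C(4,3)·C(6,0)) / C(10,3) = 40/120 ≈ 0.33. The gene that was not detected is ignored, which reflects the conditioning on the expressed proteome described above.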
The hypergeometric test is not performed for genes derived from the SNP-to-gene mapping feature in Genoppi. Correctly testing the overlap between these genes and the proteomic results is a complicated statistical problem that can easily lead to confounded results and inflated P values. Confounders include whether the mapped gene is from a single-gene or multigenic locus, the gene length, and its tissue-specific expression pattern, to name a few. Performing this analysis accurately requires a workflow that is dataset-specific and is beyond the scope of Genoppi. To not mislead users, and to ensure that other statistical tests in Genoppi can be considered reliable, we do not test the statistical significance of the overlap between proteomic data and genes mapped from SNPs.
Cancer cell lines. We used the following cancer cell lines: A375 (ATCC CRL-1619), a human malignant melanoma cell line exhibiting a wild-type p53 genotype; G401 (ATCC CRL-1441), a kidney rhabdoid tumor cell line with a wild-type p53 genotype; and T47D (ATCC HTB-133), a human breast tumor cell line with a mutated p53 genotype. All cell lines were plated at a density of 40,000 cells cm−2 (uncoated plates). Cell maintenance media contained 10% fetal bovine serum and PenStrep (1:1000). A375 cells were cultured in DMEM (Thermo Scientific) and split every 2 days (at a 1:12 ratio), G401 cells were cultured in McCoy's 5A (Thermo Scientific) media and split every 2 days (at a 1:10 ratio), and T47D cells were cultured in RPMI no phenol red (Thermo Scientific) media and split every 3 days (at a 1:4 ratio). All cell lines were incubated at 37°C, 5% CO2. To achieve detachment during passaging, all cell lines were exposed to TrypLE (Thermo Scientific).
Protein extraction and immunoblotting. Total protein extract was obtained by harvesting cells and either processing them immediately or snap-freezing them on dry ice for storage at −80°C 53 . In both cases, cell pellets were washed with phosphate-buffered saline (PBS) and resuspended in 10× packed cell volume (PCV) IP lysis buffer (Pierce), with freshly added Halt protease and phosphatase inhibitors (Thermo Scientific). After a 20 min incubation at 4°C, cells were collected by centrifugation (16,200 × g, 20 min, 4°C) and resuspended in 3× PCV lysis buffer. The concentration of the samples was quantified using the Thermo BCA protein assay, and when not used immediately, samples were stored at −80°C. Samples for immunoblotting were diluted in 6×SMASH buffer (50 mM Tris-HCl pH 6.8, 10% glycerol, 2% sodium dodecyl sulfate (SDS), 0.02% bromophenol blue, 1% β-mercaptoethanol), boiled for 10 min at 95°C, separated on a NuPAGE 4-12% Bis-Tris Protein Gel (Thermo Scientific), and transferred onto a PVDF membrane (Thermo Scientific) by wet transfer (100 V for 2 h). Membranes were blocked by incubation for 1 h at room temperature in 10 mL Tris-buffered saline (TBS) and 0.1% Tween (TBST) with 5% (w/v) Bio-Rad Blotting-grade Blocker. Blots were incubated overnight at 4°C with the primary antibody (1:1000), washed three times for 10 min with TBST, and incubated for 45 min with the secondary horseradish peroxidase-conjugated antibody. After washing three times for 5 min with TBST, bands were visualized using ECL (GE Healthcare). All antibodies used in this study are listed in Supplementary Data 1.
Immunoprecipitations. For each individual experiment, 1-2 mg of protein extract was incubated at 4°C overnight in the presence of 1-2 μg of the relevant antibody. The next day, 50 μL of Protein G beads (Pierce) were added to each sample and incubated at 4°C for 4 h. Flow-through was collected and beads were washed once with 1 mL lysis buffer (Pierce) supplemented with Halt protease and phosphatase inhibitors (Thermo Scientific), and twice with PBS. Beads were resuspended in 60 μL of PBS and 10% of the volume was employed for immunoblotting, after being boiled in 6×SMASH buffer (50 mM Tris-HCl pH 6.8, 10% glycerol, 2% SDS, 0.02% bromophenol blue, 1% β-mercaptoethanol) for 10 min at 95°C. The remaining volume was stored at −80°C and subsequently used for MS analysis.
Sample preparation for MS. All immunoprecipitated samples (n = 48) and IgG controls (n = 12) were in PBS buffer on beads. PBS was removed and samples were dissolved in 50 µL TEAB (triethylammonium bicarbonate, 50 mM) buffer, followed by trypsin (Promega) digestion for 3 h at 38°C. Digested samples were dried to 20 µL, and 10 µL of each sample was injected into the mass spectrometer.
Mass spectrometry. LC-MS/MS was performed on a Lumos Tribrid Orbitrap Mass Spectrometer (Thermo Scientific) equipped with Ultimate 3000 (Thermo Scientific) nano-high-performance liquid chromatography. Peptides were separated onto a 150-µm inner diameter microcapillary trapping column, packed with ~2 cm of C18 Reprosil resin (5 µm, 100 Å, Dr. Maisch GmbH, Germany), followed by separation on a 50-cm analytical column (PharmaFluidics, Ghent, Belgium). Separation was achieved by applying a gradient from 5 to 27% acetonitrile in 0.1% formic acid for > 90 min at 200 nL min⁻¹. Electrospray ionization was enabled by applying a voltage of 2 kV using a home-made electrode junction at the end of the microcapillary column and sprayed from metal tips (PepSep, Denmark). The MS survey scan was performed in the Orbitrap over a range of 400-1800 m/z at a resolution of 6 × 10⁴, followed by the selection of the 20 most intense ions (TOP20) for CID-MS2 fragmentation in the ion trap using a precursor isolation width window of 2 m/z, automatic gain control setting of 10,000, and a maximum ion accumulation of 100 ms. Singly charged ion species were not subjected to collision-induced dissociation fragmentation. Normalized collision energy was set to 35 V, with an activation time of 10 ms. Ions within a 10 p.p.m. m/z window around ions selected for MS2 were excluded from further selection for fragmentation for 60 s.
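As a worked illustration of the dynamic exclusion setting described above, the following minimal sketch converts a p.p.m. tolerance into an absolute m/z window. It assumes the 10 p.p.m. value is applied symmetrically (±10 p.p.m.) around the selected precursor, which the text does not state explicitly. The function names `ppm_window` and `is_excluded` are hypothetical and are not part of any instrument software.

```python
def ppm_window(mz: float, ppm: float = 10.0) -> tuple:
    """Return the (low, high) m/z bounds of a symmetric +/- ppm window around mz."""
    delta = mz * ppm / 1e6  # 1 p.p.m. is one millionth of the m/z value
    return mz - delta, mz + delta


def is_excluded(candidate_mz: float, selected_mz: float, ppm: float = 10.0) -> bool:
    """True if candidate_mz falls inside the exclusion window of a selected ion."""
    low, high = ppm_window(selected_mz, ppm)
    return low <= candidate_mz <= high


# Example: at m/z 500, 10 p.p.m. corresponds to 0.005 m/z on either side.
low, high = ppm_window(500.0)
print(f"{low:.4f} to {high:.4f}")
```

Because the window scales with m/z, the same p.p.m. setting excludes a wider absolute range for heavier precursors.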
Raw data were analyzed with Proteome Discoverer (v2.4; Thermo Scientific). Assignment of MS/MS spectra was performed using the Sequest HT algorithm by searching the data against a protein sequence database including all entries from the Uniprot_Human2018_SPonly database 54 as well as known contaminants such as human keratins and common laboratory contaminants. Quantitative analysis was performed by label-free quantitation (LFQ) between different sets of samples. Sequest HT searches were performed using a 10 p.p.m. precursor ion tolerance and requiring each peptide's N/C termini to adhere to trypsin protease specificity, while allowing up to two missed cleavages. CID-MS2 spectra were searched with 0.5 Da ion tolerance for fragmentation. Methionine oxidation (+15.99492 Da) was set as a variable modification. An MS2 spectra assignment FDR of 1% was applied to both proteins and peptides using the Percolator target-decoy database search.
Proteomic data preprocessing. Starting with protein-level LFQ reports, we performed the following preprocessing steps before inputting the data into Genoppi: (1) performed log₂ transformation and median normalization of protein intensity values in each experimental sample; (2) filtered out contaminants and protein entries supported by < 2 unique peptides; (3) imputed missing protein intensity values in each sample (see details below); (4) calculated log₂ FC for each pair of replicate samples (e.g., bait versus IgG control). All preprocessed data and subsequent analysis results (average log₂ FC, P value, and FDR calculated in Genoppi) can be found in Supplementary Data 2.
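Steps (1), (2), and (4) above can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: median normalization is assumed to mean subtracting each sample's median from its log₂ intensities, missing values are represented as `None`, and all function names and intensity values are hypothetical. Imputation (step 3) is left out, so the example uses complete data.

```python
import math
from statistics import median


def log2_median_normalize(intensities):
    """Step 1: log2-transform one sample and centre it on its own median.

    None marks a missing intensity and is passed through unchanged.
    """
    logged = [math.log2(x) if x is not None else None for x in intensities]
    sample_median = median(v for v in logged if v is not None)
    return [v - sample_median if v is not None else None for v in logged]


def passes_filter(is_contaminant, n_unique_peptides):
    """Step 2: drop contaminants and entries with fewer than 2 unique peptides."""
    return (not is_contaminant) and n_unique_peptides >= 2


def log2_fold_change(bait, control):
    """Step 4: per-protein log2 FC for one replicate pair (bait minus control)."""
    return [b - c for b, c in zip(bait, control)]


# Hypothetical raw intensities for five proteins in one replicate pair.
bait_raw = [64.0, 8.0, 4.0, 2.0, 1.0]
igg_raw = [4.0, 8.0, 4.0, 2.0, 1.0]
bait = log2_median_normalize(bait_raw)
igg = log2_median_normalize(igg_raw)
fc = log2_fold_change(bait, igg)  # only the first protein is enriched
```

Centring each sample on its median before taking differences keeps a global shift in loading between bait and control from appearing as a fold change.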
In order to derive log₂ FC statistics for each protein detected in the proteomic data, we imputed missing protein intensity values in each sample using a well-established approach 55 prior to calculating log₂ FC for a pair of samples. Specifically, to replace each missing value in a sample, we randomly sampled from a normal distribution with a mean of μ − 1.8σ and standard deviation of 0.3σ, where μ and σ are the mean and standard deviation of the observed intensity values in the sample. This procedure works under the assumption that proteins with missing values likely have low intensities below the MS detection threshold, and therefore places these proteins in the lower tail of the observed intensity distribution. As positive controls for this imputation strategy, some of our bait proteins and their known InWeb interaction partners needed to be imputed as they were not detected in the IgG controls; after imputation, they generally showed statistically significant log₂ FC values that overlapped well with the non-imputed log₂ FC distribution. As further support, we also successfully validated 17 out of 19 (89.5%) TDP-43 interactors with ≥ 2 imputed values prior to log₂ FC calculation in GPiNs (Supplementary Note 2).
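The imputation rule above can be written as a short sketch. It assumes intensities are already on the log₂ scale, that missing values are encoded as `None`, and that σ is the sample standard deviation (the text does not specify sample versus population). The function name `impute_missing` and the example values are hypothetical.

```python
import random
from statistics import mean, stdev


def impute_missing(values, shift=1.8, scale=0.3, rng=None):
    """Replace None entries with draws from N(mu - shift*sigma, (scale*sigma)^2).

    mu and sigma are the mean and standard deviation of the observed
    (non-missing) values in this one sample.
    """
    rng = rng or random.Random()
    observed = [v for v in values if v is not None]
    mu, sigma = mean(observed), stdev(observed)  # sample SD is an assumption
    return [
        v if v is not None else rng.gauss(mu - shift * sigma, scale * sigma)
        for v in values
    ]


# Hypothetical log2 intensities for one sample, with two missing proteins.
sample = [20.0, 22.0, 24.0, None, None]
filled = impute_missing(sample, rng=random.Random(0))
```

With μ = 22 and σ = 2 in this toy sample, imputed values are drawn around 18.4 with a standard deviation of 0.6, which places them in the lower tail of the observed distribution. Passing a seeded generator makes the imputation reproducible.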

SAINTexpress analysis. To assess the robustness of the statistical method (moderated t test from the limma R package) implemented in Genoppi when applied to our experimental data, we also used an alternative statistical method, SAINTexpress 56 , to identify significant interaction partners of the four bait proteins in the four cell lines (Supplementary Data 7). For each bait in each cell line, we used the intensity data (from protein-level LFQ reports) for the bait versus IgG control samples as input to run SAINTexpress, excluding contaminants and protein entries supported by < 2 unique peptides to be consistent with the filtering used for the analogous Genoppi analysis. We assessed the overlap between the significant interactors identified by Genoppi (log₂ FC > 0 and FDR ≤ 0.1) and by SAINTexpress (BFDR ≤ 0.1; Supplementary Data 7). On average, SAINTexpress identified more significant interactors compared to Genoppi, but ~85% of the interactors identified by Genoppi were also identified by SAINTexpress. This indicates that there is good agreement between the two methods when applied to our experimental data, and that most of the significant interaction partners we identified using Genoppi could be recapitulated using an alternative, established method.
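The overlap comparison can be illustrated with a minimal sketch that applies the two significance thresholds from the text and reports the fraction of Genoppi interactors also called by SAINTexpress. The record layout, function names, and protein labels (P1 to P5) are hypothetical, and the toy numbers are not results from the study.

```python
def genoppi_significant(rows, fdr_cutoff=0.1):
    """Proteins with log2 FC > 0 and FDR <= cutoff."""
    return {r["gene"] for r in rows if r["log2fc"] > 0 and r["fdr"] <= fdr_cutoff}


def saint_significant(rows, bfdr_cutoff=0.1):
    """Proteins with SAINTexpress BFDR <= cutoff."""
    return {r["gene"] for r in rows if r["bfdr"] <= bfdr_cutoff}


def fraction_recapitulated(genoppi_hits, saint_hits):
    """Share of Genoppi interactors that SAINTexpress also calls significant."""
    if not genoppi_hits:
        return float("nan")
    return len(genoppi_hits & saint_hits) / len(genoppi_hits)


# Hypothetical per-protein results for one bait in one cell line.
genoppi_rows = [
    {"gene": "P1", "log2fc": 2.1, "fdr": 0.01},
    {"gene": "P2", "log2fc": 1.5, "fdr": 0.05},
    {"gene": "P3", "log2fc": -1.0, "fdr": 0.02},  # significant but depleted
    {"gene": "P4", "log2fc": 0.8, "fdr": 0.30},
]
saint_rows = [
    {"gene": "P1", "bfdr": 0.00},
    {"gene": "P2", "bfdr": 0.20},
    {"gene": "P4", "bfdr": 0.05},
    {"gene": "P5", "bfdr": 0.01},
]
g_hits = genoppi_significant(genoppi_rows)
s_hits = saint_significant(saint_rows)
shared_fraction = fraction_recapitulated(g_hits, s_hits)
```

The fraction is computed relative to the Genoppi set, matching the direction of the comparison reported in the text.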
Lists of disease-associated genes. In order to investigate the overlap between interactors identified in our proteomic data and disease-associated genes, we compiled several gene lists from published genetic studies (Supplementary Data 4). For cancer, we used a list of 260 genes identified by exome sequencing 31 . For neuropsychiatric disease, we aggregated a total of 571 unique genes that have been implicated in ASD or SCZ. This list includes ASD genes identified by exome sequencing 32 and mapped from genome-wide significant GWAS index SNPs 33 using Genoppi's SNP-to-gene mapping framework. For SCZ, we mapped genome-wide significant GWAS regions 34 to genes overlapping these regions. For ALS, we curated a list of 53 genes based on literature review 35,36 .
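Mapping GWAS regions to overlapping genes, as described for SCZ, amounts to an interval-overlap test. The sketch below is one simple way to express it, assuming closed coordinate intervals on the same chromosome and no flanking window around genes; it is not Genoppi's implementation, and the gene names and coordinates are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def genes_in_regions(regions, genes):
    """Return sorted names of genes overlapping any region.

    regions: iterable of (chrom, start, end)
    genes:   iterable of (name, chrom, start, end)
    """
    hits = set()
    for r_chrom, r_start, r_end in regions:
        for name, g_chrom, g_start, g_end in genes:
            if r_chrom == g_chrom and overlaps(r_start, r_end, g_start, g_end):
                hits.add(name)
    return sorted(hits)


# Hypothetical region and gene coordinates.
regions = [("chr1", 1000, 2000)]
genes = [
    ("GENE_A", "chr1", 1500, 2500),  # partially overlaps the region
    ("GENE_B", "chr1", 2001, 3000),  # starts just past the region end
    ("GENE_C", "chr2", 1000, 2000),  # same coordinates, different chromosome
]
mapped = genes_in_regions(regions, genes)
```

The nested loop is adequate for a few hundred regions; a sorted or tree-based interval index would be a natural replacement for genome-scale inputs.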
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
Raw IP-MS/MS data generated in this study have been deposited to the ProteomeXchange Consortium via the PRIDE 57 31 , ASD genes from Satterstrom et al. 32 , ASD GWAS loci from Grove et al. 33 , SCZ GWAS loci from PGC 34 , ALS genes from Farhan et al. 35 and Volk et al. 36 . Source data are provided with this paper.