Machine learning analysis identifies genes differentiating triple negative breast cancers

Triple negative breast cancer (TNBC) is one of the most aggressive forms of breast cancer (BC), with the highest mortality due to a high rate of relapse, resistance, and the lack of an effective treatment. Various molecular approaches have been used to target TNBC, but with little success. Here, using machine learning algorithms, we analyzed the available BC data from the Cancer Genome Atlas Network (TCGA) and identified two potential genes, TBC1D9 (TBC1 domain family member 9) and MFGE8 (Milk Fat Globule-EGF Factor 8 Protein), that could successfully differentiate TNBC from non-TNBC, irrespective of their heterogeneity. TBC1D9 is under-expressed in TNBC as compared to non-TNBC patients, while MFGE8 is over-expressed. Overexpression of TBC1D9 is associated with a better prognosis, whereas overexpression of MFGE8 correlates with a poor prognosis. Protein–protein interaction analysis by affinity purification mass spectrometry (AP-MS) and proximity biotinylation (BioID) experiments identified a role for TBC1D9 in maintaining cellular integrity, whereas MFGE8 appears to be involved in various tumor survival processes. These promising genes could serve as biomarkers for TNBC and deserve further investigation, as they have the potential to be developed as therapeutic targets for TNBC.

www.nature.com/scientificreports/
One way of extracting useful information is by machine learning. Machine learning (ML) refers to computer-based algorithms and statistical models that use data for training, learn from patterns and inferences in the data, and improve with experience (the number of times they read the data), without being explicitly programmed for the desired task 11. Different algorithms can be used, such as Decision tree (DT), Random forest (RF) and Set covering machine (SCM). A DT uses a tree-like graph that comprises decision models consisting of all the possible outcomes 12. RF is a classification or regression method consisting of multiple DTs, where the final output is the mode (for classification) or the mean (for regression) of the outputs of all the DTs 12. SCM is an algorithm whose goal is to learn a conjunction or a disjunction of rules. This is achieved by finding the decision function that depends on the smallest number of attributes 13.
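To make the idea of learning a small conjunction of rules concrete, the following is a minimal sketch of a greedy set-covering learner. It is an illustration under stated assumptions, not the implementation used in this study: the rule form (a threshold on a single feature), the function names, the toy data and the chosen values of the penalty `p` and stopping point `s` are all hypothetical.

```python
from typing import List, Tuple

# A rule is (feature_index, threshold, direction): direction +1 means
# "x[feature] > threshold", direction -1 means "x[feature] <= threshold".
Rule = Tuple[int, float, int]


def rule_holds(rule: Rule, x: List[float]) -> bool:
    j, t, d = rule
    return x[j] > t if d == 1 else x[j] <= t


def predict(rules: List[Rule], x: List[float]) -> int:
    """A conjunction predicts positive (1) only if every rule holds."""
    return int(all(rule_holds(r, x) for r in rules))


def candidate_rules(X: List[List[float]]) -> List[Rule]:
    """Hypothetical rule set: both directions at every observed value."""
    rules: List[Rule] = []
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            rules.extend([(j, t, 1), (j, t, -1)])
    return rules


def fit_conjunction(X, y, p=1.0, s=3) -> List[Rule]:
    """Greedily add the rule that rejects the most remaining negatives,
    penalised by p for each remaining positive it also rejects.
    Stops after s rules or when no negative example remains."""
    negatives = [i for i, label in enumerate(y) if label == 0]
    positives = [i for i, label in enumerate(y) if label == 1]
    candidates = candidate_rules(X)
    chosen: List[Rule] = []
    while negatives and len(chosen) < s:
        def utility(rule: Rule) -> float:
            covered = sum(not rule_holds(rule, X[i]) for i in negatives)
            lost = sum(not rule_holds(rule, X[i]) for i in positives)
            return covered - p * lost

        best = max(candidates, key=utility)
        if utility(best) <= 0:
            break  # no remaining rule is worth adding
        chosen.append(best)
        # Keep only the examples that the new rule has not yet rejected.
        negatives = [i for i in negatives if rule_holds(best, X[i])]
        positives = [i for i in positives if rule_holds(best, X[i])]
    return chosen


# Toy data (hypothetical): two expression-like features per sample.
X_toy = [[1.0, 9.0], [2.0, 8.0], [8.0, 9.0], [9.0, 2.0], [1.5, 1.0]]
y_toy = [1, 1, 0, 0, 0]
rules = fit_conjunction(X_toy, y_toy, p=2.0, s=3)
```

The resulting model is a short list of threshold rules, which is why this family of learners is attractive when the goal is to read off a handful of candidate genes rather than to obtain an opaque predictor.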
In the present study, we have analyzed a dataset consisting of 877 BC patients from The Cancer Genome Atlas Program (TCGA) by three different machine learning algorithms: DT, RF and SCM. The analysis identified 20 genes, out of which two genes were characterized further, namely TBC1 Domain Family Member 9 (TBC1D9), a GTPase activating protein, and Milk fat globule-EGF factor 8 (MFGE8), also known as lactadherin, which is a membrane glycoprotein. These identified genes were able to differentiate TNBC from non-TNBC, irrespective of their heterogeneity. The protein-protein interaction analysis highlights their potential as therapeutic targets for this highly aggressive subgroup of BC.

Results
Machine learning algorithms identify potential genes differentiating TNBC from non-TNBC. Treatment of TNBC requires a gene or a gene set that can simply differentiate TNBC from all other BC subgroups taking into consideration the complexity of its classification. With this goal, we analyzed the TCGA-BRCA dataset from TCGA portal by three different ML algorithms, namely SCM, DT and RF (Fig. 1).
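The repeated train/test procedure used for this analysis (80% of patients for training, 20% for testing, 100 repeats) and the counting of recurrently selected genes can be sketched as follows. This is only an illustrative outline: the function names, the random seed and the example gene lists are assumptions, and the actual model fitting is omitted.

```python
import random
from collections import Counter


def repeated_splits(n_patients, n_repeats=100, train_fraction=0.8, seed=0):
    """Yield (train_indices, test_indices) for each random repeat."""
    rng = random.Random(seed)
    n_train = int(round(train_fraction * n_patients))
    for _ in range(n_repeats):
        order = list(range(n_patients))
        rng.shuffle(order)
        yield order[:n_train], order[n_train:]


def count_selected_features(models):
    """Count in how many models each feature (gene) appears.

    `models` is a list of feature lists, one list per fitted model."""
    counts = Counter()
    for features in models:
        counts.update(set(features))
    return counts


splits = list(repeated_splits(877))

# Hypothetical feature lists standing in for three fitted models.
example_models = [["TBC1D9", "MFGE8"], ["TBC1D9"], ["MFGE8", "ESR1"]]
gene_counts = count_selected_features(example_models)
```

Genes that recur across many repeats are less likely to reflect a single lucky split, which is the rationale for ranking candidates by their number of repeats.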
Figure 1. Machine learning (ML) analysis pipeline. A dataset consisting of 877 patients was selected, which comprises data from RNAseq, methylome and miRNA analysis. This dataset was divided into a training set (80% of patients) and a test set (20% of patients). The training set (blue) was used to train three ML algorithms: decision tree (DT), Random forest (RF) and Set-covering machine (SCM) to identify genes differentially expressed in triple negative breast cancer (TNBC) or non-TNBC. The process was repeated 100 times (N = number of repeats) to come up with the best learner model. This model was then applied to the test set (red) to get the grouping of genes according to TNBC/non-TNBC subtypes. This process was also repeated 100 times to validate our findings. The output is a conjunction of rules for SCM and tree(s) for DT/RF, which led to 20 potential genes.
Selection of three potential genes based on BC patients' survival outcome. The 20 genes identified by ML were further analyzed to come up with the most promising genes. We first analyzed the effect of these genes on survival outcome in 2,164 patients from 16 different datasets (Fig. 2A). The heatmap is based on the meta z-score obtained when survival z-scores are collapsed by cancer/cancer subtype. We then selected the top three genes with the best or the worst effect on survival outcome, and with the maximum repeats in ML analyses. This led to the following three genes: TBC1D9 (TBC1 domain family member 9), SLC16A6 (Solute Carrier Family 16 Member 6) and MFGE8 (Milk Fat Globule-EGF Factor 8 Protein). Expression of TBC1D9 and SLC16A6 was associated with better survival outcome among BC patients, whereas expression of MFGE8 was associated with poor survival outcome (Fig. 2A). We further investigated the effect of these three genes on the survival of BC patients using the online tool developed by Gyorffy et al., known as KM (Kaplan-Meier) plotter (Fig. 2B) 14.
The analysis showed that BC patients with high expression of TBC1D9 had better survival outcome for distant metastasis-free survival (DMFS) and post-progression survival (PPS), with p-values of 0.0014 and 0.0088, respectively. SLC16A6 showed similar trends for both DMFS (p-value = 0.072) and PPS (p-value = 0.011), but the p-value for DMFS was not significant. On the other hand, BC patients with high expression of MFGE8 had poor survival outcome for both DMFS (p-value = 0.019) and PPS (p-value = 0.031).
The three selected genes effectively differentiate TNBC from non-TNBC in different patient cohorts. Using the TCGA provisional dataset from the cBioPortal for cancer genomics, the expression pattern of the three selected genes in 1,101 patients was verified (Fig. 3A). Based on their expression pattern, TBC1D9 and SLC16A6 expression were higher in non-TNBC, whereas MFGE8 was more expressed in TNBC patients.
The expression level of these genes was also verified in the original dataset consisting of 877 patients and the same result was obtained (Fig. 3B). To validate our findings in an independent cohort, 13 TNBC and 12 non-TNBC patients were selected from the tissue bank of Centre des Maladies du Sein (Hôpital du St-Sacrement, Quebec, Canada) with the aid of senior pathologists, and the expression of these three genes was verified by qPCR in these samples. The expression pattern in these samples further confirmed our findings (Fig. 3B). For MFGE8 expression, we obtained a p-value of 0.72 when comparing non-TNBC (Luminal A, Luminal B and HER2) to TNBC. However, a p-value of 0.16 was obtained when the expression of MFGE8 was compared in the non-TNBC (excluding HER2 subgroup) vs TNBC group.
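The qPCR comparisons between groups were made with the Wilcoxon rank sum test. As an illustration of how such a comparison works, the sketch below implements a two-sided rank sum test using the normal approximation, without tie or continuity correction. The function names and the example values are hypothetical, and for cohorts as small as the one described here an exact test from a statistics package would normally be preferred.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Return (z, two-sided p-value) under the normal approximation."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[:n_a])
    expected = n_a * (n_a + n_b + 1) / 2
    variance = n_a * n_b * (n_a + n_b + 1) / 12
    z = (rank_sum_a - expected) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical expression values for two groups of samples.
z_stat, p_val = rank_sum_test([1.0, 2.0, 3.0, 4.0, 5.0],
                              [6.0, 7.0, 8.0, 9.0, 10.0])
```

Because the test uses only ranks, it does not assume normally distributed expression values, which suits small tissue cohorts.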
Protein-protein interaction analysis highlights a role for TBC1D9 in maintaining cellular integrity. The evidence of association of TBC1D9 expression with better survival of BC patients led us to explore the role of this protein with regard to its interacting partners. Hence, AP-MS and BioID experiments were performed to identify its interactors and in turn understand its role in biological processes. The data obtained from AP-MS or BioID experiments were analyzed using SAINTexpress. Enforcing a SAINTexpress BFDR cutoff of ≤ 0.01, 68 and 77 significant interactors were identified by AP-MS and BioID, respectively (Supplementary Tables 1 and 2). These proteins were further filtered for possible non-specific interactors using the Crapome portal, yielding final datasets of 52 and 67 significant protein interactors by AP-MS and BioID, respectively (Supplementary Table 3). Compared with the already known interactors from Biogrid (https://thebiogrid.org/) (Supplementary Fig. 1A), four proteins were identified by AP-MS (MAP1LC3B, ARL8A, CPT1A and SRSF2), one with BioID (ABHD16A) and five with both AP-MS and BioID (PRPF38B, DDX41, YME1L1, SSB and SNRPE). Of these, only two proteins were significant according to our cut-off: ARL8A (BFDR = 0) and ABHD16A (BFDR = 0.01) (Supplementary Tables 1 and 2).
The proteins identified by both methods were further analyzed using the Metascape online tool to better understand their roles. The circos plot in Fig. 4A depicts the comparison of the data obtained with AP-MS and BioID. The red part of the circle represents the AP-MS data whereas the blue part represents the BioID data. The inner circle (orange) represents each protein identified. The dark orange colour represents proteins that appear in multiple lists. Proteins sharing gene ontology (GO) terms are connected by purple lines. To understand which pathways these proteins affect, a comprehensive protein-protein interaction (PPI) network was generated by Metascape from both AP-MS and BioID data (Fig. 4B). Metascape applies the Molecular Complex Detection (MCODE) algorithm to the resulting network to identify tightly connected network cores. It then analyzes each network component for pathway enrichment and, based on this, assigns biological functions. The analysis highlighted the enrichment of pathways related to metabolism of lipids and organelle localization by both AP-MS and BioID (Fig. 4C). The processes affected by the metabolism of lipids pathway are metabolism of lipids (logP = −5.6), glycerophospholipid metabolic process (logP = −3.0), and glycerolipid metabolic process (logP = −2.3) (Supplementary Table 4). For organelle localization, the major processes affected are organelle localization (logP = −3.8), microtubule-based processes (logP = −3.5), loss of Nlp from mitotic centrosomes (logP = −3.4), AURKA activation by TPX2 (logP = −3.4), centrosome maturation (logP = −3.2), regulation of PLK1 activity at G2/M transition (logP = −3.1), and recruitment of NuMa to mitotic centrosome (logP = −2.9) (Supplementary Table 4).
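For readers less familiar with the logP notation, the short sketch below converts a few of the reported values into p-values. It assumes that logP denotes the base-10 logarithm of the enrichment p-value, which is how Metascape output is usually read; the function name is hypothetical.

```python
def logp_to_pvalue(log_p: float) -> float:
    """Convert a base-10 log p-value (assumed convention) to a p-value."""
    return 10.0 ** log_p


# logP values reported above for the TBC1D9 interactors.
enrichment_logp = {
    "metabolism of lipids": -5.6,
    "organelle localization": -3.8,
    "glycerophospholipid metabolic process": -3.0,
    "glycerolipid metabolic process": -2.3,
}

# Terms ordered from most to least significant.
ranked_terms = sorted(enrichment_logp, key=enrichment_logp.get)
p_values = {term: logp_to_pvalue(v) for term, v in enrichment_logp.items()}
```

Under this convention, a more negative logP corresponds to a smaller p-value and therefore to a stronger enrichment.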
Protein-protein interaction analysis highlights a role for MFGE8 in many oncogenic processes. AP-MS and BioID experiments were performed for MFGE8 to explore its protein-protein interaction network, in order to better understand the high expression of MFGE8 in TNBC and its correlation with poor prognosis in breast cancer. The AP-MS and BioID data were analyzed in the same way as for TBC1D9. One […] Tables 5 and 6). These interactors were further filtered using Crapome data, which led to 123 interactors by AP-MS and 9 interactors by BioID (Supplementary Table 7). When compared with the known interactors from Biogrid (Supplementary Fig. 1B), one protein was found by AP-MS (ABCE1), two proteins by BioID (YTHDF2 and MTDH) and one by both (FUS), but none met the cutoff of BFDR ≤ 0.01 (Supplementary Tables 5 and 6). Protein interactors for MFGE8 were also analyzed with the Metascape online tool to identify the biological processes in which MFGE8 is involved. Few significant interactors were found by BioID, as depicted in the circos plot in Fig. 5A, where the red part represents AP-MS and the blue part BioID. Few protein interactors sharing GO terms were found, as represented by the purple lines in the circos plot. A PPI network was prepared to identify biological functions (Fig. 5B). According to the interactors identified by AP-MS, an enrichment of protein deglycosylation (logP = −9.8), carbohydrate derivative biosynthetic process (logP = −9.4), mitochondrial tRNA aminoacylation (logP = −5.4), protein quality control for misfolded or incompletely synthesized proteins (logP = −4.9), cofactor metabolic process (logP = −4.9) and lysosome organization (logP = −4.8) was identified (Fig. 5C). For BioID, an enrichment of small molecule catabolic process was obtained (logP = −3.3) (Fig. 5C, Supplementary Table 8).
In addition, BioID identified a few proteins involved in cofactor catabolic process (ALDH1L2, DXDR), protein deglycosylation (TRIM13, DAD1) and carbohydrate derivative catabolic process (TRIM13, DAD1 and DXDR) (Fig. 5C).

Discussion
TNBC, the most heterogeneous and aggressive BC, lacks any effective therapy to date. The response of TNBC to neoadjuvant therapy looks promising at first glance, with a pathological complete response (pCR) of around 30-40% at the time of surgery 5,15. Yet, any trace of residual disease after neoadjuvant therapy results in a 6 times higher risk of relapse and a more than 12 times higher risk of metastasis 5,16. Moreover, the mean survival time for patients who relapse is less than 13 months 17. Additionally, the response to chemotherapy varies with the different subtypes of TNBC. According to Masuda et al., BL1 has the highest pCR rate (52%), whereas the BL2 and LAR subgroups show the lowest pCR rates (0% and 10%, respectively) 16. Intrasubtype variation further complicates the outcomes of treatments. Many studies have sought to understand the molecular traits of TNBC in order to come up with potential therapeutic targets 18,19. Many of these targets are under clinical trials and include growth factor receptors (EGFR, cMET, VEGFR), downstream signalling (PI3K/mTOR pathway, SRC, WNT signaling), cell cycle checkpoints (CHK1/2), PARP, the androgen receptor and so on (ClinicalTrials.gov). Most are effective only for a subgroup of breast cancer and have not yet been very promising due to many other underlying factors. In this study, we have identified genes that could differentiate TNBC from other BC, irrespective of their heterogeneity.
In this study, we have taken advantage of the vast amount of available data stored in TCGA. We selected a dataset consisting of 63% Luminal A, 16% Luminal B, 5% HER2+ and 16% TNBC, representative of the BC prevalence. The analysis of the dataset by ML led to 20 potential genes differentiating TNBC from non-TNBC. We identified 15 downregulated and 5 upregulated genes in TNBC as compared to non-TNBC. Most notable is the identification of ESR1 (estrogen receptor), which was downregulated in TNBC, further confirming the efficacy of the ML analysis. These genes were further evaluated for their survival outcome across 40 different cancers. Most strikingly, the identified genes showed a trend of better or poor survival outcome only in BC patient samples based on their expression pattern (Fig. 2A). Further analysis of the three selected genes (based on survival outcome and number of repeats in the ML analysis), TBC1D9, SLC16A6 and MFGE8, showed that each has an effect on DMFS and PPS (Fig. 2B), where the expression of the first two genes (TBC1D9 and SLC16A6) is associated with better survival outcome. In contrast, MFGE8 is associated with poor survival outcome. Since TNBC is the most aggressive form of BC and the chances of metastasis and relapse are very high, this finding suggests that these genes might play a very important role in TNBC recurrence and spread.
The analysis of TCGA-BRCA RNAseq dataset confirmed that these genes are indeed able to differentiate TNBC from non-TNBC patients (Fig. 3A). The expression of these three genes was further validated in tissue samples. The same expression pattern as in the ML analysis, i.e. TBC1D9 and SLC16A6 were downregulated in TNBC, whereas MFGE8 was upregulated, was obtained (Fig. 3B), particularly after exclusion of HER2 subtype samples, although statistical significance was not reached because the sample size was too small. These results are not surprising since the expression of several genes was found to be highly correlated with HER2 status measured at the RNA levels (TCGA analysis), but less correlated at the protein levels (samples analysis) 20 .
TBC1D9 is a GTPase activating protein whose expression has been shown to be linked to low mortality and recurrence in breast cancer 21. SLC16A6 is a transporter for monocarboxylates across the plasma membrane. Polymorphisms in the SLC16A6 gene have been reported in breast cancer 22. MFGE8, also known as lactadherin, is known to promote phagocytosis of apoptotic cells and has been shown to induce the tumorigenic potential of mammary epithelial cells 23.
We further investigated the two promising genes TBC1D9 and MFGE8 to uncover their role in TNBC. The interactors of TBC1D9 showed that it has a role in organelle localization, metabolism of lipids and organelle biogenesis and maintenance. The most important interactor of TBC1D9, ARL8A (ADP Ribosylation Factor Like GTPase 8A, fold change = 30), which has also been identified as an interactor of TBC1D9 by AP-MS (Biogrid data), is a GTPase known to bind to the lysosome and thereby recruit it to the microtubule for its trafficking to the periphery, resulting in cell migration 24. Lysosome trafficking leading to exocytosis helps in extracellular matrix remodelling and membrane repair during cancer 25. Nugues et al. have suggested that exocytosis by the lysosome has an important role in mitosis 26. It has been shown that upon binding of ARL8 to the lysosome, the lysosome is recruited to the cytoplasm where it internalizes circulating triacylglycerides and cholesterol esters to release fatty acids and glycerol, leading to the continuous production of ATP required for rapid proliferation in cancer cells 27. TBC1D9, which is a GTPase activating protein, acts on ARL8A (a GTPase) by inactivating this protein, therefore regulating proliferation, migration, membrane repair, extracellular matrix remodelling and mitosis (Fig. 6A). We have also identified PLK1 (polo-like kinase 1, fold change = 155) as an interactor of TBC1D9, which has a role in microtubule nucleation resulting in microtubule formation 28,29, therefore affecting trafficking of the lysosome. Active PLK1 enters the nucleus, where it plays an important role in mitosis 30,31. It is possible that TBC1D9 inhibits the activation of PLK1, thereby regulating these processes, but this will need to be further evaluated.
As for MFGE8, it is known to interact with phosphatidylserine (PS)-enriched surfaces, mostly labeling apoptotic cells.
After interacting with PS, it binds to integrin on the macrophage, leading to M2 polarization by activation of STAT3 signalling and resulting in tumor promotion and a pro-oncogenic inflammatory response 32 (Fig. 6B). Vallabhapurapu et al. have shown that many viable cancer cells express PS on their surface, which is recognized by macrophages, resulting in immunity to antitumor drugs 33. Furthermore, these tumor-associated macrophages secrete MFGE8 34, which has been shown to increase epithelial to mesenchymal transition (EMT), invasion and mitosis by activating Twist 1 (Twist-related protein 1) 35, and survival and resistance to stress by activating the PI3K/AKT (Phosphoinositide 3-kinase/Protein kinase B) pathway 35. Our data have highlighted a role for MFGE8 in mitochondrial tRNA aminoacylation, through its interactions with many aminoacyl-tRNA synthetases, which could result in regulating tRNA maturation and proofreading, RNA splicing, amino-acid editing, and tmRNA aminoacylation protein synthesis 36, and in turn regulate the shift from oxidative to glycolytic metabolism, prominent in cancer cells, via activation of the PI3K-PTEN-AKT pathway 36. We have also found that MFGE8 interacts with OS9 (Osteosarcoma Amplified 9, Endoplasmic Reticulum Lectin). OS9 interacts with HIF1α (Hypoxia Inducible Factor 1 Subunit Alpha) and leads to the degradation of HIF1α 37. Under hypoxic conditions in cancer, MFGE8 might interact with OS9, thereby releasing HIF1α and resulting in tumor growth and metastasis. However, this concept will need further evaluation. On the other hand, MFGE8 also interacts with SEL1L (Suppressor of Lin-12-Like Protein 1), which has a role in protein glycosylation. Protein glycosylation is an important event in cancer, as incorrect glycosylation can lead to cancer progression and metastasis 38.
The search to identify genes differentiating TNBC from non-TNBC has led to the identification of two potential genes, i.e. TBC1D9 and MFGE8. The bioinformatics results show that TBC1D9 might have a role in maintaining cellular integrity and therefore its expression is related to better survival outcome, whereas MFGE8 has a role in several oncogenic processes resulting in poor survival outcome in BC patients.
The major problem in treating TNBC is its heterogeneity. Even the six molecularly classified TNBC subgroups (four after refinement) display gene expression resembling other BC. As each patient shows different characteristics, the genes identified in this study serve the long-desired aim of finding common patterns across all TNBC patients, and they could be further developed as potential therapeutic targets. Targeting MFGE8 in TNBC would be possible as it is overexpressed in TNBC. Hence it could be downregulated by using MFGE8-specific inhibitors. Moreover, effector molecules of MFGE8 could also be targeted. However, TBC1D9 is downregulated in TNBC. The question therefore arises as to how to appropriately target a gene that is downregulated in a disease. This could be achieved with various approaches, such as targeting its regulators, gene therapy, a vaccine approach, or inhibiting the pathways activated by the loss of TBC1D9 (in this case ARL8A). These approaches are currently being investigated for various tumor suppressor genes and are of great utility 39,40.
The approach described in this study combines multiple disciplines linking clinical information, -omics data, machine learning algorithms and bioinformatics tools, and has proved to be useful and adequate to provide candidate genes that deserve to be pursued further.

Materials.
Constructs for the genes of interest were generated via Gateway cloning into pDEST 3′ 3xFLAG-

Machine learning analysis.
Three different algorithms were utilized to analyze the TCGA dataset: DT 42, RF 43 and SCM 13. In the supervised ML setting, we assume that the data are available as a set $S \stackrel{\text{def}}{=} \{(x_i, y_i)\}_{i=0}^{m} \sim D^m$, where $x_i \in X$ is a training example, $y_i \in Y$ the associated label or phenotype, $D$ is a data-generating distribution and $m$ the size of the dataset. We focus here on the binary classification problem, i.e. $y \in \{-1, 1\}$ or $y \in \{0, 1\}$. The goal of every learning algorithm is to obtain a predictor $h : X \to Y$ such that $h(x) = y \;\; \forall (x, y) \sim D$. Originally introduced by Marchand et al., the SCM is a greedy algorithm whose goal is to learn a conjunction or a disjunction of rules 13. SCM has two hyper-parameters: the penalty $p$ and the early stopping point $s$. They are the two model-selection parameters that give the user the ability to control the trade-off between the training accuracy and the size of the function. The SCM performs efficiently in classification problems with a focus on interpretable and sparse models. A decision tree is a tree where each node represents a feature (attribute), each link (branch) represents a decision (rule) and each leaf represents an outcome (categorical or continuous value). For stability purposes, we also used the RF algorithm, which is essentially a collection of DTs combined by majority vote. For the analysis, the main metrics used were accuracy, precision and F1-score. These metrics show how well the algorithm performs on the negative and positive examples simultaneously. Since the dataset is unbalanced, we focused our attention on the F1-score: a high F1-score means that the algorithm is performing well, and vice versa. These metrics are defined in terms of the following counts: TP = true positive, TN = true negative, FN = false negative, FP = false positive.
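The metrics named above can be computed from the four counts as in the following sketch. The formulas are the standard textbook definitions and are assumed, not confirmed, to match those used in the study; the function names and the example labels are hypothetical. The example also illustrates why accuracy alone is misleading on an unbalanced dataset such as one with 16% TNBC.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for binary labels."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """(TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """TP / (TP + FP); defined as 0 when nothing is predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f1_score(tp, fp, fn):
    """2TP / (2TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical cohort of 100 patients with 16 positives (TNBC),
# and a trivial classifier that always predicts the negative class.
y_true = [1] * 16 + [0] * 84
y_pred = [0] * 100
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
trivial_accuracy = accuracy(tp, tn, fp, fn)
trivial_f1 = f1_score(tp, fp, fn)
```

In this example the trivial classifier reaches a high accuracy while its F1-score is zero, which is the behaviour that motivates focusing on the F1-score for unbalanced data.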
In order to limit the effect of randomness, we repeated the learning experiments 100 times and presented the results of all the repeats. From these results, we retrieved the ten best models after the 100 repetitions. In the case of the SCM, this is straightforward since all the models were of the conjunction type. For the DTs and the RF, we made the assumption that only the first three features of each model are considered, and then counted the occurrences of these features. The detailed flow chart of the ML analysis is presented in Fig. 1.
Figure 6. (A) TBC1D9, a GTPase activating protein, inactivates ARL8. When active, ARL8 localizes on the lysosomal membrane and regulates the lysosomal positioning on microtubules that leads to trafficking of the lysosome to the periphery, therefore resulting in cell migration. ARL8 is essential for membrane repair and extracellular matrix remodelling by controlling the exocytosis of the lysosome 25. ARL8-bound lysosomes are also recruited to the circulating triacylglycerides and cholesterol esters, which are internalized by lysosomes and broken down into fatty acids and glycerol, respectively, leading to continuous ATP production essential for cell proliferation. A study has also shown that lysosomal exocytosis is essential for correct mitosis 26. PLK1 has a role in microtubule nucleation and mitosis. In the present study, a PLK1 interaction with TBC1D9 has been identified, which could possibly lead to inactive PLK1 by controlling its dissociation from its partners. This hypothesis, however, needs to be verified. (B) MFGE8 is known to interact with phosphatidylserine, which is expressed in apoptotic cells. After interaction, the complex binds to the integrin on the macrophages. This binding results in more secretion of MFGE8. MFGE8 activates efferocytosis of apoptotic cells. The process of efferocytosis and MFGE8 itself activate STAT3, leading to M2 polarization, tumor promotion and cancer stem cell activation. The apoptotic signaling also induces ER stress, resulting in tolerogenic phagocytosis, which also secretes MFGE8. This leads to EMT, invasion and mitosis by Twist induction, or survival and resistance to death by activation of the PI3K/AKT pathway. Upon ER stress induction, HIF1α is induced. This might be due to the interaction of secreted MFGE8 with the HIF1α inhibitor OS9, therefore releasing HIF1α, resulting in tumor growth and metastasis. MFGE8 also binds to SEL1L, whose increased expression has been seen in metastasis and cancer stem cell activation. The mechanism by which MFGE8 might regulate SEL1L is an open question.
Quantitative real-time PCR (q-PCR) analysis. Quantitative PCR was performed using SYBR Green technology as described previously 44. Briefly, oligo-primer pairs that allow the amplification of ~200 base pairs (bp) of the indicated specific mRNA were designed by GeneTools software and their specificity was verified by blasting the GenBank database. The sequences of the primers are indicated in Supplementary Table 9. Second-derivative and double-correction methods were used for data calculation and normalization 45, with three housekeeping genes (ATP50, HPRT1 and GAPDH). The mRNA levels were expressed as number of copies/µg of total RNA calculated using corresponding standard curves. The qPCR results were analysed using the Wilcoxon rank sum test.
Streptavidin-based affinity capture of biotinylated proteins. The Streptavidin Sepharose beads were washed twice in 1 mL of lysis buffer (60 µL of slurry per sample).
The beads were pelleted by centrifugation after each wash and were then incubated with samples at 4 °C on a rotator for 3 h. The bound beads were pelleted by centrifugation, washed twice with 1 mL of RIPA lysis buffer (without protease inhibitors) and transferred to a new 1.5 mL Eppendorf tube to minimize background contaminants. Tubes were centrifuged and the supernatants were discarded. The beads were washed three times with 1 mL of 50 mM ammonium bicarbonate (ABC).
[…] C18-StageTips in a fresh tube and adding 20 µL of buffer B. The tubes were centrifuged at 3,000 rpm for 30 s. The process was repeated three times. The eluate was dried using a SpeedVac and sent for LC-MS/MS analysis.

Sample preparation for LC-MS/MS analysis.
Protein identification by mass spectrometry. The analyses were performed at the proteomic platform of the Quebec Genomics Center. Peptide samples were separated by online reversed-phase (RP) nanoscale capillary liquid chromatography (nanoLC) and analyzed by electrospray mass spectrometry (ESI MS/MS). The experiments were performed with a Dionex UltiMate 3000 nanoRSLC chromatography system (Thermo Fisher Scientific) connected to an Orbitrap Fusion mass spectrometer (Thermo Fisher Scientific) equipped with a nanoelectrospray ion source. Peptides were trapped at 20 µL/min in loading solvent (2% acetonitrile, 0.05% TFA) on a 5 mm × 300 µm C18 pepmap cartridge pre-column (Thermo Fisher Scientific) for 5 min. Then, the pre-column was switched online with a self-made 50 cm × 75 µm internal diameter separation column packed with ReproSil-Pur C18-AQ 3-μm resin (Dr. Maisch HPLC) and the peptides were eluted with a linear gradient from 5 to 40% solvent B (A: 0.1% formic acid, B: 80% acetonitrile, 0.1% formic acid) in 90 min, at 300 nL/min 47. Mass spectra were acquired in data-dependent acquisition mode using Thermo XCalibur software version 3.0.63. Full scan mass spectra (350-1,800 m/z) were acquired in the orbitrap using an AGC target of 4e5, a maximum injection time of 50 ms and a resolution of 120,000. Internal calibration using lock mass on the m/z 445.12003 siloxane ion was used. Each MS scan was followed by acquisition of fragmentation spectra of the most intense ions for a total cycle time of 3 s (top speed mode). The selected ions were isolated using the quadrupole analyzer in a window of 1.6 m/z and fragmented by Higher energy Collision-induced Dissociation (HCD) with 35% collision energy. The resulting fragments were detected by the linear ion trap with an AGC target of 1e4 in rapid scan rate and a maximum injection time of 50 ms. Dynamic exclusion of previously fragmented peptides was set for a period of 20 s and a tolerance of 10 ppm 48.

Data dependent acquisition MS analysis.
Mass spectrometry data were stored, searched and analyzed using the ProHits laboratory information management system (LIMS) platform 49. Thermo Fisher Scientific RAW mass spectrometry files were converted to mzML and mzXML using ProteoWizard (3.0.4468) 50. The mzML and mzXML files were then searched using Mascot (v2.3.02) and Comet (v2012.02 rev.0). The spectra were searched with the RefSeq database (version 57, January 30th, 2013) acquired from NCBI against a total of 72,482 human and adenovirus sequences supplemented with "common contaminants" from the Max Planck Institute (https://141.61.102.106:8080/share.cgi?ssid=0f2gfuB) and the Global Proteome Machine (GPM; https://www.thegpm.org/crap/index.html). Charges +2, +3 and +4 were allowed and the parent mass tolerance was set at 12 ppm while the fragment bin tolerance was set at 0.6 amu. Deamidated asparagine and glutamine and oxidized methionine were allowed as variable modifications. The results from each search engine were analyzed through the Trans-Proteomic Pipeline (TPP; v4.6 OCCUPY rev 3) 51 via the iProphet pipeline 52.
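The parent mass tolerance above is expressed in parts per million (ppm), i.e. relative to the mass being matched. The sketch below shows how such a tolerance translates into an absolute window and a match test; the function names and the example masses are hypothetical and do not reproduce the internal behaviour of Mascot or Comet.

```python
def ppm_window(mass: float, ppm: float):
    """Return the (low, high) absolute window for a relative ppm tolerance."""
    delta = mass * ppm * 1e-6
    return mass - delta, mass + delta


def ppm_error(observed: float, theoretical: float) -> float:
    """Signed mass error of an observation, in ppm."""
    return (observed - theoretical) / theoretical * 1e6


def within_tolerance(observed: float, theoretical: float, ppm: float = 12.0) -> bool:
    """True if the observed mass lies within the ppm tolerance."""
    return abs(ppm_error(observed, theoretical)) <= ppm


# Window for a hypothetical 1000.0 Da precursor at the 12 ppm tolerance.
low, high = ppm_window(1000.0, 12.0)
```

Because the tolerance is relative, the absolute window widens with mass: 12 ppm is about 0.012 Da at 1000 Da but about 0.024 Da at 2000 Da.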
Statistical analysis of AP-MS and BioID data. The MS data generated by AP-MS or BioID were analyzed using SAINTexpress 53. For AP-MS data, four uncompressed untagged controls were used, while for BioID samples, 24 control samples were compressed to 12. Significant protein interactors were those found to have a Bayesian False Discovery Rate (BFDR) ≤ 0.01. Significant interaction partners were further refined using the Crapome online tool (https://crapome.org/) to remove proteins commonly co-purified in FLAG-tag pulldowns. For this, a cut-off of 0-20 was enforced (i.e. proteins which have been identified 0-20 times with FLAG-tag out of 411 experiments). The resulting dataset was used for subsequent analysis with Metascape software (https://metascape.org/).
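The two-step filtering described above (BFDR cut-off followed by the Crapome frequency cut-off) amounts to a simple selection over the scored interactors, as in this sketch. The record layout and function name are assumptions made for illustration. The BFDR values for ARL8A and ABHD16A are those reported in the Results; all Crapome counts and the two placeholder preys are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Interactor:
    gene: str
    bfdr: float        # SAINTexpress Bayesian false discovery rate
    crapome_hits: int  # times seen with FLAG-tag out of 411 experiments


def filter_interactors(records: List[Interactor],
                       bfdr_cutoff: float = 0.01,
                       max_crapome_hits: int = 20) -> List[Interactor]:
    """Keep interactors passing both the BFDR and the Crapome cut-offs."""
    return [r for r in records
            if r.bfdr <= bfdr_cutoff and r.crapome_hits <= max_crapome_hits]


records = [
    Interactor("ARL8A", 0.0, 3),       # Crapome count is hypothetical
    Interactor("ABHD16A", 0.01, 5),    # Crapome count is hypothetical
    Interactor("PREY_X", 0.05, 2),     # placeholder: fails the BFDR cut-off
    Interactor("PREY_Y", 0.0, 150),    # placeholder: frequent contaminant
]
kept = [r.gene for r in filter_interactors(records)]
```

Applying the statistical cut-off before the contaminant filter mirrors the order described in the text; since both conditions are simple thresholds, the final list would be the same in either order.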

Ethical approval.
All patients provided written informed consent. Ethical approval of the study was obtained from the Research Ethics Committee of the Centre de Recherche du CHU de Québec, Canada.

Consent for publication.
The study was carried out in accordance with the relevant guidelines and regulations.