Identification of VWA5A as a novel biomarker for inhibiting metastasis in breast cancer by machine-learning based protein prioritization

Distant metastasis is the leading cause of death in breast cancer (BC). The timing of distant metastasis differs according to subtypes of BCs and there is a need for identification of biomarkers for the prediction of early and late metastasis. To identify biomarker candidates whose abundance level can discriminate metastasis types, we performed a high-throughput proteomics assay using tissue samples from BCs with no metastasis, late metastasis, and early metastasis, processed data with machine learning-based feature selection, and found that low VWA5A could be responsible for shorter duration of metastasis-free interval. Low expression of VWA5A gene in METABRIC cohort was associated with poor survival in BCs, especially in hormone receptor (HR)-positive BCs. In-vitro experiments confirmed tumor suppressive effect of VWA5A on BCs in HR+ and triple-negative BC cell lines. We found that expression of VWA5A can be assessed by immunohistochemistry (IHC) on archival tissue samples. Decreasing nuclear expression of VWA5A was significantly associated with advanced T stage and lymphatic invasion in consecutive BCs of all subtypes. We discovered lower expression of VWA5A as the potential biomarker for metastasis-prone BCs, and our results support the clinical utility of VWA5A IHC, as an adjunctive tools for prognostication of BCs.


Proteomic profiles of the discovery set
A total of 9455 proteins were quantified across the discovery set, and 6639 proteins were quantified with labelfree quantification (LFQ) intensities after filtration and imputation.Differentially expressed proteins (DEPs) were identified, where 830 proteins were found to be differentially expressed among the groups with an false discovery rate (FDR)-adjusted p-value < 0.1 (Supplementary Table S2).Among the DEPs, we found that the expression of VWA5A sequentially decreased from the NM group (median LFQ, 29.6; min-max, 28.9-30.3),LM group (median LFQ, 29.2; min-max, 28.0-28.3),and M group (median LFQ, 28.2; min-max, 27.2-29.3)(Supplementary Fig. S1).
To further characterize the implication of DEPs, we went through the recursive feature addition experiment, until the prediction performance of the support vector machine (SVM) classifier reached 0.8 precision, 0.79 recall, and 0.79 f1 score (Supplementary Fig. S2).After each of 29 leave-one-out cross-validation (LOOCV) experiments, a different set of features were selected.The nine biomarker candidates were finally chosen as those that were selected in at least 3 LOOCV experiments, which included VWA5A.The t-SNE embeddings of patients with the nine biomarker candidates could clearly discriminate metastasis type (Fig. 1A).The nine makers showed significantly high mutual information (MI) and network propagation (NP) scores compared to the other proteins (Supplementary Fig. S3), and when considering the combined ranks in MI and NP score, VWA5A outstood as the highest among the nine biomarker candidates, 2nd in the MI ranking and 1st in the NP ranking (Fig. 1B, Supplementary Fig. S3).

www.nature.com/scientificreports/
To cross-check the results of the external validation, we assessed the relationship between LFQ values of VWA5A and the DFS of patients in the discovery set.We found the significant prognostic impact of VWA5A protein levels in discriminating DFS in the discovery set (p < 0.0001; HR 0.18; 95% CI 0.08-0.38).

Biological role of VWA5A assessed by in vitro functional assay
Strong prognostic significance of VWA5A expression found in the pooled analysis METABRIC database, but not in certain subtypes of BCs indicate that this molecule may behave differently according to the intrinsic subtypes.Therefore, we sought to assess whether the knockdown of these genes can affect the biological behavior in an HR+ cell line (T47D) and two TNBC cell lines (BT20 and HCC70).
After transfection with siVWA5A, T47D, BT20, and HCC70 cell lines showed marked reduction of VWA5A expression assessed by RT-PCR (Fig. 3A).We also confirmed showed stronger propensity toward invasion and migration after transfection with siRNA against VWA5A in all three of the experimented cell lines (Fig. 3B).No morphological changes were observed after knockout of VWA5A.
To analyze the effect of expression of the gene VWA5A on cell proliferation, breast cancer cell lines were divided into subtypes, and cell proliferation and protein expression levels were confirmed (Supplementary Fig. S5).In the luminal (MCF7 and T47D) and TNBC type cell lines (BT20, HCC1143, HCC1937, MDA-MB-231, MDA-MB-468 and HCC1395), the lower the VWA5A expression, the faster the cell proliferation.On the other hand, in the HER2 type, the higher the VWA5A protein expression, the faster the cell proliferation.These results demonstrate that lower protein expression of VWA5A leads to faster cell proliferation in luminal and TNBC cells, suggesting that VWA5A acts as a tumor suppressor and that the expression of VWA5A is inversely correlated with cell proliferation.Interestingly, cell proliferation assay after transfection of the T47D line with siVWA5A showed that the cellular proliferation capability is reduced compared to the control (Supplementary Fig. S5c).This suggests that at least in HR+ BC cells, increased invasive and migratory behavior after VWA5A knockdown may be the result of increased cell-intrinsic metastatic potential, not explainable by proliferative capability alone.Taken together, these in vitro functional analyses confirmed the tumor suppressive effect of VWA5A on BCs especially regarding invasive and migratory potentials, therefore, supporting our discovery on proteomics-driven biomarker screening.

Clinicopathological characteristic of BCs with high VWA5A expression
To validate the biological significance of VWA5A expression on BC, we performed IHC for VWA5A on the tissue microarrays (TMA) composed of surgically resected BC tissue samples from the validation set.Of the 1003 patients in the validation set, 966 (96.3%) samples had available tumor cells on TMA for adequate interpretation of VWA5A IHC results.42.7% of the patients were classified as VWA5A-high group, while 57.3% were VWA5A-low (Fig. 4A).
Various clinicopathological features were compared between VWA5A-low and VWA5A-high groups and summarized in Table 2.We found that patients in the VWA5A-low group were more likely to be diagnosed with higher pT stages (p = 0.028; linear-by-linear test), and the decrease in the H-scores of VWA5A was associated with advanced T stage in the consecutive BC population (p = 0.014; Kruskal-Wallis test; Fig. 4B).In addition, BCs www.nature.com/scientificreports/with lower expression of VWA5A were more likely to show lymphatic invasion (p = 0.009; Table 2).Though no significant association between VWA5A expression and pN stage was seen in the total consecutive BC validation set (p = 0.481), lower VWA5A expression was associated with positive pN stage within HR+/HER2− subgroup (p = 0.036; Supplementary Table S3).Regarding the patients' survival, we found no significant differences in MFS (p = 0.125; HR 0.68; 95% CI 0.41-1.12)and OS (p = 0.089; HR 0.40; 95% CI 0.13-1.20).Though survival differences were not found, distinct clinicopathological features of the IHC-defined VWA5A-low group in the validation set support the aggressive behavior of VWA5A-low tumors and suggest the clinical implication of VWA5A expression on the BCs.

Discussion
In this study, we performed high-throughput proteomics analysis on the FFPE samples of BCs with varying times between initial surgery to metastatic event.We used a machine learning-based feature selection scheme on the proteomics data, and implemented MI and NP to narrow down into the potentially meaningful biomarkers regulating metastatic potential.
Since the introduction of high-throughput profiling technologies, enormous expression data on transcriptomics and proteomics levels has been accumulated.Differential expression analysis or overrepresentation analysis are commonly used to identify biomarkers of interest within the dataset; however, it is often very difficult to interpret the influence of the markers within the context of biological network.In contrast, NP can simulate the effects of biomarkers in a biological network by propagating the influence of the candidate molecule via protein-protein interactions (PPI) analysis.
By using this method, we identified and found VWA5A as the most valuable marker of metastatic potential of BC.VWA5A gene, also known as BCSC1 or LOH11CR2A, is located within the chromosome 11q23-q24, which has been known to be frequently (ranging from 45 to 63%) deleted in various cancers including breast, ovary, uterine cervix, and lung [11][12][13][14][15] .Within the genomic loci, VWA5A gene was first cloned in 1997, and the subsequent functional study suggested that VWA5A acts as a tumor suppressor gene 16,17 , where overexpression of VWA5A in MCF7 cell line resulted in enhanced tumorigenicity 16 .However, interrogation of the clinical implication of VWA5A in BC has not been thoroughly performed.
In our proteomics discovery data, the expression of VWA5A gradually decreased as the clinical features of each group worsened, from NM, LM, and finally M groups, which was in line with previous studies that designated VWA5A as a tumor suppressor 18 .Additional biological network-based feature selection also suggested this molecule as the reliable biomarker representing metastasis-prone BCs.Notably, our discovery set was composed of variable subtypes of BCs, though HR+HER2− type dominated.Strong association of VWA5A in this heterogeneous discovery population and metastasis-related clinical behavior prompted us to hypothesize that the assessment of VWA5A protein expression could be used as the initial screening method for prediction of poor prognosis, regardless of the BC subtype.
To gain more evidence on the clinical implication of VWA5A, we performed a survival analysis using the publicly available METABRIC dataset.Though significant differences in DFS and OS were not ascertained in and TNBC subsets, a strong association was found in the HR+ group.Nevertheless, additional in vitro experiments on HR+ and TNBC cell lines using siRNA showed that VWA5A loss resulted in a marked increased in invasiveness and migratory potential of these cells, suggesting that VWA5A still matters at least in HR+ and TNBC subtypes of BCs.The limitations of our in vitro experimental design include lack of VWA5A overexpression assay, rescue experiments of VWA5A, and in vivo functional validation of VWA5A.However, consistent consequences of VWA5A knockdown across multiple types of BC cell lines cautiously support the tumorsuppressive nature of the biomarker.
The pervasive role of VWA5A in suppressing variable types of BCs may result from the common cellular mechanism associated with metastatic behavior.VWA5A lies in the vicinity of CHEK1, PIG8, and ROBO3 loci, and this genomic region is known to be commonly deleted together 19 .Therefore, CHEK1 deletion coupled with VWA5A loss would result in impaired DNA damage response, contributing to genomic instability and resultant aggressive behavior of the BC cells [20][21][22] .Another possible explanation would include the interaction between tumor cells and extracellular matrix.It is reported that the tissue expression of VWA5A showed an inverse correlation with the expression of MMP14 23 , a form of matrix metalloproteinase (MMPs).MMPs play pivotal roles in tumor invasion and metastasis by participating in ECM degradation and activating other MMP family members in various types of solid tumors [24][25][26][27] .In addition, it would be intriguing to investigate whether the loss of VWA5A causes tumor cell-intrinsic alterations-i.e.epithelial mesenchymal transition (EMT)-along with ECM modulation, as suggested by previous reports [28][29][30][31] .
In the area of breast cancer research, recent advances were developed mostly using cutting-edge technologies, and it is often very difficult to robustly implement the findings into routine clinical practice.However, we found that expression of VWA5A can readily be performed by IHC on archival FFPE tissues.Nuclear expression of VWA5A on the surgically resected BC tumor cells could easily be assessed by the pathologist semi-quantitatively, and as the expression levels decreased, significant correlations with advanced T stage and the presence of lymphatic invasion were noted in the validation cohort composed of consecutive BCs of all subtypes.Pathological T stages and lymphatic invasion status were factors retrieved from the pathology reports of surgically resected BCs, which means that these features can seldom be assessed in the limited, needle biopsy specimens.Therefore, assessment of VWA5A IHC on the initial diagnostic needle biopsy specimen of BCs of all types can provide not only prognostic information but also clinically meaningful adjunctive information, especially the ones that cannot be readily assessed in biopsy samples.
The lack of significant association between VWA5A IHC levels and survival may stem from the intrinsic limitation of the validation cohort-the population was composed of consecutive patients with surgically resected BCs, thus only a limited number of patients developed distant metastasis in these operable cases.Therefore, future studies using the initial breast biopsy samples of metastatic BCs would be able to further ascertain the performance of VWA5A expression on predicting the metastatic regulatory potential.
Another limitation of our study is the relative underrepresentation of HER2+ BCs.HER2+ BCs accounted for 13.8% of the patients in the discovery set and 15.8% in the validation set, which were less than the TNBC subtype.In addition, the lack of survival correlation between VWA5A gene expression and survival data according to METABRIC analysis implies that the biological role of VWA5A may be different in HER2+ BCs unlike HR+ BCs or TNBCs.HER2+ BCs are notably characterized by aggressive behavior including distant metastasis, despite advances in targeted therapies.Especially, HER2+ BCs show a propensity toward brain metastasis, and accurate prediction of this aggressive phenomenon is one of the most important unmet needs in the management of HER2+ BC.Therefore, the discovery of biomarkers for metastasis prediction and prognostication in HER2+ BC should be performed in the near future by implementing a multi-omics approach.In summary, we discovered lower expression of VWA5A as the potential biomarker for metastasis-prone BCs by high-throughput proteomics assay followed by machine learning-based protein expression analysis and feature selection.In addition to the prognostic significance assessed by METABRIC cohort analysis, the biological influence of VWA5A knock-out confirmed the tumor suppressive role of VWA5A.We suggest the clinical utility of VWA5A IHC, where further validation of VWA5A IHC adjunctive tools for prognostication should focus on the large-scale verification of the population with higher numbers of distant metastasis and HER2+ BCs.

Patient selection
In the discovery set for high-throughput proteomics analysis, 29 patients with BC diagnosed between 1998 and 2014 at Seoul National University Hospital (SNUH) with available clinical follow-up data including DFS, MFS, and OS were included.We defined the LM as the development of metastasis after 5 years from the diagnosis.
In addition, a retrospective consecutive cohort of 1002 patients with BCs who underwent surgical resection and were diagnosed between 2009 and 2012 in SNUH were recruited.Patients who received neoadjuvant chemotherapy were excluded.FFPE surgical samples archived in the Department of Pathology, SNUH, TMA were constructed.The experienced breast pathologist (H.S.R) selected the 2 mm-sized representative areas from H&E slides of the samples, and the TMAs were made (Superbiochips Laboratory, Seoul, Republic of Korea).
This study was approved by the Institutional Review Board (IRB) of SNUH, and the individual informed consent forms from the patients were waived by the decision of IRB.We confirm that all methods were performed in accordance with the relevant guidelines and regulations.www.nature.com/scientificreports/

High-throughput proteomics assay
From the FFPE blocks of surgically resected samples of the discovery set, 10 μm thick sections were deparaffinized in xylene twice.The protein uses a combination of acetone precipitation and filter-aided sample preparation (FASP) 32 .Desalted pooled peptides were fractionated using the stage tip-based high-pH peptide fractionation method 32 .The pre-fractionated peptides were analyzed on an LC-MS system with an Easy-nLC 1000 (Thermo Fisher Scientific, Waltham, MA, USA) equipped with a nanoelectrospray ion source (Thermo Fisher Scientific) and Q-Exactive mass spectrometer (Thermo Fisher Scientific).The peptide samples were separated into a trap column and an analytical column (Thermo Fisher Scientific).
Raw MS/MS files were processed with MaxQuant (version 1.6.1.0)using the Andromeda search engine against the Human Uniprot protein sequence database (December_2014, 88 657 entries).Experiment details are presented in the Supplementary Methods.

Proteomics data analysis
Preprocessing Protein levels were normalized and statistical analyses were performed by Perseus (version 1.5.8.5).ANOVA was used to determine DEP among NM, LM, and M groups.Peptide intensity data from tandem mass spectrometry (MS/MS) were processed using MaxQuant acquired.LFQ intensity was used as a protein quantification measure.The proteome dataset of breast cancer metastasis was preprocessed as follows.Only proteins that were identified in all replicates were used.Missing values were predicted by k-nearest neighbor imputation.As a result, in total 9455 proteins were quantified and resulted in 6639 proteins after filtration and imputation.The MS-based proteomics data of all identified peptides has been registered in the ProteomeXchange Consortium (http:// prote omece ntral.prote omexc hange.org) via the PRIDE partner repository (data set identifier: PXD015171).

Machine learning-based feature selection
The goal is to select proteins as features for biomarker candidates whose abundance level can discriminate metastasis types: NM, M, and LM.The sample size of each metastasis type is 9 for NM, 10 for M, and 10 for LM, respectively.
The full schematic overview of the biomarker candidate selection method is illustrated in Fig. 5 and the pseudo-code of the algorithm is in Supplementary Fig. S6.Our feature selection method operates in two hierarchical levels.The higher level is to choose protein features that are selected multiple times in the 29 LOOCV experiments.On the lower level of each of 29 LOOCV experiments, a set of features is selected in the recursive feature addition scheme.Below we explain how the algorithm works.
At a higher level (Fig. 5A, Supplementary Fig. S6; line 2-5, 23-25), we used the LOOCV scheme for biomarker candidate selection to identify protein sets whose abundance can discriminate metastasis type of unseen patients.For each CV fold, a sample was used as a held-out test set (Supplementary Fig. S6; line 4), while the other samples were used as a training set for feature selection (Supplementary Fig. S6; line 5), denoted as tr_LOOCV.In each of 29 LOOCV folds, we performed feature selection by recursive feature addition and metastasis prediction tasks in fivefold CV settings.To evaluate the selected features, SVM classifier was trained with the selected features and tested with the test set in terms of precision, recall, and F1 score.We reported macro averaged evaluation metrics of precision, recall and F1 score respectively, which is to calculate an evaluation metric for all classes individually and average them.As a result, a different feature set is selected and tested for each of 29 LOOCV folds.Thus, the final features are features that are selected multiple times, 3 times, out of 29 LOOCV folds.Since we are selecting features out of 6639 protein features that are measured by mass spectrometry experiments, this is a very stringent feature selection criteria.Statistically, we can measure p-values for features selected as features at least 3 times as 1.95 × 10 −6 .
At the lower level (Fig. 5B), in each of the 29 LOOCV folds, protein features were selected by performing recursive feature addition (Supplementary Fig. S6; line 6-22).The first step is to prioritize proteins for the recursive feature addition analysis using all samples in tr_LOOCV (Supplementary Fig. S6; line 6).Proteins were ranked in terms of the discriminative power of metastasis type measured by MI.When there were ties for multiple proteins, we used the NP method to break the ties.The second step is to select features by recursive feature addition (Supplementary Fig. S6; line 7-22).Specifically, protein features were recursively added to an SVM classifier according to the rank until the prediction performance of the classifier reached peak.The prediction performance is measured with fivefold cross-validation within a tr_LOOCV (Supplementary Fig. S6; line 10-17).The feature set where prediction performance is maximum was regarded as the final selected feature for each LOOCV fold (Supplementary Fig. S6; line 18-21).Feature prioritization step.MI is a metric for statistical dependency between two variables, the metastasis label and protein abundance group in this case.Relevance between protein abundance level and metastasis type is computed as below.
Discrete variable X ∈ {NM, M, LM} denotes metastasis type and the other discrete variable Y ∈ {L, M, H} denotes protein abundance level, while L encodes for low abundance level, M for medium abundance level, and H for high abundance level.Since raw LFQ protein abundance is a continuous variable, we transformed the variable into a multi-chotomous discrete variable X where protein abundance levels are categorized into 3 groups, where www.nature.com/scientificreports/L for low, M for medium, and H for high abundance level, respectively.To handle the label imbalance issue, we sampled 100 times with an equivalent sample size (n = 9) for each metastasis type.Averaged MI across sampling batches was used to prioritize proteins.
To prioritize features in terms of influence in a biological network, NP simulates the effects of biomarkers in a biological network.In detail, NP re-prioritized proteins according to the relevance with the seed proteins, by propagating the influence of seed proteins along the PPI given a STRING (v11.0) 33as a protein-protein interaction network.For seed selection, DEPs among the three metastasis types were detected with the ANOVA test, for proteins with adjusted p-value < 0.1 after Benjamini-Hochberg multiple test correction.For a network, STRING network was trimmed with a confidence score, where normalized edge confidence < 0.5 was filtered out.For the implementation of the network propagation algorithm, we used HotNet2 34 .Feature selection step.After feature prioritization, features were recursively added to the classification model according to the feature rank.For evaluating each iteration of the recursive feature addition, SVM with an RBF kernel was trained to discriminate the metastasis types by recursively adding features according to the rank.The SVM classifier with RBF kernel was chosen as a classifier because the model showed the highest prediction performance compared to the other classifiers (Supplementary Table S1).As a new feature was added, the prediction performance, i.e., accuracy, of the current classification model in each iteration was computed in a threefold cross-validation scheme with a training set.Accuracy is calculated as macro averaged accuracy for multiclass classification 35 .Feature addition was iteratively performed until the accuracy reached the maximum.

Cross validation using publicly available gene expression dataset
To ascertain the validity of the candidate biomarker derived from the proteomics assay, we used the publicly available gene expression dataset-METABRIC 36 -to assess the prognostic significance of biomarker expression.

Figure 1 .
Figure 1.VWA5A is prioritized as a biomarker candidate for metastasis type discrimination.(A) The t-SNE plots of the patient embeddings with the nine features selected by our method (left) and with randomly selected features.(B) Scatter plots of the nine candidates in terms of mutual information ranks and NP score ranks.Candidate biomarkers selected by our method are marked as red dots.

Figure 2 .
Figure 2. Prognostic relevance of VWA5A gene expression on disease free survivals of patients in METABRIC cohort.Higher VWA5A expression was significantly associated with favorable disease free survivals in METABRIC cohort.

Figure 3 .
Figure 3. Biological role of VWA5A assessed by in vitro functional assay.(A) Knock-out VWA5A using siRNA transfection resulted in marked decrease in the expression levels of VWA5A.(B) siRNA transfected cell lines T47D, BT20 and HCC70 showed more aggressive behaviors including increased cellular invasion and migration.

Figure 4 .
Figure 4. Expression of VWA5A in the validation set.(A) VWA5A expression was assessed by immunohistochemistry and histoscore of 50 was used to discriminate VWA5A-low and VWA5A-high groups.(B) Lower VWA5A expression of breast cancer cells were associated with advance pT stage of the patients in the validation set.

Figure 5 .
Figure 5. Schematic overview of biomarker candidate selection.(A) Leave-one-out cross-validation (LOOCV) scheme for biomarker candidate selection.For each CV fold, a sample was used as a held-out test set, while the other samples were used for feature selection.After feature selection, SVM classifier was trained with the selected features and tested with the test set.(B) Within each LOOCV training set, we prioritized proteins by mutual information (MI) network propagation (NP).Proteins were ranked in terms of the discriminative power of metastasis type measured by MI.After feature prioritization, features were added to the classification model via recursive feature selection, until the accuracy reached the maximum.

Table 1 .
Clinicopathological characteristics of the discovery set.NM no metastatsis, LM late metastasis, M metastasis, SD standard deviation, MFS metastasis free survival, IHC immunohistochemistry, TNBC triple negative breast cancer.

Table 2 .
Clinicopathological characteristics according to VWA5A expression in the validation set.SD standard deviation, HR hormonal receptor, TNBC triple negative breast cancer.