A key genomic subtype associated with lymphovascular invasion in invasive breast cancer

Lymphovascular invasion (LVI) is associated with the development of metastasis in invasive breast cancer (BC). However, the complex molecular mechanisms of LVI, which overlap with other oncogenic pathways, remain unclear. This study, using available large transcriptomic datasets, aims to identify genes associated with LVI in early-stage BC patients. Gene expression data from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) cohort (n = 1565) was used as a discovery dataset, and The Cancer Genome Atlas (TCGA; n = 854) cohort was used as a validation dataset. Key genes were identified on the basis of differential mRNA expression with respect to LVI status as characterised by histological review. The relationships among LVI-associated genomic subtype, clinicopathological features and patient outcomes were explored. A 99-gene set was identified that demonstrated significantly different expression between LVI-positive and LVI-negative cases. Clustering analysis with this gene set further divided cases into two molecular subtypes (subtypes 1 and 2), which were significantly associated with pathology-determined LVI status in both cohorts. The 10-year overall survival of subtype 2 was significantly worse than that of subtype 1. This study demonstrates that LVI in BC is associated with a specific transcriptomic profile with potential prognostic value.


BACKGROUND
Outcomes for early-stage breast cancer (BC) patients have improved over recent decades as a result of better diagnostic accuracy, targeted drug therapies, in addition to improvements in early diagnosis. 1 However, the ten-year mortality rates of BC patients remain~20% which is attributable to the development of metastasis. 2 Several histopathological features have been studied as prognostic factors in BC, including tumour size, lymph node status and histological grade, [3][4][5] which are strongly associated with outcome. Lymphovascular invasion (LVI) is an early event in the development of metastasis and is a potent prognostic factor. 6 Although the molecular profiles associated with tumour differentiation in terms of histological type and grade and development of lymph node metastasis have been well characterised, 7-9 the molecular mechanisms of LVI and associated genes that may represent therapeutic targets or biomarkers remain to be identified. The main challenge in determining the molecular profiles associated with LVI status in BC stems from the lack of LVI status in the available large-scale molecular studies in addition to the inherent subjectivity of morphological assessment of LVI status.
The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) 10 and The Cancer Genome Atlas (TCGA) 11 cohorts are currently the largest genomic and transcriptomic datasets of early-stage BC patients with clinical follow-up. In this study, using these large transcriptomic datasets combined with thorough histological assessment of LVI, we applied bioinformatic analysis to evaluate the genes associated with LVI and assessed the prognostic value of genomic subtype based on LVI status.

METHODS
The METABRIC cohort In the METABRIC study, 10 mRNA was extracted from primary tumours of female patients, and mRNA expression was evaluated using the Illumina TotalPrep RNA Amplification Kit and Illumina Human HT-12 v3 Expression BeadChips (Ambion, Warrington, UK). LVI status of 1565 patients within the METABRIC cohort, which were histologically assessed using haematoxylin and eosin (H&E) stained slides. For the Nottingham subset included in METABRIC (n = 285/1565), LVI status was additionally assessed by immunohistochemistry (IHC) utilising CD31, CD34 and D2-40, 12 and the final LVI status was confirmed using a combination of multiple H&E tumour sections and IHC. Considering the different methods of LVI assessment, cases were divided into two groups: (1) the Nottingham cases and (2) the remaining METABRIC cases (n = 1280). Gene transcript expression levels between LVI-positive and LVI-negative cases were compared for each group, as described in the 'Bioinformatics analysis' section.
The TCGA cohort The data from the TCGA 11 cohort of female BC patients (n = 854) was extracted from the Genomic Data Commons Data Portal and cBioPortal website. 13,14 Briefly, the datasets of mRNA expression from RNASeqV2 were accessed along with de-identified clinical information for several clinicopathological factors and outcomes. Digital H&E-stained slides from the TCGA_BRCA cohort were accessed via the cBioPortal website, and LVI status was quantified by an expert breast pathologist (LD).

Bioinformatics analysis
Analysis of mRNA expression data from METABRIC has been previously described. 10 Differentially expressed genes (DEGs) between LVI-positive and LVI-negative cases were identified using the weighted average difference (WAD) method, and the DEGs were selected according to the WAD ranking. 15,16 Lists of the top 350 genes associated with LVI for the WAD assay in both (1) the Nottingham cases in the METABRIC cohort (n = 285) and (2) other METABRIC cases (n = 1280) are shown in Supplementary Tables 1 and 2. Overlapping DEGs between the two groups were included in the gene set associated with LVI.
The Cluster 3.0 package was used for clustering and heat map construction. 17 Clustering analysis was performed using METABRIC data as the discovery set and validated using TCGA data as the validation set. TCGA mRNA data were log2-transformed prior to clustering analysis.
For pathway analysis, the WEB-based GEne SeT AnaLysis Toolkit (WebGestalt) was used to calculate significantly enriched gene ontologies and pathways associated with these genes. 18,19 The false discovery rate was controlled using the Benjamini-Hochberg procedure in WebGestalt, with an adjusted-p < 0.01 considered statistically significant.
Statistical analysis Statistical analyses were conducted using IBM SPSS Statistics for Windows, version 24.0 (IBM Corp., Armonk, NY, USA). The chisquared test was used to assess differences among several clinicopathological factors, including LVI status, tumour size, lymph node status, histological grade, oestrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor 2 (HER2) and molecular subtypes, as stratified by the LVI-associated genomic subtype.
Kaplan-Meier survival curves of 10-year overall survival (OS) were plotted for the METABRIC and TCGA cohorts. The 10-year OS in this study was defined as the day of death within 10 years or the day of completing follow-up from the day of surgery. In univariate and multivariate analyses, 95% confidence intervals (CIs) were assessed using the Cox proportional hazards regression model to determine the associations between clinicopathological factors (LVI status, tumour size, lymph node status, histological grade, ER, PR and HER2), including the LVIassociated genomic subtype and prognosis.
The 99 genes in the LVI-related set were significantly associated with gene ontologies, including 'GO: 0005615 Extracellular space', 'GO: 0072562 Blood microparticle' and 'GO: 0031012 Extracellular matrix' ( Table 2). All significant pathways existed in the category 'Cellular component' of gene ontology ( Supplementary Fig. 1).
Hierarchical clustering was used to further analyse these 99 genes based on similarity in expression (Fig. 2a). Clustering in the discovery (METABRIC) cohort classified cases into two subtypes, namely, subtypes 1 (n = 738 cases; 45%) and 2 (n = 827; 55%) ( Fig. 2b). The dendrogram of METABRIC cases, in which the pattern of the branches indicates the relationship for each case, is shown in Supplementary Fig. 2.
To validate these results, hierarchical clustering was conducted on the TCGA cohort using the same 99 genes. The dendrogram classifying these 854 cases is shown in Supplementary Fig. 3, again showing the cases split into two groups: subtypes 1 and 2, with 263 (31%) and 591 (69%) cases, respectively (Fig. 2c).
In both cohorts, LVI positivity was significantly more prevalent in subtype 2 tumours than those of subtype 1 (METABRIC and TCGA: p < 0.0001; Table 3).

DISCUSSION
In this study, we identified a 99-gene set significantly associated with LVI status in the METABRIC dataset. We validated this finding using the TCGA dataset. LVI is a biomarker for aggressive BC and is considered predictive for metastasis. 20 In other cancer types, gene sets associated with vascular invasion have been previously described, for example in hepatocellular carcinoma 21 and endometrial cancer. 22 Mannelqvist et al. 23 suggested that an 18-gene set associated with vascular invasion in endometrial cancer 22 was consistently associated with hormone receptor negativity, HER2 positivity, basal-like phenotype, reduced patient survival in BC patients. In line with these findings, the present study found that 69% of luminal B, 95% HER2-enriched and 90% basal-like BCs were subtype 2 in the METABRIC cohort. Subtype 2 was significantly associated with LVI positivity. However, of the 18 genes identified in Mannelqvist et al., only different isoforms of matrix metallopeptidase (MMP) and serpin family E member (SERPINE) were present in our 99-gene set.
The underlying molecular mechanisms driving LVI in BC, which are potential therapeutic targets, have yet to be identified. The 99 genes in the LVI-related gene signature from this study are significantly associated with extracellular pathways. In previous work, Klahan et al. 24 suggested their gene set associated with LVI was related to extracellular matrix components using microarray data from 108 BC patients. Epithelial-mesenchymal transition (EMT)-implicated genes in prostate cancer have also been associated with pathways relating to the extracellular space. 25     Fig. 1 Cumulative survival of BC patients stratified by LVI status. a Ten-year overall survival in the METABRIC cases was significantly worse in the LVI-positive group than in the LVI-negative group. b In TCGA cases, significant differences were noted in patient overall survival in the LVI-positive and LVI-negative groups. Cumulative survival of breast cancer patients stratified by LVI-related genomic subtypes. c Ten-year overall survival in breast cancer patients with LVI-related genomic subtypes. Subtype 2 was significantly worse compared with subtype 1 in the METABRIC cohort. d Classification of LVI-related genomic subtype was a significant prognostic factor in the TCGA cohort The extracellular matrix comprises a network of structural proteins, and reorganisation of this matrix is required for cancer to progress. 26 The EMT is thought to play an important role in the process of metastasis to distant sites, and certain EMT markers are related to LVI status in BC. 12 In the 99 gene LVI signature set, there are several genes associated with extracellular pathways that are implicated in BC prognosis. For example, heat shock protein 27 (HSPB1), is associated with BC aggressiveness and metastasis. 27 HSPB1 expression is upregulated in the early phase of cell differentiation, which implies that HSPB1 may play an important role in controlling the growth and migration of cancer stem-like cells. 28 Another example is apolipoprotein C1 (APOC1), which is considered as a prognostic biomarker for triple-negative BC. 29 APOC1 is thought to regulate the inflammatory response in cancer tissues, 30 which may be closely related to the elimination of proliferating cancer cells. 31 Upregulation of MMPs is also related to cancer cell proliferation, invasion and epithelial-to-mesenchymal transformation and is indicative of a poor prognosis for BC patients. 32 As an example, MMP-11, which belongs to the MMP family, promotes BC development by inhibiting apoptosis as well as enhancing the migration and invasion of BC cells. 33 Additional functional studies of these genes are necessary to explore the association of aberrant gene function and proteins related to LVI in BC. Comparison of the METABRIC and TCGA cohorts was a limiting factor in this study, in terms of the different methods used to quantify and statistically analyse gene expression and in the approaches to LVI evaluation. We previously developed a method for the accurate detection of LVI using immunostaining for CD34 or D2-40. 12 In the Nottingham cases, we evaluated LVI status using strict criteria based on both morphology and immunohistochemistry. However, for the TCGA BRCA cohort, we evaluated LVI status using H&E-stained slides alone from the cBioPortal database. Although LVI evaluation using only one H&E slide is feasible, it may be difficult to clearly identify LVI negativity. 34 In present study, the LVI-positivity rates were  A key genomic subtype associated with lymphovascular invasion in invasive. . . S Kurozumi et al. closely similar between the Nottingham cases, the remaining METABRIC cases and TCGA_BRCA cases using the different LVIevaluations. Although our results might suggest the adequacy of LVI evaluation with only one H&E-stained slide, further analysis with the larger cohorts to assess the LVI status using both H&E and IHC slides is necessary to report accurately on LVI status. Microarrays were used to evaluate mRNA expression in the METABRIC analysis. In contrast, RNA-seq using NGS was used in the TCGA analysis. Microarray platforms have been used and validated for nearly two decades, and this approach has been widely used for evaluating multi-gene expression. Conversely, the unbiased genome-wide RNA-seq method allows for the analysis of all annotated transcripts in addition to the identification of novel transcripts, splice junctions and noncoding RNAs. These technological and methodological differences may underpin the known challenges of relating microarray and RNA sequencing data between studies. 35,36 For example, the different approaches can have different lower limits of detection or may encompass different genomic regions. Thus, we cannot assume that the methods are interchangeable, and doing so would require rigorous cross-assay comparisons. 37 Although there is statistical agreement across the different cohorts in the present study, further analysis using identical technologies (microarray and/or NGS assays) may provide clearer validation of the LVI gene signature.
In conclusion, we have confirmed the suitability and prognostic significance of our LVI-evaluation approach using the METABRIC and TCGA cohorts. We have determined genomic subtype associated with LVI status and patient outcome in BC, therefore, providing an experimental tool which may serve to unravel the complex gene networks associated with LVI with potential clinical relevance. Consistency between clinical cohorts stratified by LVI-gene signature may be further  IL17RB  MGP  VTCN1  NDP  SLC44A1  FCGBP  SELENOM  SLC40A1  CYP4X1  CLIC6  STC2  SUSD3  FGD3  NINJ1  SERPINA3  CFB  PYCARD  MT1E  ZBTB20  CXCL14  FCER1A  MFAP4  FBLN1  DCN  CXCL12  SRPX  GAS1  PDGFRL  VIM  ANXA1  C1S  DPYSL2  SERPINE2  SGCE  TPM2  ACTG2  DKK3  FST  CFD  TXNIP  HBB  HBA2  FOS  DUSP1  CYBRD1  ANG  MAOA  CEBPD  RPL3  EEF1B2  PLGRKT  APOE  APOC1  HLA-DQA1  S100A4  UBD  ELF3  S100P  GSTP1  CALML5  KRT7  PITX1  MX1  ISG15  IFI27  LY6E  NME1  SLC52A2  PTTG1  CDCA5  CCNB2  UBE2C  UBE2S  HMGA1  NOP56  YWHAZ  GNAS  IDH2  HMGB3  LAPTM4B  UCP2  MMP11  TNS3  SCD  PGAP3  ERBB2  SPDEF  LRRC26  TM7SF2  KRT8  KRT18  KRT18P55  SLC9A3R1  HSPB1  KRT19  COX6C  EEF1A2  DNAJA4   Subtype 1  Subtype 2 Subtype 2 Fig. 2 Cluster analysis of the gene set associated with LVI. a The dendrogram of 99 LVI-related genes using METABRIC cohort, in which the pattern of the branches indicates the relationship for each gene. Heat maps in accordance with the LVI-related gene set for the b METEBRIC and c TCGA cohorts showed that all cases were clearly divided between subtypes 1 and 2 using cluster analysis  Fig. 3 Survival analysis based on clinicopathological characteristics including LVI-related genomic subtype. Forest plots showing the hazard ratios and 95% CI of the multivariate survival analyses in a the METABRIC cohort and b the TCGA cohort. The LVI-related genomic subtype was an independent prognostic factor in both cohorts