Background

Outcomes for early-stage breast cancer (BC) patients have improved over recent decades as a result of better diagnostic accuracy, targeted drug therapies and improvements in early diagnosis.1 However, the ten-year mortality rate of BC patients remains ~20%, which is attributable to the development of metastasis.2 Several histopathological features have been studied as prognostic factors in BC, including tumour size, lymph node status and histological grade,3,4,5 which are strongly associated with outcome. Lymphovascular invasion (LVI) is an early event in the development of metastasis and is a potent prognostic factor.6 Although the molecular profiles associated with tumour differentiation in terms of histological type and grade and development of lymph node metastasis have been well characterised,7,8,9 the molecular mechanisms of LVI and associated genes that may represent therapeutic targets or biomarkers remain to be identified. The main challenge in determining the molecular profiles associated with LVI status in BC stems from the lack of LVI status in the available large-scale molecular studies and from the inherent subjectivity of morphological assessment of LVI status.

The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC)10 and The Cancer Genome Atlas (TCGA)11 cohorts are currently the largest genomic and transcriptomic datasets of early-stage BC patients with clinical follow-up. In this study, using these large transcriptomic datasets combined with thorough histological assessment of LVI, we applied bioinformatic analysis to evaluate the genes associated with LVI and assessed the prognostic value of genomic subtype based on LVI status.

Methods

The METABRIC cohort

In the METABRIC study,10 mRNA was extracted from primary tumours of female patients, and mRNA expression was evaluated using the Illumina TotalPrep RNA Amplification Kit and Illumina Human HT-12 v3 Expression BeadChips (Ambion, Warrington, UK). The LVI status of 1565 patients within the METABRIC cohort was assessed histologically using haematoxylin and eosin (H&E)-stained slides. For the Nottingham subset included in METABRIC (n = 285/1565), LVI status was additionally assessed by immunohistochemistry (IHC) utilising CD31, CD34 and D2-40,12 and the final LVI status was confirmed using a combination of multiple H&E tumour sections and IHC. Considering the different methods of LVI assessment, cases were divided into two groups: (1) the Nottingham cases and (2) the remaining METABRIC cases (n = 1280). Gene transcript expression levels between LVI-positive and LVI-negative cases were compared for each group, as described in the ‘Bioinformatics analysis’ section.

The TCGA cohort

Data from the TCGA11 cohort of female BC patients (n = 854) were extracted from the Genomic Data Commons Data Portal and cBioPortal website.13,14 Briefly, the datasets of mRNA expression from RNASeqV2 were accessed along with de-identified clinical information for several clinicopathological factors and outcomes. Digital H&E-stained slides from the TCGA_BRCA cohort were accessed via the cBioPortal website, and LVI status was assessed by an expert breast pathologist (LD).

Bioinformatics analysis

Analysis of mRNA expression data from METABRIC has been previously described.10 Differentially expressed genes (DEGs) between LVI-positive and LVI-negative cases were identified using the weighted average difference (WAD) method, and the DEGs were selected according to the WAD ranking.15,16 Lists of the top 350 genes associated with LVI according to the WAD ranking in (1) the Nottingham cases in the METABRIC cohort (n = 285) and (2) the remaining METABRIC cases (n = 1280) are shown in Supplementary Tables 1 and 2. DEGs that overlapped between the two groups were included in the gene set associated with LVI.
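The WAD ranking and overlap step can be illustrated with a minimal sketch. It assumes the usual published form of the WAD statistic (the difference in group means multiplied by a weight derived from the gene's relative average log signal) and assumes that genes are ranked by absolute WAD value; the exact implementation and ranking rule used in this study are not stated here. The function names (`wad_scores`, `top_genes`, `overlapping_degs`) and the input layout are hypothetical.

```python
from statistics import mean


def wad_scores(group_a, group_b):
    """Return one WAD score per gene.

    group_a, group_b: dict mapping gene -> list of log2 expression values,
    e.g. LVI-negative (a) and LVI-positive (b) samples. Both dicts are
    assumed to contain the same genes.
    """
    genes = list(group_a)
    mean_a = {g: mean(group_a[g]) for g in genes}
    mean_b = {g: mean(group_b[g]) for g in genes}
    # Average log signal of each gene across the two groups.
    avg = {g: (mean_a[g] + mean_b[g]) / 2 for g in genes}
    lo, hi = min(avg.values()), max(avg.values())
    span = hi - lo
    scores = {}
    for g in genes:
        # Weight in [0, 1]: highly expressed genes are up-weighted.
        weight = (avg[g] - lo) / span if span > 0 else 0.0
        scores[g] = (mean_b[g] - mean_a[g]) * weight
    return scores


def top_genes(scores, n):
    """Top-n genes by absolute WAD (assumed ranking rule)."""
    ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return set(ranked[:n])


def overlapping_degs(scores_1, scores_2, n=350):
    """Genes in the top-n list of both groups."""
    return top_genes(scores_1, n) & top_genes(scores_2, n)
```

Under these assumptions, `overlapping_degs` would be called once with the scores from the Nottingham cases and the scores from the remaining METABRIC cases to obtain a candidate LVI-associated gene set.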

The Cluster 3.0 package was used for clustering and heat map construction.17 Clustering analysis was performed using METABRIC data as the discovery set and validated using TCGA data as the validation set. TCGA mRNA data were log2-transformed prior to clustering analysis.

For pathway analysis, the WEB-based GEne SeT AnaLysis Toolkit (WebGestalt) was used to calculate significantly enriched gene ontologies and pathways associated with these genes.18,19 The false discovery rate was controlled using the Benjamini–Hochberg procedure in WebGestalt, with an adjusted-p < 0.01 considered statistically significant.
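For readers unfamiliar with the Benjamini–Hochberg adjustment, the following is a minimal sketch of the procedure applied to a set of enrichment p-values. It is not WebGestalt's own code; the function names and the term-to-p-value mapping are hypothetical, and only the 0.01 threshold is taken from the text.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the adjusted
    # values stay monotone in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_terms(term_p_values, alpha=0.01):
    """Keep terms whose adjusted p-value is below alpha."""
    names = list(term_p_values)
    adj = benjamini_hochberg([term_p_values[n] for n in names])
    return {n: q for n, q in zip(names, adj) if q < alpha}
```

The adjustment depends on how many terms are tested together, so the same raw p-value can pass or fail the threshold depending on the size of the ontology category analysed.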

Statistical analysis

Statistical analyses were conducted using IBM SPSS Statistics for Windows, version 24.0 (IBM Corp., Armonk, NY, USA). The chi-squared test was used to assess differences among several clinicopathological factors, including LVI status, tumour size, lymph node status, histological grade, oestrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor 2 (HER2) and molecular subtypes, as stratified by the LVI-associated genomic subtype.

Kaplan–Meier survival curves of 10-year overall survival (OS) were plotted for the METABRIC and TCGA cohorts. Ten-year OS was measured from the day of surgery to the day of death within 10 years or to the day of completing follow-up. In univariate and multivariate analyses, hazard ratios and 95% confidence intervals (CIs) were estimated using the Cox proportional hazards regression model to determine the associations between prognosis and clinicopathological factors (LVI status, tumour size, lymph node status, histological grade, ER, PR and HER2), including the LVI-associated genomic subtype.
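The 10-year OS definition can be made concrete with a short sketch of the Kaplan–Meier estimator. This is an illustration only (the analyses in this study were run in SPSS). It assumes follow-up times in years and that patients followed beyond 10 years are treated as censored at 10 years; the function names are hypothetical.

```python
def censor_at(times, events, horizon=10.0):
    """Censor follow-up at `horizon` years (assumed handling of long follow-up).

    times: follow-up in years from surgery; events: 1 = death, 0 = censored.
    """
    new_times, new_events = [], []
    for t, e in zip(times, events):
        if t > horizon:
            new_times.append(horizon)
            new_events.append(0)
        else:
            new_times.append(t)
            new_events.append(e)
    return new_times, new_events


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct death time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after t.
        at_risk -= removed
    return curve
```

Running the two functions separately on LVI-positive and LVI-negative patients would give the two step curves compared in a plot of this kind; the hazard ratios themselves require a Cox model, which is not sketched here.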

Results

Clinicopathological and prognostic significance of LVI status

In the METABRIC cohort, 635/1565 (41%) were LVI-positive and 930 (59%) were LVI-negative. The LVI-positivity rate was 41.1% (117/285) in the Nottingham cases and 40.5% (518/1280) in the remaining METABRIC cases. In the TCGA cohort, 295/854 (35%) patients were LVI-positive and 559 (65%) were LVI-negative. In both cohorts, LVI positivity was significantly associated with large tumour size (METABRIC: p < 0.0001; TCGA: p = 0.00055), positive nodal status (METABRIC and TCGA: both p < 0.0001) and high histological grade (METABRIC and TCGA: both p < 0.0001; Supplementary Table 3).

The survival of LVI-positive BC patients was significantly worse than that of LVI-negative patients in the METABRIC (hazard ratio [HR] 1.70, 95% CI 1.45–2.01, p < 0.0001; Fig. 1a) and TCGA cohorts (HR 2.2, 95% CI 1.46–3.38, p = 0.00019; Fig. 1b). Univariate and multivariate analyses of both METABRIC and TCGA datasets are summarised in Supplementary Table 4. Univariate analysis using the Cox proportional hazards regression model identified LVI-positive status, large tumour size (METABRIC: HR 1.82, 95% CI 1.49–2.21, p < 0.0001; TCGA: HR 1.81, 95% CI 1.08–3.04, p = 0.025), positive nodal status (METABRIC: HR 2.06, 95% CI 1.74–2.44, p < 0.0001; TCGA: HR 1.85, 95% CI 1.20–2.85, p = 0.0056), negative ER status (METABRIC: HR 1.66, 95% CI 1.38–1.99, p < 0.0001; TCGA: HR 1.89, 95% CI 1.19–2.98, p = 0.0065) and negative PR status (METABRIC: HR 1.67, 95% CI 1.42–1.98, p < 0.0001; TCGA: HR 1.68, 95% CI 1.08–2.61, p = 0.020) as poor prognostic factors in both cohorts. In addition, high histological grade (HR 1.63, 95% CI 1.37–1.93, p < 0.0001) and positive HER2 status (HR 1.92, 95% CI 1.54–2.38, p < 0.0001) were significant prognostic factors in the METABRIC cohort. LVI positivity was an independent poor prognostic factor in multivariate analysis (METABRIC: HR 1.29, 95% CI 1.07–1.56, p = 0.0073; TCGA: HR 2.19, 95% CI 1.32–3.62, p = 0.0023; Supplementary Table 4).

Fig. 1

Cumulative survival of BC patients stratified by LVI status. a Ten-year overall survival in the METABRIC cases was significantly worse in the LVI-positive group than in the LVI-negative group. b In TCGA cases, overall survival differed significantly between the LVI-positive and LVI-negative groups. Cumulative survival of breast cancer patients stratified by LVI-related genomic subtypes. c Ten-year overall survival in breast cancer patients with LVI-related genomic subtypes. Survival in subtype 2 was significantly worse than in subtype 1 in the METABRIC cohort. d Classification of LVI-related genomic subtype was a significant prognostic factor in the TCGA cohort

Genes associated with LVI

The overlapping DEGs between (1) the Nottingham cases in the METABRIC cohort (n = 285) and (2) remaining METABRIC cases (n = 1280) included 42 significantly overexpressed and 57 downregulated genes (Table 1, Supplementary Tables 5 and 6).

Table 1 List of 99 genes significantly associated with lymphovascular invasion

The 99 genes in the LVI-related set were significantly associated with gene ontologies, including ‘GO: 0005615 Extracellular space’, ‘GO: 0072562 Blood microparticle’ and ‘GO: 0031012 Extracellular matrix’ (Table 2). All significant terms fell within the ‘Cellular component’ category of gene ontology (Supplementary Fig. 1).

Table 2 Gene ontology pathways significantly associated with 99 genes related to lymphovascular invasion

Hierarchical clustering was used to further analyse these 99 genes based on similarity in expression (Fig. 2a). Clustering in the discovery (METABRIC) cohort classified cases into two subtypes, namely, subtypes 1 (n = 738 cases; 45%) and 2 (n = 827; 55%) (Fig. 2b). The dendrogram of METABRIC cases, in which the pattern of the branches indicates the relationship for each case, is shown in Supplementary Fig. 2.

Fig. 2

Cluster analysis of the gene set associated with LVI. a The dendrogram of 99 LVI-related genes using the METABRIC cohort, in which the pattern of the branches indicates the relationship for each gene. Heat maps in accordance with the LVI-related gene set for the b METABRIC and c TCGA cohorts showed that all cases were clearly divided between subtypes 1 and 2 using cluster analysis

To validate these results, hierarchical clustering was conducted on the TCGA cohort using the same 99 genes. The dendrogram classifying these 854 cases is shown in Supplementary Fig. 3, again showing the cases split into two groups: subtypes 1 and 2, with 263 (31%) and 591 (69%) cases, respectively (Fig. 2c).

In both cohorts, LVI positivity was significantly more prevalent in subtype 2 tumours than in subtype 1 tumours (METABRIC and TCGA: p < 0.0001; Table 3).

Table 3 Clinicopathological significance of genomic subtypes related to lymphovascular invasion

Clinicopathological and prognostic significance of the LVI-related gene sets

In the METABRIC and TCGA cohorts, subtype 2 was significantly associated with large tumour size (both p < 0.0001), high histological grade (both p < 0.0001), ER negativity (both p < 0.0001), PR negativity (both p < 0.0001) and HER2 positivity (both p < 0.0001; Table 3). Interestingly, 69% of luminal B, 95% of HER2-enriched and 90% of basal-like BCs were classified as subtype 2 in the METABRIC cohort.

Patients with LVI-related subtype 2 had a significantly worse prognosis compared with those presenting with subtype 1 tumours in both cohorts (METABRIC: HR 1.78, 95% CI 1.50–2.12, p < 0.0001; TCGA: HR 2.32, 95% CI 1.35–3.99, p = 0.0023; Fig. 1c, d). In multivariate survival analysis, the LVI-related genomic subtype was an independent poor prognostic factor in both cohorts (METABRIC: HR 1.32, 95% CI 1.07–1.63, p = 0.0098; TCGA: HR 2.76, 95% CI 1.19–6.38, p = 0.018; Fig. 3 and Supplementary Table 7).

Fig. 3

Survival analysis based on clinicopathological characteristics including LVI-related genomic subtype. Forest plots showing the hazard ratios and 95% CI of the multivariate survival analyses in a the METABRIC cohort and b the TCGA cohort. The LVI-related genomic subtype was an independent prognostic factor in both cohorts

Discussion

In this study, we identified a 99-gene set significantly associated with LVI status in the METABRIC dataset. We validated this finding using the TCGA dataset. LVI is a biomarker for aggressive BC and is considered predictive for metastasis.20 In other cancer types, gene sets associated with vascular invasion have been previously described, for example in hepatocellular carcinoma21 and endometrial cancer.22 Mannelqvist et al.23 suggested that an 18-gene set associated with vascular invasion in endometrial cancer22 was consistently associated with hormone receptor negativity, HER2 positivity, basal-like phenotype and reduced survival in BC patients. In line with these findings, the present study found that 69% of luminal B, 95% of HER2-enriched and 90% of basal-like BCs were subtype 2 in the METABRIC cohort. Subtype 2 was significantly associated with LVI positivity. However, of the 18 genes identified in Mannelqvist et al., only different isoforms of matrix metallopeptidase (MMP) and serpin family E member (SERPINE) were present in our 99-gene set.

The underlying molecular mechanisms driving LVI in BC, which are potential therapeutic targets, have yet to be identified. The 99 genes in the LVI-related gene signature from this study are significantly associated with extracellular pathways. In previous work, Klahan et al.24 suggested that their gene set associated with LVI was related to extracellular matrix components using microarray data from 108 BC patients. Epithelial–mesenchymal transition (EMT)-implicated genes in prostate cancer have also been associated with pathways relating to the extracellular space.25 The extracellular matrix comprises a network of structural proteins, and reorganisation of this matrix is required for cancer to progress.26 The EMT is thought to play an important role in the process of metastasis to distant sites, and certain EMT markers are related to LVI status in BC.12 In the 99-gene LVI signature set, there are several genes associated with extracellular pathways that are implicated in BC prognosis. For example, heat shock protein 27 (HSPB1) is associated with BC aggressiveness and metastasis.27 HSPB1 expression is upregulated in the early phase of cell differentiation, which implies that HSPB1 may play an important role in controlling the growth and migration of cancer stem-like cells.28 Another example is apolipoprotein C1 (APOC1), which is considered a prognostic biomarker for triple-negative BC.29 APOC1 is thought to regulate the inflammatory response in cancer tissues,30 which may be closely related to the elimination of proliferating cancer cells.31 Upregulation of MMPs is also related to cancer cell proliferation, invasion and epithelial-to-mesenchymal transformation and is indicative of a poor prognosis for BC patients.32 As an example, MMP-11, which belongs to the MMP family, promotes BC development by inhibiting apoptosis as well as enhancing the migration and invasion of BC cells.33 Additional functional studies of these genes are necessary to explore the association of aberrant gene function and proteins related to LVI in BC.

Comparison of the METABRIC and TCGA cohorts was a limiting factor in this study, in terms of the different methods used to quantify and statistically analyse gene expression and in the approaches to LVI evaluation. We previously developed a method for the accurate detection of LVI using immunostaining for CD34 or D2-40.12 In the Nottingham cases, we evaluated LVI status using strict criteria based on both morphology and immunohistochemistry. However, for the TCGA BRCA cohort, we evaluated LVI status using H&E-stained slides alone from the cBioPortal database. Although LVI evaluation using only one H&E slide is feasible, it may be difficult to clearly identify LVI negativity.34 In the present study, the LVI-positivity rates were similar between the Nottingham cases, the remaining METABRIC cases and the TCGA_BRCA cases despite the different LVI evaluation methods. Although our results might suggest the adequacy of LVI evaluation with only one H&E-stained slide, further analysis of larger cohorts in which LVI status is assessed using both H&E and IHC slides is necessary to report accurately on LVI status.

Microarrays were used to evaluate mRNA expression in the METABRIC analysis. In contrast, RNA-seq using next-generation sequencing (NGS) was used in the TCGA analysis. Microarray platforms have been used and validated for nearly two decades, and this approach has been widely used for evaluating multi-gene expression. Conversely, the unbiased genome-wide RNA-seq method allows for the analysis of all annotated transcripts in addition to the identification of novel transcripts, splice junctions and noncoding RNAs. These technological and methodological differences may underpin the known challenges of relating microarray and RNA sequencing data between studies.35,36 For example, the different approaches can have different lower limits of detection or may encompass different genomic regions. Thus, we cannot assume that the methods are interchangeable, and doing so would require rigorous cross-assay comparisons.37 Although there is statistical agreement across the different cohorts in the present study, further analysis using identical technologies (microarray and/or NGS assays) may provide clearer validation of the LVI gene signature.

In conclusion, we have confirmed the suitability and prognostic significance of our LVI-evaluation approach using the METABRIC and TCGA cohorts. We have determined a genomic subtype associated with LVI status and patient outcome in BC, thereby providing an experimental tool that may help to unravel the complex gene networks associated with LVI and that has potential clinical relevance. Consistency between clinical cohorts stratified by the LVI gene signature may be further improved by using the same definitions and evaluation methods for LVI status.