Main

Colorectal cancer remains one of the leading causes of cancer mortality worldwide (Jemal et al, 2009). For rectal cancer (RC) patients with regional lymph node metastasis (RLNM), preoperative combined-modality treatment yields favourable therapeutic outcomes, including nodal downstaging and reductions in local recurrence and treatment-associated toxicity (Minsky et al, 1992; Leonard et al, 2005; Sebag-Montefiore et al, 2009). Given its pivotal role in guiding therapeutic selection, accurate assessment of RLNM status is an essential step before a preoperative therapeutic regimen is administered. Although imaging modalities, including endoscopic ultrasound (EUS), magnetic resonance imaging (MRI) and computed tomography (CT), are widely used in clinical practice to estimate RLNM status, their sensitivities and specificities for evaluating node involvement are less than 70% and 80%, respectively (Bipat et al, 2004). In addition, the size and morphological criteria used by current imaging systems may lead to misdiagnosis, particularly for nodes <5 mm in size with a smooth contour, for which predictive sensitivity, specificity and accuracy are greatly reduced (Liang et al, 1999). Clearly, a more accurate and clinically adaptable approach to predicting RLNM status would help to identify the RC patients who are likely to benefit from preoperative therapy.

Epithelial–mesenchymal transition (EMT) is a process whereby epithelial cells lose cell-to-cell adhesion and gain a mesenchymal phenotype, characterised by repression of membrane proteins such as E-cadherin and β-catenin and by overexpression of nuclear β-catenin, N-cadherin, Snail and Twist (Brabletz et al, 2001; Blanco et al, 2002; Thiery, 2002; Peinado et al, 2007). EMT endows cells with migratory and invasive properties and thus promotes the migration of tumour cells to regional lymph nodes or distant organs. For example, relocalisation of the EMT-related biomarker p120 catenin from the membrane to the cytoplasm correlated with shortened overall survival, elevated tumour stage and lymph node metastasis in colon cancer (Bellovin et al, 2005). Similarly, overexpression of the EMT inducer Nanog upregulated Snail and Slug activity, promoted cellular motility and predicted a poor prognosis in colorectal cancer (Meng et al, 2010). Beyond these molecular markers, tumour-specific biomarkers (cancer antigen (CA) 19-9 and carcinoembryonic antigen (CEA)) and clinicopathological parameters have been reported to predict lymph node metastasis in colorectal cancer (Mroczko et al, 2007; Fang et al, 2009). However, a single marker reflects the cancer phenotype from only one specific aspect and is therefore unlikely to predict tumour progression on its own (McShane et al, 2005). Hence, a method that integrates biomarkers across a wider scope might achieve the goal of accurately predicting RLNM status in RC patients.

Recently, data mining techniques, such as decision trees (Dietzel et al, 2010), neurofuzzy models (Catto et al, 2009), artificial neural networks (Anderson et al, 2010), logistic regression (Vermeulen et al, 2009), Bayesian networks (Stojadinovic et al, 2010) and support vector machines (SVM) (Zhu et al, 2009), have been applied to predict tumour progression and clinical outcome by integrating multigene and/or clinicopathological features. SVM algorithms, for example, perform well in two-category classification and have displayed strong predictive power in diagnosing lymph node metastasis in breast cancer and cutaneous melanoma (Mocellin et al, 2006; Sattlecker et al, 2010). Thus, an SVM algorithm that incorporates multiple markers might open the way to personalised prediction of RLNM status in RC.

In the present study, we analysed the expression levels of seven EMT signalling biomarkers by immunohistochemistry (IHC), together with demographical, clinicopathological and serological variables, in 193 RC patients. In the training set (74 cases), the variables with the highest ability to predict RLNM status were used to construct the SVM model. We then validated the predictive value of the SVM model in the testing set (119 cases). Our study showed that the SVM model had strong predictive value in refining RLNM status, providing a novel approach to the tailored selection of RC patients for neoadjuvant therapy.

Patients and methods

Patient selection

One hundred and ninety-three RC patients with pathologically confirmed disease, with or without RLNM, were recruited from April 2007 to December 2009 at the Gastrointestinal Institute of Sun Yat-sen University, the Sixth Affiliated Hospital of Sun Yat-sen University. None of these patients had received neoadjuvant chemotherapy or chemoradiotherapy. Of these, 74 patients were randomly assigned to the training set and the remaining 119 patients were assigned to the testing set for validating the predictive power of the SVM model. In detail, 37 patients (50.0%) in the training set and 82 patients (68.9%) in the testing set had pathologically confirmed RLNM. The clinical and serological variables, including gender, age, tumour stage, CEA, CA19-9 and CA125, were all obtained preoperatively. The patients were classified according to the 7th edition of the American Joint Committee on Cancer staging system (Takahashi et al, 2007). The detailed characteristics of the two cohorts are listed in Table 1. Our study was approved by the Clinical Ethics Review Committee at the Sixth Affiliated Hospital of Sun Yat-sen University, and consent was obtained from all patients.

Table 1 Patient characteristics

Tissue microarrays (TMAs)

The TMAs of 193 RC tumour specimens and an additional 20 normal tissues were constructed from specimens collected from the Tissue Bank at the Gastrointestinal Institute of Sun Yat-sen University, the Sixth Affiliated Hospital, Sun Yat-sen University. Because EMT has been reported to occur at the invasive front of colorectal adenocarcinoma (Brabletz et al, 2001), we re-reviewed hematoxylin and eosin (H&E)-stained slides and selected the marginal area of the tumour for TMA design. For each case, three cores (1 mm diameter) were taken from the selected marginal region of the tumour and two additional cores (1 mm diameter) were taken from histologically confirmed normal adjacent colorectal mucosa to construct the TMAs using Tissue Array (Alphelys, MINIPORE, Plaisir, France). Briefly, tissue cylinders were taken from the selected region of each donor tissue block and deposited into a recipient block. Sections of the recipient blocks were then stained with H&E to verify the adequacy of the tumour and normal tissues.

Immunohistochemistry staining

The TMAs were sectioned at 4 μm, deparaffinised three times in xylene for 30 min and rehydrated through graded alcohols (100% ethyl alcohol for 5 min, 95% ethyl alcohol for 3 min and 75% ethyl alcohol for 3 min). Sections were then heated in antigen retrieval solution (sodium citrate, pH 6.0) in a microwave for 15 min, incubated in H2O2 for 10 min and blocked with 1% bovine serum albumin (diluted in PBS buffer) for 10 min at room temperature. The sections were then incubated at 4°C overnight with 100 μl of primary antibody against the epithelial markers E-cadherin (rabbit monoclonal, clone 24E10, Cell Signaling, #3195, Danvers, MA, USA, diluted 1 : 50) and β-catenin (mouse monoclonal, clone 5H10, Millipore, MAB2081, Billerica, MA, USA, diluted 1 : 100), the mesenchymal cell marker N-cadherin (mouse monoclonal, clone 13A9, Upstate, 05-915, Waltham, MA, USA, diluted 1 : 100), or the transcriptional markers Snail (rabbit polyclonal, clone L70G2, Abcam, ab70983, San Francisco, CA, USA, diluted 1 : 200) and Twist (rabbit polyclonal, clone O-20, Santa Cruz, sc-102032, CA, USA, diluted 1 : 200). All primary antibodies were diluted with DAKO antibody diluent solution. The sections were then washed in PBS buffer three times for 5 min, treated with 100 μl of secondary antibody for 30 min at room temperature and stained with DAB according to the manufacturer's instructions until brown granules appeared in the membrane, cytoplasm or nucleus. The sections were then washed in PBS buffer for 10 min. Finally, the sections were counterstained with hematoxylin for 2 min at room temperature, washed in dH2O three times for 5 min and mounted with coverslips. The normal epithelium in each section served as the internal negative (N-cadherin, Snail and Twist) or positive (E-cadherin and β-catenin) control. The ganglion cells of the myenteric plexus were used as the positive control for N-cadherin reactivity. The RC samples that showed strong positive staining for Snail or Twist with the lowest background were used as positive controls for Snail and Twist, respectively. In addition, a negative control was prepared by replacing the specific primary antibody with non-immune serum immunoglobulins at a 1 : 200 dilution.

Semi-quantitative evaluation of IHC staining

The level of immunoreaction was scored semi-quantitatively by assessing staining intensity and extent, as we previously reported (Wan et al, 2010). Staining intensity was scored as follows: negative (score 0), bordering (score 1), weak (score 2), moderate (score 3) and strong (score 4). Staining extent was graded into five categories according to the percentage of stained cells in the field: negative (score 0), 0–25% (score 1), 26–50% (score 2), 51–75% (score 3) and 76–100% (score 4). The merged overall score (staining intensity score plus staining extent score) was used for further analysis. For nuclear β-catenin, each tissue sample was scored as positive or negative when ≥5% or <5% of tumour cells showed nuclear staining, respectively (Baumgart et al, 2007). Immunohistochemistry staining was scored independently by two pathologists (Y Huang and XJ Fan) who were blinded to the clinical data.
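The scoring rule described above can be written out as a short Python sketch. This is an illustration only, not the authors' code: the function names are hypothetical, the lowest extent category is assumed to mean no stained cells at all, and the nuclear β-catenin threshold is assumed to be 5% or more.

```python
# Intensity categories and their scores, as listed in the text.
INTENSITY_SCORES = {
    "negative": 0,
    "bordering": 1,
    "weak": 2,
    "moderate": 3,
    "strong": 4,
}


def extent_score(percent_stained):
    """Map the percentage of stained cells (0-100) to an extent score of 0-4."""
    if not 0 <= percent_stained <= 100:
        raise ValueError("percent_stained must lie between 0 and 100")
    if percent_stained == 0:  # assumption: 'negative' means no stained cells
        return 0
    if percent_stained <= 25:
        return 1
    if percent_stained <= 50:
        return 2
    if percent_stained <= 75:
        return 3
    return 4


def merged_score(intensity, percent_stained):
    """Merged overall score: intensity score plus extent score (range 0-8)."""
    return INTENSITY_SCORES[intensity] + extent_score(percent_stained)


def nuclear_beta_catenin_positive(percent_nuclear):
    """Assumed rule: positive when at least 5% of tumour cells show nuclear staining."""
    return percent_nuclear >= 5


example = merged_score("moderate", 60)  # 3 (moderate) + 3 (51-75%) = 6
```

Under these assumptions, a core with moderate staining in 60% of cells receives a merged score of 6, and the merged score always lies between 0 and 8.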

Selection of cutoff score for ‘positive’ expression of variables

Receiver operating characteristic (ROC) curve analysis was used to select the cutoff scores of all candidate variables in the training set. For each variable, the sensitivity and specificity at each score were plotted to generate a ROC curve. The score closest to the point of maximum sensitivity and specificity, the point (0.0, 1.0) on the ROC plot, was chosen as the cutoff score, so as to classify as many patients as possible correctly as with or without RLNM. According to the ROC curve analysis, the SVM model and all the candidate variables were dichotomised: SVM (positive RLNM vs negative RLNM), gender (male vs female), age (≥62.5 vs <62.5 years), tumour stage (T3+T4 vs T1+T2), CEA (≥3.90 vs <3.90), CA19-9 (≥13.35 vs <13.35), CA125 (≥10.00 vs <10.00), β-catenin (nucleus, positive vs negative) and the other EMT-related biomarkers (high level vs low level).
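The cutoff rule described above (the score whose ROC point lies closest to the corner (0.0, 1.0)) can be sketched in Python as follows. The function names and the toy scores and labels are hypothetical, and a score at or above the cutoff is assumed to count as 'positive'; the study's own analysis was run in statistical software, not with this code.

```python
import math


def roc_points(scores, labels):
    """Return (cutoff, sensitivity, specificity) for every observed score.

    A patient is called positive when score >= cutoff; labels are 1 (with
    RLNM) or 0 (without RLNM).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for cutoff in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        true_neg = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
        points.append((cutoff, true_pos / n_pos, true_neg / n_neg))
    return points


def closest_to_corner(points):
    """Pick the cutoff whose ROC point (1 - specificity, sensitivity) is nearest (0, 1)."""
    return min(points, key=lambda p: math.hypot(1 - p[2], 1 - p[1]))


# Hypothetical merged IHC scores and RLNM labels for eight patients.
toy_scores = [1, 2, 3, 4, 5, 6, 7, 8]
toy_labels = [0, 0, 0, 1, 1, 0, 1, 1]

best_cutoff, best_sens, best_spec = closest_to_corner(roc_points(toy_scores, toy_labels))
# For this toy data: cutoff 4, sensitivity 1.0, specificity 0.75
```

Choosing the point nearest the corner weighs sensitivity and specificity equally; other criteria, such as maximising the Youden index, can select a different cutoff from the same curve.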

The RLNM status prediction by SVM model

The SVM model, coded in Matlab (MathWorks, Natick, MA, USA), was used to predict RLNM status. First, we used the SVM method and ROC analysis to select, from all the candidate variables, those with high power in predicting RLNM status. Second, we designed and trained the SVM model by integrating the selected variables in the training set. Once training was complete, the SVM model was ‘fixed’ for all further use. The detailed steps of SVM model construction are given in the Supplementary Information. In the testing set, the ‘features’ of the selected variables for each patient were input into the SVM model, which output the predicted RLNM status of that patient as 0 (without RLNM) or 1 (with RLNM). The output for each patient was then subjected to univariate and multivariate analysis.
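The train-then-fix-then-predict workflow described above can be illustrated with a small Python sketch. It is not the Matlab model used in the study, whose kernel and parameters are given only in the Supplementary Information: as a stand-in, it fits a linear SVM by sub-gradient descent on the regularised hinge loss, and the function names, hyperparameters and toy patient features are all hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit a linear SVM by sub-gradient descent on the L2-regularised hinge loss.

    X is a list of feature vectors; y holds labels 0 (without RLNM) or 1 (with RLNM).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    signs = [1 if label == 1 else -1 for label in y]  # SVM convention: -1 / +1
    for _ in range(epochs):
        for xi, si in zip(X, signs):
            margin = si * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The regularisation term always shrinks the weights slightly.
            w = [wj * (1 - lr * lam) for wj in w]
            if margin < 1:
                # Patient is misclassified or inside the margin: shift the boundary.
                w = [wj + lr * si * xj for wj, xj in zip(w, xi)]
                b += lr * si
    return w, b


def predict(model, X):
    """Output 1 (with RLNM) or 0 (without RLNM) for each feature vector."""
    w, b = model
    return [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0 for xi in X]


# Hypothetical dichotomised features per patient: [Snail high, age above cutoff].
train_X = [[1, 0], [1, 1], [0, 0], [0, 1]]
train_y = [1, 1, 0, 0]

model = train_linear_svm(train_X, train_y)  # the model is 'fixed' after this step

# Hypothetical testing-set patients, classified with the fixed model.
test_X = [[1, 1], [0, 0]]
predicted = predict(model, test_X)
```

Keeping training and prediction in separate functions mirrors the design in the text: no parameter is re-estimated once the testing-set patients are presented to the model.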

Statistical analysis

The correlations between the expression levels of EMT-related biomarkers and RLNM status were evaluated by the chi-square test. Univariate and multivariate analyses were performed using a binary logistic regression model to estimate the odds ratio (OR) and 95% confidence interval (95% CI). This study was designed with 80% power (two-sided α level of 0.05) to construct the SVM prediction model. All P-values quoted were two-sided and P<0.05 was considered statistically significant. Statistical analysis was performed using SPSS v. 17.0 (SPSS, Inc., Chicago, IL, USA).

Results

EMT-related biomarkers expression in RC

We investigated the expression levels of EMT-related biomarkers in the training set (n=74) and testing set (n=119) by IHC analysis. In normal epithelia, the epithelial markers E-cadherin and β-catenin showed strong membranous expression (Figure 1a1 and 1a2), whereas the mesenchymal marker N-cadherin and the transcriptional factors Snail and Twist were not detected (Figure 1a3–1a5). In RC tumour tissues with negative RLNM, membranous expression of E-cadherin and β-catenin was reduced in both the training set and the testing set (Figure 1b1 and 1b2), and N-cadherin, Snail and Twist showed weak expression (Figure 1b3–1b5). In RC tumour tissues with positive RLNM, expression of E-cadherin and β-catenin was lost (Figure 1c1 and 1c2), whereas N-cadherin, Snail and Twist showed strong expression (Figure 1c3–1c5). However, the chi-square test showed that only the Snail expression level differed significantly between the RLNM-positive and RLNM-negative subsets in the testing set (P=0.001, Table 2). No significant difference between these subgroups was observed in the testing set for E-cadherin (P=0.327), N-cadherin (P=0.424), β-catenin (membrane, P=0.426), β-catenin (cytoplasm, P=0.406), β-catenin (nucleus, P=0.747) or Twist (P=0.845, Table 2). Similarly, no significant relationship was found between the EMT-related biomarkers and RLNM status in the training set (E-cadherin, P=0.484; N-cadherin, P=1.000; membranous β-catenin, P=0.484; cytoplasmic β-catenin, P=1.000; nuclear β-catenin, P=1.000; Snail, P=0.484 and Twist, P=0.243; Table 2).

Figure 1

Immunohistochemical staining of EMT-related biomarkers in normal epithelium (A, left panels, × 100, right panels, × 400), RC tumour tissues with negative RLNM (B, left panels, × 100, right panels, × 400) and RC tumour tissues with positive RLNM (C, left panels, × 100, right panels, × 400). In each subgroup, the right panels show an enlarged view of a representative area of the left panels. Normal colorectal epithelium showed strong membranous expression of E-cadherin and β-catenin (a1–2) and no expression of N-cadherin (a3), Snail (a4) or Twist (a5). Tumour tissues with negative RLNM showed reduced membranous expression of the epithelial markers E-cadherin and β-catenin (b1–2) and weak expression of N-cadherin (b3), Snail (b4) and Twist (b5). Tumour tissues with positive RLNM showed loss of membranous expression of the epithelial markers E-cadherin and β-catenin (c1–2) and strong expression of N-cadherin (c3), Snail (c4) and Twist (c5). One representative staining of each EMT-related biomarker is shown.

Table 2 Relationship between EMT-related biomarkers expression level and RLNM status in RC patients

Association between EMT-related biomarkers expression, clinicopathological variables and RLNM status

The cutoff point of each variable was determined by ROC curve analysis in the training set. In the testing set, each variable, dichotomised at its ROC-derived cutoff point, was subjected to univariate analysis with respect to RLNM status. In the training set, age (OR, 2.410; 95% CI, 0.947–6.131; P=0.065) showed borderline significance in predicting RLNM status. In the testing set, Snail (OR, 4.286; 95% CI, 1.692–10.858; P=0.002) and the SVM model (OR, 9.231; 95% CI, 3.588–23.751; P<0.0001) were significant predictors of RLNM status, whereas age (OR, 1.970; 95% CI, 0.895–4.334; P=0.092) and nuclear β-catenin (OR, 2.981; 95% CI, 0.847–10.486; P=0.089) were of borderline significance. ROC curve analysis further showed that the SVM model had the largest area under the curve (AUC) of all variables (AUC of 0.747 in the testing set; Table 3, Figure 2), suggesting that the SVM model had the greatest predictive value in refining RLNM status.
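For a single dichotomised predictor, the univariate logistic regression OR equals the cross-product ratio of the 2 × 2 table, and its Wald 95% CI follows from the four cell counts. The Python sketch below is illustrative only: the function name is hypothetical, and the counts are those reported for the SVM model in the testing set (56 of 82 patients with RLNM classified as high risk and 30 of 37 patients without RLNM classified as low risk, leaving 26 and 7 misclassified).

```python
import math


def odds_ratio_wald(tp, fn, fp, tn, z=1.96):
    """Odds ratio and Wald confidence interval from a 2 x 2 table.

    tp/fn: patients with RLNM predicted high/low risk;
    fp/tn: patients without RLNM predicted high/low risk.
    """
    odds_ratio = (tp * tn) / (fn * fp)
    # Standard error of log(OR) is the root of the summed reciprocal counts.
    se_log_or = math.sqrt(1 / tp + 1 / fn + 1 / fp + 1 / tn)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Testing-set counts for the SVM model, taken from the text.
svm_or, svm_lower, svm_upper = odds_ratio_wald(tp=56, fn=26, fp=7, tn=30)
# Roughly 9.23 (95% CI about 3.59 to 23.75)
```

These values agree, to rounding, with the univariate OR and 95% CI reported for the SVM model in the testing set.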

Table 3 Association between EMT-related biomarkers expression, clinicopathological variables and RLNM status in RC patients
Figure 2

ROC curves for positive RLNM, plotted for the EMT-related biomarkers, the clinicopathological and serological variables and the SVM model in the training set (A) and testing set (B). In the training set, the sensitivity and specificity for RLNM status at each score were plotted to generate a ROC curve. The score closest to the point of maximum sensitivity and specificity (0.0, 1.0) was chosen as the cutoff point for further analysis in the testing set.

The SVM model in defining the RLNM status

In the training set, six EMT-related biomarkers (E-cadherin, N-cadherin, cytoplasmic β-catenin, nuclear β-catenin, Snail and Twist) and two demographical variables (gender and age) were integrated to design the SVM model. After the model had been trained in the training set, it classified 63 cases in the testing set (82 cases with RLNM and 37 cases without RLNM) as at high risk of RLNM and 56 cases as at low risk. In the subset with RLNM, the SVM model identified 56 of the 82 cases as high risk. In the subgroup without RLNM, 30 of the 37 cases were identified as low risk. In detail, the sensitivity, specificity, positive predictive value, negative predictive value and overall accuracy of the SVM model were 68.3%, 81.1%, 88.9%, 53.6% and 72.3%, respectively. More importantly, multivariate logistic regression analysis revealed that the SVM model was an independent predictive factor for RLNM (OR, 11.536; 95% CI, 4.113–32.361; P<0.0001). Tumour stage also had significant predictive value for RLNM, although with lower power (OR, 4.443; 95% CI, 1.064–18.557; P=0.041) (Table 4).
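The five performance figures above follow directly from the reported testing-set counts. The short Python sketch below (the function name is hypothetical) shows the arithmetic, with 56 true positives, 26 false negatives (82 − 56), 7 false positives (37 − 30) and 30 true negatives.

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Standard diagnostic accuracy measures from a 2 x 2 confusion table."""
    return {
        "sensitivity": tp / (tp + fn),          # with RLNM, predicted high risk
        "specificity": tn / (tn + fp),          # without RLNM, predicted low risk
        "ppv": tp / (tp + fp),                  # predicted high risk, truly RLNM
        "npv": tn / (tn + fn),                  # predicted low risk, truly no RLNM
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


metrics = diagnostic_metrics(tp=56, fn=26, fp=7, tn=30)
percentages = {name: round(100 * value, 1) for name, value in metrics.items()}
# sensitivity 68.3, specificity 81.1, ppv 88.9, npv 53.6, accuracy 72.3
```

Note that the positive and negative predictive values depend on the proportion of patients with RLNM in the cohort (68.9% in the testing set), so they would differ in a population with a different prevalence.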

Table 4 Results of multivariate logistic regression analysis in testing set

Discussion

Preoperative adjuvant therapy has been shown to provide a substantial survival benefit in RC patients with RLNM (Rohwer et al, 2010). Current imaging systems are, however, inadequate for defining RLNM status, particularly for regional lymph nodes with a round shape and a size of <5 mm (Low et al, 2008). A meta-analysis demonstrated that the sensitivities and specificities of the three imaging modalities, EUS, CT and MRI, in evaluating lymph node involvement were similar (EUS, 67 and 78%; CT, 55 and 74%; MRI, 66 and 76%) (Bipat et al, 2004; Engelen et al, 2008). In the present study, we designed an SVM approach that combines multiple markers with a mathematical algorithm to predict RLNM status in RC patients. By integrating six EMT-related biomarkers and two demographical variables, we first trained our SVM model in the training set and then validated its predictive value in the testing set. This internal validation showed that the SVM model was effective in defining RLNM status in individual patients (sensitivity and specificity of 68.3% and 81.1%, respectively, in the testing set), values similar to those of the three current imaging modalities. However, because our SVM model does not depend on nodal size or morphology, its advantage would be most marked for metastatic regional lymph nodes with a round shape and a size of <5 mm, which are a major limitation of current imaging modalities. Thus, our SVM model might serve as a useful tool to complement current imaging systems in refining RLNM status and in selecting RC patients for tailored preoperative adjuvant therapy.

Methods that combine data mining with molecular or genetic markers have shown potential for clinical implementation in predicting patient outcome in solid tumours. In non-small-cell lung cancer (NSCLC), a decision-tree analysis based on a five-gene signature (DUSP6, MMD, STAT1, ERBB3 and LCK) reached a 96% overall accuracy in predicting patient prognosis (Takahashi et al, 2007). Unlike other data mining algorithms, including neural networks (artificial and fuzzy) (Catto et al, 2009; Anderson et al, 2010), clustering and decision trees (Modlin et al, 2009; Dietzel et al, 2010), an SVM performs classification by constructing a hyperplane in an N-dimensional space that optimally separates the data into two categories. This two-category feature makes the SVM model well suited to predicting RLNM status, which has two classes (with RLNM vs without RLNM), in RC patients. In the present study, we applied the SVM model to choose robust markers for refining RLNM status from 13 candidate variables, including EMT-related biomarkers, as well as demographical, clinicopathological and serological variables. In colorectal cancer, EMT occurs at the invasive front of the tumour and acts as an important driving force for invasion and metastasis formation (Huber et al, 2005). Overexpression of the EMT driver Twist, for example, has shown a close correlation with nodal invasion and unfavourable outcomes in patients with colorectal cancer (Valdes-Mora et al, 2009; Okada et al, 2010). We therefore selected EMT signalling biomarkers, which might represent key players in the initiation and progression of RLNM in RC, to construct our SVM model. Interestingly, our univariate logistic regression analysis found that Snail, rather than Twist, E-cadherin, N-cadherin or β-catenin (cytoplasm and nucleus), was the only biomarker that significantly predicted RLNM status. However, we also found that integrating Snail with these five biomarkers, which individually had no significant correlation with RLNM, into the SVM model increased the OR for RLNM risk approximately two-fold compared with Snail alone (OR: 9.231 vs 4.286, Table 3). Taken together, our data showed that an approach integrating multiple markers, rather than a single marker, might reflect the progression of RLNM more precisely, leading to potential use in the tailored selection of RLNM patients for preoperative adjuvant therapy.

In colorectal cancer, a gene expression signature of 73 discriminating genes reached an accuracy of 88.4% in predicting the presence of RLNM (Watanabe et al, 2009). DNA microarray analysis showed that the EMT signalling molecules BAMBI, a transforming growth factor β inhibitor, and BCL9-2, a β-catenin coactivator, might be highly expressed in RLNM patients (Watanabe et al, 2009). Compared with these large gene signature-based models (Kwon et al, 2004; Fritzmann et al, 2009; Watanabe et al, 2009), IHC staining is easy to implement, and our combined IHC-SVM approach might be a useful decision-support tool in future clinical practice. As a complement to imaging, our SVM model has potential clinical implications for RC patients: (i) the subset predicted by our SVM model to be at higher RLNM risk could be given preoperative chemotherapy or chemoradiotherapy; (ii) the subgroup identified by our SVM model as at lower RLNM risk should proceed to surgery as soon as possible. In this subgroup, preoperative adjuvant treatment might result in unnecessary overtreatment, lead to serious side effects and cause patients to miss the optimal opportunity for effective surgery.

We also noticed that, compared with the 96% overall accuracy of the data mining method in predicting NSCLC prognosis and the 88.4% accuracy of gene profiling in predicting RLNM in colorectal cancer (Takahashi et al, 2007; Watanabe et al, 2009), our SVM model reached a lower predictive accuracy (72.3%) in refining RLNM status for RC patients. This might be due in part to the fact that other potentially robust variables, such as lymphovascular invasion (LVI) and perineural invasion (PNI), which denote an increased infiltrative growth pattern (Huh et al, 2010), were not included in our study. Thus, an SVM predictive model that integrates LVI, PNI and EMT signalling markers, complemented by imaging, might reach a higher accuracy in our ongoing prospective studies. This study also has limitations. One limitation was the absence of an independent subset of patients with round lymph nodes <5 mm in size. Furthermore, an independent, multicentre validation set with a larger sample size should be used to test the predictive value of our SVM model.

Taken together, our study showed that an SVM model integrating multiple markers, including EMT-related biomarkers and clinicopathological features, had a strong capacity to predict RLNM status in individual RC patients, offering a promising method for selecting patients for preoperative adjuvant therapy.