Machine learning-based integration develops an immune-derived lncRNA signature for improving outcomes in colorectal cancer

Long noncoding RNAs (lncRNAs) are recently implicated in modifying immunology in colorectal cancer (CRC). Nevertheless, the clinical significance of immune-related lncRNAs remains largely unexplored. In this study, we develope a machine learning-based integrative procedure for constructing a consensus immune-related lncRNA signature (IRLS). IRLS is an independent risk factor for overall survival and displays stable and powerful performance, but only demonstrates limited predictive value for relapse-free survival. Additionally, IRLS possesses distinctly superior accuracy than traditional clinical variables, molecular features, and 109 published signatures. Besides, the high-risk group is sensitive to fluorouracil-based adjuvant chemotherapy, while the low-risk group benefits more from bevacizumab. Notably, the low-risk group displays abundant lymphocyte infiltration, high expression of CD8A and PD-L1, and a response to pembrolizumab. Taken together, IRLS could serve as a robust and promising tool to improve clinical outcomes for individual CRC patients.


Consensus clustering
According to the infiltration profile of various immune cells, a resampling-based method termed consensus clustering was applied for cluster discovery. This process was performed by ConsensusClusterPlus package 1 . Subsample 80% of samples at each iteration and partition each subsample into up to k (max K = 9) groups by k-means algorithm upon Euclidean distance. This process was repeated for 1,000 repetitions.
Subsequently, the consensus score matrix, cumulative distribution function (CDF) curve, proportion of ambiguous clustering (PAC) score, and Nbclust were synthetically used to determine the optimal number of clusters. A higher consensus score between two samples indicates they are more likely to be grouped into the same cluster in different iterations. The consensus values range from 0 (never clustered together) to 1 (always clustered together) marked by white to dark brown. In the CDF curve of a consensus matrix, the lower left portion represents sample pairs rarely clustered together, the upper right portion represents those almost always clustered together, whereas the middle segment represents those with ambiguous assignments in different clustering runs. The "proportion of ambiguous clustering" (PAC) measure quantifies this middle segment; and is defined as the fraction of sample pairs with consensus indices falling in the interval (u1, u2) ∈ [0, 1] where u1 is a value close to 0 and u2 is a value close to 1 (for instance u1=0.1 and u2=0.9). A low value of PAC indicates a flat middle segment, and a low rate of discordant assignments across permuted clustering runs. PAC for each K is CDFk(u2) -CDFk(u1) 2 . According to his criterion, we can therefore infer the optimal number of clusters by the K value having the lowest PAC.
The Nbclust uses 26 mathematic criteria to select the optimal number.

Signature generated from machine learning based integrative approaches
To develop a consensus immune-related lncRNA signature (IRLS) with high accuracy and stability performance, we integrated 10 machine learning algorithms including random survival forest (RSF), elastic network (Enet), Lasso, Ridge, stepwise Cox, CoxBoost, partial least squares regression for Cox (plsRcox), supervised principal components (SuperPC), generalized boosted regression modeling (GBM), and survival support vector machine (survival-SVM). A few algorithms possessed the ability of feature selection, such as Lasso, stepwise Cox, CoxBoost, and RSF. Thus, we combined these algorithms to generate a consensus model. In total, 101 algorithm combinations were conducted to fit prediction models based on the leave-one-out cross-validation (LOOCV) framework. The initial signature discovery was performed in TCGA-CRC.
The RSF model was implemented via the randomForestSRC package. RSF had two parameters ntree and mtry, where ntree represented the number of trees in the forest and mtry was the number of randomly selected variables for splitting at each node. We used a grid-search on ntree and mtry using LOOCV framework. All the pairs of (ntree, mtry) are formed and the one with the best C-index value is identified as the optimized parameters. The Enet, Lasso, and Ridge were implemented via the glmnet package. The regularization parameter, λ, was determined by LOOCV, whereas the L1-L2 trade-off parameter, α, was set to 0-1 (interval =0.1). The stepwise Cox model was implemented via survival package. A stepwise algorithm using the AIC (Akaike information criterion) was applied, and the direction mode of stepwise search was set to "both", "backward", and "forward", respectively. The CoxBoost model was implemented via CoxBoost package, which is used to fit a Cox proportional hazards model by componentwise likelihood-based boosting. For the CoxBoost model, we used LOOCV routine optimCoxBoostPenalty function to first determine the optimal penalty (amount of shrinkage). Once this parameter was determined, the other tuning parameter of the algorithm, namely, the number of boosting steps to perform, was selected via the function cv.CoxBoost. The dimension of the selected multivariate Cox model was finally set by the principal routine CoxBoost. The plsRcox model was implemented via plsRcox package. The cv.plsRcox function was used to determine the number of components requested, and the plsRcox function was applied to fit a partial least squares regression generalized linear model. The SuperPC model was implemented via superpc package, is a generalization of principal component analysis, which generates a linear combination of the features or variables of interest that capture the directions of largest variation in a dataset. The superpc.cv function used a form of LOOCV to estimate the optimal feature threshold in supervised principal components. To avoid problems with fitting Cox models to small validation datasets, it uses the "pre-validation" approach.
The GBM model was implemented via superpc package. Using the LOOCV, the cv.gbm function selected index for number trees with minimum cross-validation error.
The gbm function was used to fit the generalized boosted regression model. The survival-SVM model was implemented via survivalsvm package. The regression approach takes censoring into account when formulating the inequality constraints of the support vector problem.

RNA preparation and Quantitative Real-Time PCR (qRT-PCR)
Total RNA was isolated from CRC tissues using RNAiso Plus reagent RNA quality was evaluated using a NanoDrop One C (Waltham, MA, USA), and RNA integrity was assessed using agarose gel electrophoresis. An aliquot of 1 μg of total RNA was reverse transcribed into complementary DNA (cDNA) according to the manufacturer's protocol using a High-Capacity cDNA Reverse Transcription kit (TaKaRa BIO, Japan). qRT-PCR was performed using SYBR Assay I Low ROX (Eurogentec, USA) and SYBR® Green PCR Master Mix (Yeason, Shanghai, China) to detect the expression of 16 lncRNAs expression. The expression value was normalized to GAPDH, and then log2 transformed for subsequent analysis. The primer sequences of the included 16 lncRNAs and GAPDH were shown in Supplementary Data 6.