Classifying gastric cancer using FLORA reveals clinically relevant molecular subtypes and highlights LINC01614 as a biomarker for patient prognosis

Molecular-based classifications of gastric cancer (GC) were recently proposed, but few of them robustly predict clinical outcomes. While mutation and expression signature of protein-coding genes were used in previous molecular subtyping methods, the noncoding genome in GC remains largely unexplored. Here, we developed the fast long-noncoding RNA analysis (FLORA) method to study RNA sequencing data of GC cases, and prioritized tumor-specific long-noncoding RNAs (lncRNAs) by integrating clinical and multi-omic data. We uncovered 1235 tumor-specific lncRNAs, based on which three subtypes were identified. The lncRNA-based subtype 3 (L3) represented a subgroup of intestinal GC with worse survival, characterized by prevalent TP53 mutations, chromatin instability, hypomethylation, and over-expression of oncogenic lncRNAs. In contrast, the lncRNA-based subtype 1 (L1) has the best survival outcome, while LINC01614 expression further segregated a subgroup of L1 cases with worse survival and increased chance of developing distal metastasis. We demonstrated that LINC01614 over-expression is an independent prognostic factor in L1 and network-based functional prediction implicated its relevance to cell migration. Over-expression and CRISPR-Cas9-guided knockout experiments further validated the functions of LINC01614 in promoting GC cell growth and migration. Altogether, we proposed a lncRNA-based molecular subtype of GC that robustly predicts patient survival and validated LINC01614 as an oncogenic lncRNA that promotes GC proliferation and migration.


Comparison of the fraction of reads derived from noncoding regions
To comprehensively characterize the noncoding transcriptome of GC, we reanalyzed the whole transcriptome sequencing data of 407 TCGA samples, including 375 GC and 32 tumoradjacent samples from 380 patients (Table S1) using the FLORA pipeline. The total mapped reads in the raw bam file and bam files sequentially filtered by the generateFilteredBam to remove the protein coding genes (gene type as protein coding in GENCODE v27), other transcripts (including immunoglobin and T-cell receptor family, mitochondrial genes, miRNA, misc_RNA, pseudogenes, rRNA, ribozyme, sRNA, scRNA, scaRNA, snRNA, snoRNA and vaultRNA in GENCODE v27) and rRNAs (downloaded from RefSeq), and mapping quality (with MAPQ below 10 removed) were counted. The remaining reads are counted towards the "potential known and novel lncRNAs" that are useful in noncoding transcript assembly.
The fraction of reads derived from potential known and novel lncRNAs are compared between paired tumor and tumor-adjacent normal samples from 32 patients in TCGA using paired t-test.
Among the 32 pairs, we observed that the fraction of noncoding region-derived regions was significantly higher in tumor compared to normal (P = 0.0045). Higher or similar fraction of noncoding reads was observed in tumor compared to tumor-adjacent samples, in the majority of cases. However, 5 normal samples derived from the same institute (sample ID: TCGA-HU-A4GY-11A, TCGA-HU-A4GH-11A, TCGA-HU-A4GP-11A, TCGA-HU-A4GC-11A, TCGA-HU-A4HB-11A) showed unusually higher fraction of noncoding region-derived reads (fraction ranging from 3.0% to 7.8%) than other 27 normal samples (0.9% to 2.8%), suggesting potential batch effects in these 5 samples. Thus, these 5 normal and 5 paired tumor samples are excluded from this analysis.

Gene expression calculation
The featureCount module [8] from R package 'Subread' was used to call read counts of annotated genes in GENCODE Release 27 (GRCh38.p10) and novel lncRNAs identified by FLORA, and normalized as FPKM (Fragments Per Kilobase per Million mapped fragments).

Differential expression analysis
Differential expression analysis of lncRNAs between tumor and normal samples was performed with DESeq2 [9]. Significantly upregulated or downregulated lncRNAs were selected by Benjamin-Hochberg adjusted P < 0.01.

Identification of LncRNA-based molecular subtype in TCGA
To identify lncRNA-based molecular subtypes, we conducted hierarchical clustering with Ward.D linkage. The distance metric was 1 -Pearson's correlation coefficient and the procedure was iterated 1,000 times with subsampling ratio of 0.8 using normalized expression of lncRNAs which were upregulated in tumor compared to normal samples. A lncRNA's expression level in tumor samples was normalized using the average expression of the lncRNA in tumor-adjacent normal samples: The normalized expression level was further center-normalized. The clustering of tumor samples was performed with R package 'ConsensusClusterPlus' [10]. To assess the ideal number of identified clusters, cumulative distribution functions (CDF) of consensus indexes were estimated for k (number of clusters) from 2 to 10.

Definition of lncRNA-based subtypes (L1/L2/L3) in independent cohorts
To define L1/L2/L3 subtypes in independent cohorts with only microarray data (but not RNAseq), we developed machine learning methods to classify GC patients. As the lncRNA-based molecular subtypes (L1/L2/L3) were well defined on TCGA patients, we first performed Wilcoxon rank-sum tests to extract features (both coding and noncoding genes) that are differentially expressed among subtypes using RNA-seq data of all TCGA GC patients. The microarray platforms capture most of coding gene but only a small number of noncoding genes, limiting the capability to directly define L1/L2/L3. Therefore, for a given patient cohort characterized by the microarray platform, we will use coding feature genes to predict their noncoding-expression subtypes. In particular, three support vector machine (SVM) classifiers were built to predict whether a sample is L1, L2 or L3 respectively.
To train the L1 classifier, we respectively selected top 1,000 differentially expressed genes (DEGs) between L1 and L2, as well as top 1,000 DEGs between L1 and L3. We then conducted principal component analysis on these DEGs to reduce the feature dimension. Subsequently, Wilcoxon rank-sum tests were performed on the principal components to select the top 50 principal components enriched in the L1 subtype. Using the Z-score normalized principal components, we tuned the hyperparameters of a SVM classifier by 10-fold cross-validation in the TCGA dataset. Finally, this classifier will be applied to the independent cohort for L1 subtyping. Similarly, L2 and L3 classifiers will be trained in the same manner. To determine the final subtype of a GC case, we will run the L3 classifier first. If it was classified as L3 the procedure will stop; otherwise, the L3-negative samples will be sequentially evaluated by L2 and L1 classifiers. The predicted subtype of samples that were not identified by any of the classifiers were labeled as "Not Available" (NA).

Survival analysis
Survival analysis was performed with python package 'lifelines' [11]. The associations between the expression of lncRNAs and survival was estimated by two approaches: (1) Cox regression analysis; (2) segregating patients into high-expression and low-expression groups by different cut-offs of FPKM level and calculating the P-value by log-rank test, and the result with the most significant P-value was reported.

Copy number alterations analysis
We used a cut-off of +/− 0.3 on the segment mean (log2 transformation of the copy number) to define copy number gain/loss, where approximately 99% of all segments in normal samples were below this threshold. GISTIC 2.0 (v2.0.12) was used to identify regions of the genome that were significantly gained or deleted across a set of samples using a Q-value cut-off <0.05 [12].

Analysis of DNA methylation level of lncRNA genes
To analyze the methylation level of all annotated and novel lncRNAs, the probes on

Characterization and prioritization of oncogenic lncRNAs in GC
The 50 lncRNAs that are upregulated and associated with poor prognosis in GC are prioritized by a priority score calculated as below:

RNA extraction, semiquantitative reverse transcription PCR, and real-time PCR analyses
Total RNA was extracted using TRIzol Reagent (Invitrogen according to the manufacturer's protocol.

LentiCRISPR for LINC01614 knockout experiment in GC cell lines
sgRNA (GTGTAAGGTACTCAAGTGCT) targeting LINC01614 was cloned into the lentiCRISPR vector. Following transformation, plasmids were purified and the insertion of sgRNA was confirmed by sequencing. The resulting vector or control vector (empty vector) was transfected into 293T cells with psPAX2 and pMD2.G using Lipofectamine2000 Transfection Reagent (Life Technologies, Carlsbad, CA). Lentiviral particles were produced to infect MKN28 cells. After infection with lentivirus, cells were selected with puromycin.

Colony formation assay
Cells were seeded in 48-well plates (2×10 2 cells/well) and cultured at 37°C in 5% CO2. After 10 days, the cells were washed with PBS and stained with crystal violet. ddH2O was used to wash the cells three times to obtain a clean background.

Cell migration assessment by wound-healing assay
Cells were cultured in 24-well plates with the density of 5×10 4 cells per well. After confluence, a wound was made across the well with a 200 μL pipette tip. The wound was photographed immediately. The cells migrated across the gap wound were observed and documented using an inverted microscope. Area of the gap was quantified using Image J. Two different areas in each well were imaged, measured and averaged in statistical analysis.

Cell proliferation assay (MTT assay)
After 15G of raw data were generated for each sample. The raw sequencing data was mapped to human reference genome hg38 using STAR [17], and the total read pairs mapped to each gene annotated in GENCODE release 27 were obtained using featureCounts [8] in The gene list ranked by fold change was used to perform Gene Set Enrichment Analysis [18,19] on all gene sets in the Mutation Signature Dataset v7.2 [20], and gene sets with significant associations (NOM p-val < 0.05 and FDR q-val < 0.25) in all cell lines were reported in the Table S8.

Data and code availability
The FLORA pipeline is deposited at http://wang-lab.ust.hk/software/Software.html.