A RNA-Sequencing approach for the identification of novel long non-coding RNA biomarkers in colorectal cancer

Long non-coding RNAs (lncRNAs) have been implicated in human pathology, however, their role in colorectal carcinogenesis have not been fully elucidated. In the current study, whole-transcriptome analysis was performed in 3 pairs of colorectal cancer (CRC) and matched normal mucosa (NM) by RNA sequencing (RNA-seq). Followed by confirmation using the Cancer Genome Atlas (TCGA) dataset, we identified 27 up-regulated and 22 down-regulated lncRNAs in CRC. Up-regulation of four lncRNAs, hereby named colorectal cancer associated lncRNA (CRCAL)-1 [AC021218.2], CRCAL-2 [LINC00858], CRCAL-3 [RP11-138J23.1] and CRCAL-4 [RP11-435O5.2], was further validated by real-time RT-PCR in 139 colorectal neoplasms and matched NM tissues. Knockdown of CRCAL-3 and CRCAL-4 in colon cancer cells reduced cell viability and colony formation ability, and induced cell cycle arrest. TCGA dataset supported the associations of CRCAL-3 and CRCAL-4 with cell cycle and revealed a co-expression network comprising dysregulated lncRNAs associated with protein-coding genes. In conclusion, RNA-seq identified numbers of novel lncRNAs dysregulated in CRC. In vitro experiments and GO term enrichment analysis indicated the functional relevance of CRCAL-3 and CRCAL-4 in association with cell cycle. Our data highlight the capability of RNA-seq to discover novel lncRNAs involved in human carcinogenesis, which may serve as alternative biomarkers and/or molecular treatment targets.

It is estimated that more than 70% of the human genome is transcribed into RNA, but only up to 2% is translated to proteins; hence, majority of RNA do not serve as a blue print for protein coding genes. RNA molecules which do not encode proteins are called non-coding RNAs (ncRNAs), and historically, most of them in the past were considered as transcriptional noise. Based on their length, ncRNAs are divided into two subgroups; small ncR-NAs which are shorter than 200 nucleotides, and long ncRNAs (lncRNAs) that consist of 200 nucleotides or more in length [1][2][3] . Recent decade has witnessed a growing recognition for the functional relevance of microRNAs, a subgroup of small ncRNAs, as transcriptional repressors by virtue of their interaction with the 3′UTR regions of their downstream target genes. MicroRNAs are known to be involved in cellular differentiation, proliferation and apoptosis, and their dysregulation is known to associate with various human malignancies 4 . In contrast to miRNAs, the biological role of lncRNAs still remain poorly understood, and are an active area of investigation. However, cell-type and developmental time-point specific expression patterns and conserved sequences of lncR-NAs raise the possibility that they also possess functional significance in the biological context [1][2][3] . In fact, functional importance of several lncRNAs have been recently elucidated. For example, HOTAIR recruits polycomb repressive complex 2 to specific target genes, leading to epigenetic re-programming, and its increased expression levels were linked to progression of breast and gastric cancers 5,6 . Another lncRNA, MALAT1, is known to regulate gene expression and alternative splicing, and has been linked to lung and several other human cancers 7 . Nonetheless, majority of lncRNAs have not been well characterized. Given the abundance of lncRNAs existing in human genome, there are perhaps a number of uncharacterized lncRNAs that possibly play key roles in human cancers. Therefore, it would be important to identify and investigate novel lncRNAs involved in human carcinogenesis, which are potentially relevant as biomarkers and/or molecular targets for treatment. RNA sequencing (RNA-seq) is an approach to analyze whole-transcriptome using the next generation sequencing technology, which enables to virtually reconstruct an entire transcriptome, including lncRNAs. With its advantages in terms of a greater dynamic range and the ability to discover novel transcripts, RNA-seq is capable of identifying unknown lncRNAs involved in human pathology 8 . Indeed, RNA-seq technology has been utilized to discover novel lncRNAs in various diseases including prostate 9 , breast 10 , and gastric 11 cancers.
Colorectal cancer (CRC) is one of the leading causes of cancer-related deaths in the Unites States, and more than 50,000 patients die of this disease annually 12 . Although molecular alterations involved in CRC have been well-known in terms of genetic mutations as well as epigenetic alterations such as DNA methylation 13,14 , the role of lncRNAs and their dysregulation in colorectal carcinogenesis has yet not been fully elucidated.
In the current study, we conducted a systematic and comprehensive identification of novel lncRNAs involved in colorectal carcinogenesis. To this end, we performed RNA-seq using matched cancerous and non-cancerous human colon tissues, followed by the validation of dysregulated lncRNAs by analyzing the Cancer Genome Atlas (TCGA) database (http://cancergenome.nih.gov/) and by real-time RT PCR. The aim of this study was to identify novel lncRNAs associated with colorectal carcinogenesis as alternative biomarkers and/or treatment targets for CRC by using RNA-seq technology.

RNA-seq read mapping.
A splice-aware mapping solution was implemented for RNA-seq read alignment.
The alignment index was built on hg19 genome (including 25 chromosomes and other 68 unplaced contigs) combined with total junction flanking TRANSCRIPTOMIC sequence summarized from GENCODE, EMSEMBLE and REFSEQ annotations. The junction flanking sequence length was defined by the read length subtract 5. Novoalign+ V2.08.01 was used for alignment. Redundant mapping at the same locus for both genome and transcriptome was consolidated as one single hit. The read count for each annotated transcript was then derived from mapped reads by Rsubread 15 . Some of the key statistics of read mapping is shown on Table 1.

Identification of dysregulated lncRNAs by RNA-seq.
A heatmap generated from expression of differentially expressed lncRNAs detected by edgeR on three pairs of matched CRC and NM tissues showed distinct expression patterns of these lncRNAs between CRC and NM tissues. (Fig. 1a) Heatmap plot for the expression of these lncRNAs on TCGA dataset is also shown in Fig. 1a. By analyzing in-house RNA-seq data, 72 lncRNAs were found to be significantly dysregulated in CRC compared to NM tissues. Of these, 27 of 36 up-regulated lncRNAs and 22 of 36 down-regulated lncRNAs were confirmed by TCGA dataset ( Table 2). Dysregulation of CCAT1 16 , UCA1 17 , and MEG3 18 has been previously linked to CRC, while dysregulation of LINC00974 19 and TRPM2-AS 20 have been reported in hepatocellular carcinoma and prostate cancer, respectively. In addition, RP11-115D19. 1 and TINCR have been functionally associated with certain biological contexts 21,22 . Thus, as a result of in-house RNA-seq and confirmation by TCGA dataset, we could identify 42 novel lncRNAs which have not been previously well documented. When we visualized the RNA-seq reads at each dysregulated lncRNA by using IGV, in contrast to many protein-coding genes showing much larger read counts, majority of dysregulated lncRNAs had very small read counts which were mostly less than 10 (Fig. 1b,c).

Knockdown of CRCAL-3 and CRCAL-4 in colon cancer cells.
To gain further insight into whether these dysregulated lncRNAs have any functional role in CRC, we performed knockdown of CRCAL-3 and CRCAL-4 by transfecting siRNAs in colon cancer cells, HCT116 and SW620. As expected, levels of both CRCAL-3 or CRCAL-4 decreased significantly after transfection of siRNA specific for respective lncRNA, confirming their successful knockdown (Fig. 3a). CRCAL-3 knockdown resulted in decreased cell viability by MTT assay, reduced colony formation ability, and cell cycle arrest at G0/G1 in both cell lines. Knockdown of CRCAL-4 showed similar effects in HCT116 cells, however, it caused minimal inhibition of cell viability but no obvious effects on colony formation ability nor cell cycle progression in SW620 cells ( Fig. 3b-d).

Analyses of CRCAL-3 and CRCAL-4 expression in TCGA dataset.
To further validate the significance of CRCAL-3 and CRCAL-4 in colorectal carcinogenesis using an independent dataset, we again utilized RNA-seq data of 682 colon cancers and 41 normal tissues from TCGA database. First, we compared the expression levels of CRCAL-3 and CRCAL-4 between colon cancers and normal tissues, and as mentioned above, we confirmed the significant up-regulation of two lncRNAs in colon cancers (Fig. 4a). Since we observed the relation of two lncRNAs to cell cycle by in vitro knockdown experiments, we next analyzed the association of either CRCAL-3 or CRCAL-4 with cell cycle. We performed the GO term enrichment analysis (Fig. 4b), and found that the ranks of correlation between CRCAL-3 (CRCAL-4) and cell cycle (GO:0007049) genes were significantly higher than those between CRCAL-3 (CRCAL-4) and background genes, which indicates the significant association of these lncRNAs with cell cycle. Finally, we drew a co-expression network comprised of vertices, which represent differentially expressed lncRNAs or protein-coding genes from RNA-seq datasets, and edges, which represent the co-expression (measured by Pearson's correlation) of lncRNAs and protein-coding genes in colon cancer tissues. As shown in Fig. 4c, some of hub lncRNAs in the network have been verified either by literatures (CCAT1 in 16 , MEG3 in 18 , LINC00974 19 and TINCR in 21,22 ) or by RT-PCR as well as in vitro knockdown in our experiments (CRCAL-1, CRCAL-2, CRCAL-3 and CRCAL-4).
There are large overlaps between the target protein-coding genes of CRCAL-1, CRCAL-3 and CRCAL-4, therefore these three lncRNAs may function together in a pathway that is different than that of CRCAL-2.

Discussion
In the current study, we performed a systematic and comprehensive identification of dysregulated lncRNAs in CRC, and found four lncRNAs, CRCAL-1, CRCAL-2, CRCAL-3 and CRCAL-4, as novel lncRNAs involved in colorectal carcinogenesis. First, by using RNA-seq technology followed by analysis of TCGA dataset, we discovered distinct lncRNA expression patterns between CRC and adjacent NM tissues. Furthermore, we could identify a number of candidate lncRNAs that were dysregulated in CRCs. Looking into RNA-seq data, the read depths for these lncRNAs were generally small. In fact, maximum read counts were less than 10 in most of our candidate lncRNAs. The small read depth can be caused by technical issues regarding RNA-seq. However, considering that the read depths of many protein-coding genes were much larger, the low read depths of lncRNAs is more likely to reflect the limited abundance of lncRNA molecules compared to those of mRNAs of protein-coding genes. In fact, out of 72 lncRNAs found by our in-house RNA-seq analysis, we were able to confirm the dysregulation of 49 lncRNAs by utilizing TCGA dataset in 682 colon cancers. Moreover, we could further validate the up-regulation of four CRCALs in an independent cohort of 139 pairs of colorectal tumors and adjacent mucosa by means of real-time RT-PCR. Thus, although the low read depths made it difficult to distinguish lncRNA sequence from artifacts, and we used as little as three pairs of CRC and NM tissues for initial discovery by RNA-seq, our bioinformatics analyses enabled successful identification of novel lncRNAs associated with colorectal carcinogenesis. Collectively, RNA-seq approach appears to be a useful technology to discover novel lncRNAs that are involved in CRCs, and perhaps in other human cancers. Although we identified novel CRCALs up-regulated in CRCs, their functional roles have not been previously elucidated. Since expression levels of all four CRCALs were elevated in CA tissues, and no obvious stepwise increase during the course of CRC progression was observed, these might be involved in the very early steps of neoplastic process. Given this early dysregulation of CRCALs in colorectal carcinogenesis, these noncoding RNAs might serve as potential biomarkers for early detection of CRC. Therefore, in the future studies, it is important to determine whether their dysregulation is also detectable by using noninvasively collected samples such as blood. In addition, it should be further investigated if CRCRLs have any association with known biomarkers of CRC such as microsatellite instability (MSI). In colon cancer cells, we observed that knockdown of CRCAL-3 and CRCAL-4 reduced cell viability and colony formation ability, and induced cell cycle arrest at G0/G1 phase. The associations of these lncRNAs with cell cycle were further supported by GO term enrichment analysis performed by using TCGA dataset, indicating the functional relevance of CRCAL-3 and CRCAL-4 in CRC. By analyzing the TCGA data, we also found a strong association between the expression levels of dysregulated lncRNAs and those of protein-coding genes forming a co-expression network. Although we only focused on functional relevance of two CRCALs, such co-expression network associating multiple lncRNAs and protein-coding genes may play important roles in driving colorectal neoplasia.
In conclusion, we conducted a systematic and comprehensive study to identify novel lncRNAs involved in colorectal carcinogenesis by using RNA-seq technology. We identified CRCAL-1, CRCAL-2, CRCAL-3 and CRCAL-4 as up-regulated lncRNAs in CRC in two independent cohorts. Functional relevance of CRCAL-3 and CRCAL-4 related to cell cycle was suggested by in vitro experiments as well as by GO term enrichment analysis in the TCGA dataset. Our data highlight the capability of RNA-seq technology to discover novel lncRNAs involved in human carcinogenesis, which may serve as alternative biomarkers and/or molecular treatment targets for human cancers.   Cell viability, cell cycle, and colony formation assays. Cell viability was determined using the MTT (3-(4,5-dimethylthiazole-2-yl)-2,5-diphenyl tetrazolium bromide) assay (Sigma-Aldrich, catalog number M5655) as previously described 24 . Cells were transfected with either siRNA specific for CRCAL-3, CRCAL-4 or negative control siRNA, and re-plated at 5 × 10 3 in 96-well plates after 48 hours incubation. Optical density (OD) was determined at 565 nm by spectrophotometry (Infinite M200 PRO, Tecan, Männedorf, Switzerland) at 24, 48, 72, 96 and 120 hours after re-plating. Cell cycle analysis was conducted 96 hours after siRNA transfection using the Cell Cycle Assay Kit (Merck Millipore, Billerica, MA, U.S.A., catalog number MCH100106) and the Muse Cell Analyzer (Merck Millipore) according to the manufacturer's instructions. For colony formation assays, cells were re-plated at 5 × 10 2 in 6-well plates 72 hours after siRNA transfection. About 14 days later, cells were fixed and then stained by 0.5% crystal violet (Sigma-Aldrich, catalog number HT90132), and the number of colonies was counted using the GeneTools image analysis software (Syngene, Frederick, MD, U.S.A.). All experiments were conducted in at least two independent times. TCGA data analyses. RNA-seq data for colon cancer (682 samples) as well as normal colon tissues (41 samples) were downloaded from Cancer Genomics Hub. Differential gene expression analysis was performed on this dataset to verify the differentially expressed genes found from 3-pair RNA-Seq dataset. Large sample size of TCGA dataset enables us to perform correlation-based gene set enrichment analysis. Pearson's correlation test between lncRNA of interest and other gene was performed and ranked. GO term enrichment analysis was performed on the ranked gene set through topGO (http://www.bioconductor.org/packages/release/bioc/ html/topGO.html). TCGA dataset also enables us to build transcriptional co-expression network. The edges of the co-expression network were chosen based on the correlation between lncRNA and protein-coding gene across TCGA colon cancer samples (0.5% of strongest negative correlation and 0.5% of strongest position correlation). The vertices of the co-expression network were chosen based on differentially expressed lncRNAs and protein-coding genes on both TCGA RNA-Seq dataset (FDR < 0.05) and 3-pair RNA-Seq dataset (FDR < 0.2). Statistical analysis. Differential gene expression of RNA-seq data was analyzed by edgeR 25 . Read counts were fitted into Negative Binomial distribution with two GLM models: one model has only one regressor (patient), whereas the other model has two regressors (patient and treatment; CRC or NM). And then a pairwise comparison between matched CRC and NM was performed using likelihood ratio test between the two GLM models. Genes with a false discovery rate (FDR) less than 0.20 on 3 pairs RNA-Seq dataset (FDR < 0.05 on TCGA dataset) were considered to be significantly dysregulated. Statistical analyses to compare the lncRNA levels measured by real-time RT-PCR were carried out using JMP ® 10 (SAS institute Inc., Cary, NC, U.S.A.). The Wilcoxon signed-rank test was conducted for the comparison between matched colorectal tumor and NM tissues. The Kruskal-Wallis test was performed to compare lncRNA levels among tumor stages, and the Steel-Dwass test was used to perform all-paired multiple comparisons. All experimental data were presented as mean ± SD, and the Student's t-test was used to compare the differences between groups. All P-values were two-sided and a P-value of <0.05 was considered significant.