Long intergenic non-coding RNA expression signature in human breast cancer

Breast cancer is a complex disease, characterized by gene deregulation. There is less systematic investigation of the capacity of long intergenic non-coding RNAs (lincRNAs) as biomarkers associated with breast cancer pathogenesis or several clinicopathological variables including receptor status and patient survival. We designed a two-stage study, including 1,000 breast tumor RNA-seq data from The Cancer Genome Atlas (TCGA) as the discovery stage, and RNA-seq data of matched tumor and adjacent normal tissue from 50 breast cancer patients as well as 23 normal breast tissue from healthy women as the replication stage. We identified 83 lincRNAs showing the significant expression changes in breast tumors with a false discovery rate (FDR) < 1% in the discovery dataset. Thirty-seven out of the 83 were validated in the replication dataset. Integrative genomic analyses suggested that the aberrant expression of these 37 lincRNAs was probably related with the expression alteration of several transcription factors (TFs). We observed a differential co-expression pattern between lincRNAs and their neighboring genes. We found that the expression levels of one lincRNA (RP5-1198O20 with Ensembl ID ENSG00000230615) were associated with breast cancer survival with P < 0.05. Our study identifies a set of aberrantly expressed lincRNAs in breast cancer.

Global changes of the lincRNA species in breast cancer. Our transcriptomic analyses focused on the lincRNA species, a specific type of lncRNAs. First, we investigated the lincRNA expression profile in 85 paired samples of primary breast tumor tissue and adjacent normal tissue from TCGA. After stringent filtering (see Methods), we totally identified 2,171 lincRNAs expressed in at least one sample. As expected, most lincRNAs expressed at low levels and sparsely expressed in < 10% of all samples ( Figure S1). Similar observations were also reported elsewhere [22][23][24] . For subsequent analysis, we focused on 584 lincRNAs that expressed in at least 20% of samples. Compared with the matched adjacent normal tissues, we observed that these lincRNAs showed a global decreased expression profile in breast tumors ( Figure S2). The similar trend has been reported for small non-coding RNAs in multiple cancer types 25,26 . Specifically, among these 584 lincRNAs, 83 were significantly differential expression (DE) with a permutation-based FDR < 1% (Fig. 1A). Among them, 60 lincRNAs displayed down-regulation in breast tumor tissues.

Validation of lincRNA expression in an independent study. To validate the DE lincRNAs identified
in TCGA, we generated an independent RNA-seq dataset from 50 pairs of matched tumor and adjacent normal samples from breast cancer patients and 23 normal breast tissues from healthy women in the replication study. Among the 83 DE lincRNAs, 66 lincRNAs (80%) showed consistent expression change direction (Fig. 1B). Specifically, 37 lincRNAs (45%) were replicated with absolute log2-FC ≥ 1 and P < 0.05 in the 50 pairs of matched tumor and adjacent normal breast tissues in replication stage ( Table 2). Over 80% of them also showed significant expression difference in the tumors compared to normal breast tissues in the validation stage (Table 2). When we relaxed the cutoff to absolute log2-FC ≥ 0.585 and P < 0.05 in the replication dataset (Table S2), 47 DE lincRNAs (57%) could be replicated in our independent study, which is consistent with the previous report 19 .
We further examined the expression profiling of the 37 replicated lincRNAs in the remaining 830 breast tumor tissues samples without matched normal tissues from the TCGA dataset. The expression patterns of these 37 lin-cRNAs in the 830 tumor samples were very similar to those in 85 normal-paired breast tumor samples, but very different from those in adjacent normal breast tissue samples in the discovery stage (Fig. 1C). Principal component analysis (PCA) based on these 37 lincRNAs also showed that all breast tumor samples are clustered together, but separated from the adjacent normal tissues (with a few exceptions) (Fig. 1D). In addition, we also checked the expression levels of these 37 lincRNAs across 14 breast cancer cell lines that are commonly used in the lab (Table S3). Most of these 37 lincRNAs were expressed in at least one common breast cancer cell line, making these  11 TFs exhibited a significant differential expression (absolute log2-FC ≥ 0.585, BH-adjusted P < 0.01) in breast cancer tissues compared with adjacent normal tissues (Fig. 2B). We also found that several over-expressed lincRNAs may be related to these key TFs. For example, two lincRNA genes (RP11-279F6 with Ensemble ID ENSG00000245750 and GATA3-AS1 with Ensemble ID ENSG00000197308) had DNA binding peaks by 16 and 20 TFs, respectively. Specifically, all three over-expressed key regulators in breast tumors, GATA3, E2F1 and RAD21, showed connection with these two lincRNAs (Fig. 2C,D). These results suggest that the expression alteration of lincRNAs in breast cancer is probably functionally associated by one or multiple TFs in a complex way.
Co-expression of lincRNAs with nearby protein-coding genes. Several studies have reported that lincRNAs show an expression correlation with their neighboring protein-coding genes in normal tissues/cells 5,27 . We examined whether 37 DE lincRNAs could also exhibit the differential co-expression profile with the adjacent genes in breast cancer development. We classified these 37 lincRNA/mRNA pairs into three types: head-head  (H2H), head-tail (H2T) and tail-tail (T2T) based on the paired transcriptional orientation (Fig. 3A). We observed a coordinated expression change (r 2 = 0.25, P = 1.0 × 10 −4 ) between lincRNAs and their neighbor mRNAs via a comparison of 85 paired breast tumor and adjacent normal breast tissues ( Fig. 3A and B). We detected 14 (37.8%) lincRNA/mRNA pairs with the H2H type, which shows marginally significant over-representation (p = 0.06, one-sided Binomial test) relative to the global distribution. Similar results were observed when we analyzed 83 initially detected DE lincRNAs ( Figure S3). Taking the H2H pair of GATA3-AS1 (Ensembl ID ENSG00000197308) and its adjacent (aparting from 1220 bp) protein-coding gene GATA3 as an example ( Figure S4), the correlation coefficient between these two genes was 0.54 with P = 6.7 × 10 −5 . These observations are consistent with the previous reports that both GATA3-AS1 and GATA3 are co-expressed in mouse and human TH2 cells 28,29 . Altogether, the lincRNAs divergently transcribed with protein-coding genes are more likely to show the differentially co-expression profiles.
To further determine the uniqueness of the differential co-expression observation of lincRNA-mRNA adjacent pairs, we characterized the co-expression patterns of non-neighboring lincRNA-mRNA pairs and randomly shuffled lincRNA-mRNA pairs using the same 85 pairs of breast cancer/adjacent normal tissue in discovery stage (see Methods). As expected, there were no co-expression changes for the random pairs ( Figure S5). In addition, the neighboring lincRNA-mRNA pairs showed a higher Spearman's rank correlation coefficient than those of non-neighboring lincRNA-mRNA pairs (Fig. 3B).

Functional Prediction of lincRNAs.
To date, thousands of lincRNAs have been annotated, while the biological functions are unclear for most of them. Based on the Refseq genes showing strong co-expression relationship with lincRNAs, a method commonly used for functional prediction of unknown genes 30,31 , we predicted the biological functions for these 37 lincRNAs (see Methods). Enrichment analysis of GO terms and KEGG pathways showed that these down-regulated lincRNAs might be associated with transcriptional regulation, RNA processing and translational elongation processes, etc. In contrast, the up-regulated lincRNAs probably participate in cancer cell migration and proliferation, including cell adhesion, regulation of epithelial cell proliferation and regulation of cell cycle (Table S5). For example, two over-expressed lincRNAs (RP11-417E7 with Ensemble ID ENSG00000261039 and AC093850 with Ensemble ID ENSG00000230838) showed over-expression in both the discovery and the replication stage. These two lincRNAs may be involved in the ECM-receptor interaction and cell adhesion, TGFβ signaling pathway and others (Fig. 3C-F). The TGFβ signaling pathway is well documented with a promoter of tumor progression and invasion 32 . Together, these two lincRNAs probably participate in the pathogenesis of breast cancer.
LincRNAs associated with breast cancer subtypes. We also investigated whether these 37 DE lin-cRNAs exhibited expression difference across different breast cancer subtypes. We found three lincRNAs: GATA3-AS1 (ENSG00000197308), RP11-279F6 (ENSG00000245750) and AC017048 (ENSG00000224577), showed specifically high expression levels in ER-positive (ER+ ), compared to ER-negative (ER-) cancers and normal breast tissue samples (Fig. 4A). The specific expression alteration of these three lincRNAs in ER+ subtype was also validated in the replication stage. We also used the data of differentially expressed lincRNAs across four breast cancer subtypes (Luminal A, Lumnal B, Her2, and Basal-like) reported in Su et al. study 30 to evaluate the subtype associations for these 37 lincRNAs. We found that the 22 (59.5%) lincRNAs (8 up-regulated lincRNAs and 14 down-regulated lincRNAs) were differentially expressed in at least one breast cancer subtype (Table S6). Three above reported lincRNAs over-expressed in the ER+ subtype (Fig. 4A) showed much higher expression levels in Luminal A and Lumnal B subtypes (enriched for ER+ ), relative to Her2, and Basal-like subtypes.
Further integrating two independent estrogen receptor alpha (ERα ) ChIP-seq dataset (each with two replicates) in MCF-7 cancer cells, we observed the ERα binding sites in all three lincRNA loci (Fig. 4B-D). For example, an ERα DNA binding site was found near the transcriptional terminal region of the GATA3-AS1 based on the analysis of ERα ChIP-seq data (Fig. 4B). This binding region was further annotated as the active enhancer region using chromatin states in human mammary epithelial cells (HMECs). Similarly, this enhancer state was also observed in the other two lincRNAs (Fig. 4C,D). These results indicated that these three ER status associated lincRNAs may be regulated by ERα .
LincRNAs associated with breast cancer survival. We assessed the association of those 37 lincR-NAs expression levels with breast cancer survival and found one candidate (RP5-1198O20 with Ensembl ID ENSG00000230615, Fig. 5). Patients with expression levels of lincRNA ENSG00000230615 higher than the median value (median FPKM value is 2.8) had worse survival rates than those women with expression levels less than the median value (median FPKM value is 0.5) (Fig. 5).

Discussion
In this study, we analyzed lincRNA transcriptome in over 1000 breast tissue samples. Two independent sequencing datasets consistently identified a set of lincRNAs deregulated in breast carcinogenesis. Importantly, the protein-coding genes neighboring these deregulated lincRNA loci also showed expression alternation in breast cancer tissues, implying the transcriptional concordance between lincRNAs loci and neighboring genomic regions in cancer development. The expression aberration of lincRNAs in breast cancer may be associated with the expression alteration of multiple TFs. Our work substantially extends the biological understanding of the lincRNA repertoire in the pathogenesis of breast cancer.
It should be noted that the aberrant expression of several lincRNAs previously identified in multiple cancers showed deregulation in the present study. For example, in both discovery and validation stages, expression levels of the MEG3 decreased in breast tumor samples, which is consistent with the proposed tumor suppressor role for MEG3 12,30 . Another un-regulated lincRNA, GATA3-AS1 (Ensembl ID ENSG00000197308) was also identified in Ding et al. study 33 . Our in silico functional prediction (Table S5) indicated the GATA3-AS1 probably performs an immune response associated role in breast cancer progression. This result was supported by previous two studies 29,34 showing that the GATA3-AS1 is highly expressed in T helper subsets. Spurlock et al. also speculated that the GATA3-AS1 might play a role in allergic or asthmatic responses 34 . In addition, this lincRNA was also reported to have a decreased expression level in brain, bladder and prostate cancers 22 . Another breast cancer survival associated lincRNA RP5-1198O20 (Ensembl ID ENSG00000230615) showed the up-regulation in both stages Scientific RepoRts | 6:37821 | DOI: 10.1038/srep37821 of this study. Re-examining Gibb et al. integrated SAGE-seq findings 22 , we confirmed that this lincRNA also showed an increased expression in breast cancers. Another study about transcriptome analysis of aging identified the down-regulation of this lincRNA 35 . Further functional investigation of the lincRNA RP5-1198O20 in either carcinogenesis or aging would be interesting. In another example, our data showed the down-regulation of the miRNA-145 host gene (MIR145 with Ensembl ID ENSG00000269936) in breast cancer, consistent with previous reports in breast cancer 36 and other cancer types 37,38 . However, several well-characterized lincRNAs 12,22,23,39,40 , including HOTAIR, H19, GAS5, PCA3, PVT1, were not investigated in this study. Those lncRNAs are not transcribed from intergenic regions, belonging to either anti-sense or other types of long non-coding transcripts, according to the GENCODE (version17) annotation. That means these lncRNAs are difficult to distinguish from their host protein-coding genes using the typical RNA-seq technology. There are several lncRNAs with no polyA tails 41,42 which are unmeasured in the current study as well. Therefore, more sophisticated methods, such as strand-specific and non-poly(A) tail RNA-seq technology, are required to distinguish anti-sense transcripts from protein-coding genes, and to comprehensively capture the lncRNA transcriptome.
Several lincRNAs show expression alteration in other cancers, including NEAT1 (Ensembl ID ENSG00000245532) down-regulated in retinoblastoma 22 , MALAT1 (Ensembl ID ENSG00000251562) up-regulated in lung and colorectal cancer 43,44 . However, these lincRNAs do not exhibit expression aberration in breast cancer. Likewise, the PCAT1 (Ensembl ID ENSG00000253438), a prostate-specific lincRNA regulating cell proliferation, shows over-expression in a subset of prostate cancers 45 . In our study, we almost did not detect the expression level of the PCAT1 in either breast normal or breast tumor samples. These results partially support the observation that many lincRNAs are expressed in a tissue-and cancer-type restricted manner, making them useful as prognostic markers 12 .
However, some limitations remain in this study. Firstly, no functional validations on these DE lincRNAs prevent us draw the further conclusion about how the aberrant expression of lincRNAs contribute to tumorigenesis. Secondly, the relative small sample size and the unavailability of breast cancer subtype data in the replication stage also prevent us replicate the subtype-specific lincRNAs. Finally, due to the tumor heterogeneity and the cell mixture in tumor tissues, global comparison of gene expression profiles among breast tumor, adjacent and normal tissues is insufficient, particularly for lincRNAs whose expression patterns broadly show the cell-or tissue-type specificity. Thus, single-cell or sorted cell population based transcriptomic analysis 46,47 will be favorable to determine these lincRNAs as robust biomarkers.

Conclusions
We identified a signature of lincRNA expression profile for breast cancer. Further functional surveys of these lincRNAs will be warranted to discover the biological mechanisms of lincRNA in breast cancer development and progression.

Materials and Methods
Discovery dataset. After approval, raw RNA-seq data (Level 1) of breast tumor tissues (N = 915) and adjacent normal breast tissues (N = 85) were acquired from the TCGA. Clinical data (Biotab format) for these 915 breast invasive carcinoma (BRCA) were also acquired from the TCGA.
Replication dataset. Our replication study consisted of 50 breast cancer cases and 23 healthy controls.
The patients were pathologically confirmed for primary breast cancer diagnosed at one of three hospitals in Indianapolis, Indiana, between 1998 and 2009: Indiana University Hospital, Wishard Hospital (now Eskenazi Hospital), and the Indiana University Simon Cancer Center (IUSCC). Controls were randomly selected from a pool of healthy women who donated normal breast tissues to the Susan G. Komen Tissue Bank between 2005 and 2009, and were free of breast cancer at the time of donation. The participants completed a questionnaire on medical histories and health-related exposures at the time of donation. Controls were frequency matched to cases based on self-reported ancestry and age (within 5 years).
Breast tissue (untreated tumor or normal) biospecimens were collected from each case and control under standard operating procedures at IUSCC. All breast tissue samples were snap-frozen immediately after removal and stored in liquid nitrogen until processed, and were determined to be of high quality through histological and molecular quality control tests. Tumor samples were pathologically verified for high tumor content. Information concerning demographics, clinical data, and personal risk factors, including age and race, were either extracted from medical records (for cases) or obtained through the questionnaires completed by the participants (for controls). Total RNA was extracted from freshly frozen breast tissue samples using the Qiagen ® miRNeasy Mini Kit.
Construction of cDNA libraries and subsequent RNA sequencing of paired-end reads (2 × 50 nt reads) were performed according to the standard Illumina protocol using the HiSeq2000 sequencing systems. The raw sequencing output was 25-35 million reads per sample.
Other dataset. The annotation data for known lincRNAs (n = 6,020) were extracted from the gencode.v17. long_noncoding_RNAs.gtf.gz file from the GENCODE database. Other data used in this study include: human Refseq genes obtained from the NCBI database; protein-DNA interactions from ChIP-seq data in MCF-7 breast cancer cell lines downloaded from ENCODE project 48 ; DNA binding by ERα using ChIP-seq in MCF-7 cells, and RNA-seq data across 14 breast cancer cell lines 49,50 downloaded from the Gene Expression Omnibus (GEO) database 51 .
RNA-seq data processing and lincRNA annotation. In the discovery stage, for the 85 pair of matched breast tumor and normal tissues, mapped reads in BAM format (Level 1) were assembled with Cufflinks (v2.1.1) 52 . The lincRNA annotation was conducted by the following procedures: 1) we retained assembled transcripts whose genomic loci are overlapped or imbedded with known annotated lincRNAs; 2) we removed transcripts in length of < 200 nt; 3) we eliminated transcripts showing coding potential (score < 0.5) predicted with iSeeRNA program (version 1.2) 53 ; 4) if two or more transcripts (isoforms) mapped to a lincRNA locus, we assigned the mean coverage and expression value to that lincRNA. We applied the same procedures for the remaining RNA-seq data, except for publically accessible RNA-seq data that were processed as described elsewhere 54 . To reduce the noise caused by the lincRNA expression variability between samples, we plotted the calling rate (occurrence of lincRNA transcription based on Cufflinks FPKM ≥ 0.3) versus lincRNAs ranked by missing rate across 170 samples. As Ramskold et al. proposed 55 , this cutoff (FPKM ≥ 0.3) is an optimized threshold for detectable expression above background. Then, we set the calling rate ≥ 20% as a threshold to ensure high-confidence lincRNAs for further analyses.
For RNA-seq data in the replication stage, quality control (QC) filtering was first performed on raw RNA-seq data to remove adapter sequences and poor quality bases using the FastqMCF clipping algorithm. RNA-seq reads were then mapped by Bowtie v1.0.0, to GENCODE lncRNA reference (release 17) for lncRNA (including lin-cRNA) annotations. Transcript abundances were quantified using NGSUtils. Samples were further filtered based on percentage of genes detected (less than 50%) and percentage of reads mapped to the reference (less than 25%). Extreme outliers were further identified and filtered from the dataset using principal component analysis (PCA). After these steps, a total of 7,450 lncRNAs retained and were used in further analyses.
ChIP-seq data processing. For ChIP-seq data from the ENCODE project, we directly downloaded files with called peaks for subsequent analyses. For two independent ERα ChIP-seq data whose processed peaks were unavailable, we downloaded the raw data in FASTQ format and conducted the peak calling as follows. Reads were mapped to the human reference genome (hg19) using Bowtie2 program 56 in the default parameters. Aligned data were processed and converted into BAM files using SAMtools program 57 . Then, we used the MACS14 (version 1.4.2) program 58 to call peaks in 20 bp resolution. We visualized the results in the UCSC Genome Browser.
Differential expression of lincRNAs. The fragments per kilobase of exon per million fragments mapped (FPKM) values were calculated from the Cufflinks program to represent lincRNA expression levels. The differential expression of each lincRNA between breast tumor and adjacent normal tissues was defined as: fold change (FC) ≥ 2 with Benjamini-Hochberg (BH) adjusted P < 0.01 based on non-parametric Wilcoxon rank sum paired test.
Following the permutation-based method developed by Xie et al. 59 , we estimated the FDR in identifying lin-cRNAs with differential expression. Briefly, FDR is estimated as , where d is a fixed threshold, Z i is a test statistic and B = 1000 permutations.
. The 1000 permutations were achieved by shuffling the tumor/normal samples in the lincRNA-sample matrix (584 lincR-NAs and 170 samples).
For RNA-seq data in the replications stage, differential expression (DE) analyses were performed using edgeR v2.6.12, implemented in the Bioconductor package to identify differentially expressed lncRNAs between tumor and adjacent normal, as well as tumor and normal breast tissue. The trimmed mean of M-values (TMM) method was used to normalize the raw read counts. Biological coefficients of variation between the samples were estimated using an empirical Bayes approach under the assumption that the data follows a negative binomial distribution. Differential expression between tumor and adjacent normal or normal breast tissue was analyzed using a generalized linear model to regress lncRNA (expression on tissue type, adjusting for age, race, and sequencing batch). Statistical significance was defined as FDR p-value < 0.05 and a two-fold change (FC) of expression level between comparison of tumor and adjacent normal or normal breast tissue.
Construction of transcription factor and lincRNA network. We downloaded ChIP-seq peak files of Scientific RepoRts | 6:37821 | DOI: 10.1038/srep37821 Co-expression analysis. We used similar methods for lincRNA annotation to annotate Refseq proteincoding genes and calculate their FPKM values. The neighboring protein-coding genes were defined as the ones with the closest physical distance to known lincRNAs. We coupled lincRNA-encoding locus and neighboring protein-coding genes, and referred to it as a lincRNAs-mRNA pair. We further classified lincRNAs-mRNA pairs into three types, according to their transcriptional orientation: head-to-head (H2H), head-to-tail (H2T) and tail-to-tail (T2T). The linear relationship of log2-transformed fold change in a comparison of 85 paired samples for lincRNA-mRNA pairs was used to evaluate the coordinated changes.
We defined the non-neighboring genes as the ones over a 1 Mb physical distance from lincRNAs on both strands. We coupled them as non-neighboring lincRNAs-mRNA pairs. Similarly, we determined the coordinated changes for non-neighboring lincRNA/mRNA pairs. We repeated the analysis using randomly selected 1,000 lincRNA-mRNA pairs (regardless of the distance) from the entire transcriptome data.
Functional prediction of lincRNAs. To predict biological functions for these DE lincRNAs, we calculated the correlation coefficients between DE lincRNAs and all Refseq protein-coding genes using the Spearman rank correlation analysis. We regarded the Spearman rank correlation coefficients calculated from randomly shuffling lincRNA/mRNA pairs (1000 times) as null distribution. Compared with the null distribution, we set a threshold for the Spearman rank correlation coefficient ≥ 0.4 (or ≤ − 0.4) ( Figure S6) to reflect the strong co-expression between lincRNAs and Refseq genes in high confidence. On this basis, a set of Refseq genes passing this threshold were regarded as the functional association to lincRNAs, and used for functional enrichment analysis using DAVID annotation 61 . GO terms with BH-adjusted P ≤ 0.05 served as functional enrichment for lincRNAs.

Survival analyses.
Excluding participants with unknown survival information (n = 16), the remaining 899 subjects were retained in survival analyses. We split patients into two groups (higher and lower expressions of lincRNA) based on the median level of lincRNA expression. The Kaplan-Meier curve and hazard ratio (HR) of higher versus lower expressed groups were generated in R (versions 2.15.0) using the survival package.
Ethical consent. Utilization of data was conducted in accordance with TCGA data access policies. Signed informed consent was obtained from each subject in the replication study. The study was approved by Indiana University institutional review board.