Breast cancer is a heterogeneous disease with a poorly defined genetic landscape, which poses a major challenge in diagnosis and treatment. By massively parallel mRNA sequencing, we obtained 1.2 billion reads from 17 individual human tissues belonging to TNBC, Non-TNBC, and HER2-positive breast cancers and defined their comprehensive digital transcriptome for the first time. Surprisingly, we identified a high number of novel and unannotated transcripts, revealing the global breast cancer transcriptomic adaptations. Comparative transcriptomic analyses elucidated differentially expressed transcripts between the three breast cancer groups, identifying several new modulators of breast cancer. Our study also identified common transcriptional regulatory elements, such as highly abundant primary transcripts, including osteonectin, RACK1, calnexin, calreticulin, FTL, and B2M, and “genomic hotspots” enriched in primary transcripts between the three groups. Thus, our study opens previously unexplored niches that could enable a better understanding of the disease and the development of potential intervention strategies.
Breast cancer is the leading cause of cancer death among women, accounting for 23% of the total cancer cases1. The major treatment challenge remains at the level of defining the specific types and associated biology behind the disease2,3,4. Breast cancer is known to be a heterogeneous disease with a variety of morphological features and clinical manifestations due to genetic, epigenetic, and transcriptomic alterations3,4,5,6,7. This phenotypic diversity severely affects the diagnosis and prognosis of breast cancer. The main difficulties in resolving these issues include the complexities of determining specific markers and the lack of a complete understanding of the cellular hierarchy of the mammary epithelium5,7,8,9,10,11. In addition, the remarkable variations in response to therapy12,13 also emphasise the pressing need for further understanding of breast cancer evolution, the genomic basis of heterogeneity, and the biological basis of this disease.
Numerous reports have demonstrated that the metastatic status, histological grade, tumour stage, size, and receptor expression are the main critical determinants of breast cancer treatment14,15,16,17,18. Seminal gene expression studies by Perou et al. (2000) and Sorlie et al. (2001) have established a classification of breast cancer into five broad “intrinsic phenotypic subtypes”19,20. These subtypes include Luminal A, Luminal B, Human Epidermal Growth Factor Receptor 2 (HER2)-positive, basal-like and normal breast-like breast cancers19,20,21,22. Correlating these subtypes with the traditional tumour histology provided a paradigm shift in breast cancer diagnostics. Furthermore, microarray investigations have offered an initial basis for treatment prediction22,23,24,25,26 and identification of the different breast tumour stages that are critical for breast cancer treatment27,28,11,29. However, translating molecular profiling into clinical practice has proven to be a formidable challenge as a result of complex heterogeneity30,31.
Immunohistochemically, three broad types of breast tumours have been classified by the status of therapeutically significant components, the Estrogen receptor ER, the progesterone-receptor (PR) and the HER23,32. Breast tumours lacking expression of all three receptors are defined as triple-negative breast cancer (TNBC)33,34,35,36. TNBC is often classified as basal-like breast cancer, which represents 10–25% of all tumours and is presumed to be derived from a distinct cell type and a specific developmental stage of mammary epithelial cell development19,22,34,36. In contrast, the gene expression profiles of HER2-positive (ER and PR negative) and Non-TNBC (positive for all three receptors) tumours belong to the luminal-like subgroups, representing approximately 15% of patients37,38. The main characteristics of TNBC are frequent occurrence in younger patients (<50), increased aggressiveness, significantly shorter survival periods, and higher recurrence rates compared with the Non-TNBC subtypes. These difficulties warrant an immediate and intensive focus on this difficult-to-treat type of breast cancer33,34,39.
To identify the precise genetic elements and study the exclusive nature of TNBC and the other two types of breast cancer, we employed a massively parallel in depth mRNA sequencing approach40,41. This global transcriptomic profiling can illustrate the intricate inner workings of the transcriptome at a very high resolution, enabling us to explore the distinct nature of these breast cancer subtypes, and provide a new inventory of diagnostic and therapeutic targets.
Comparative transcriptome analysis strategy
We aimed to compare the transcriptomic expression profile of TNBC, Non-TNBC and HER2-positive breast cancers. Accordingly, 17 individual human breast cancer tissues, including six TNBC samples, six Non-TNBC samples and five HER2-positive samples, were chosen. The mRNA sequencing of the samples was performed using the Illumina platform, generating a total 1.2 billion high quality raw reads (Supplementary Method, Figure 1A). In addition, we used publicly available 50-base pair paired-end single normal breast tissue RNA sequence data (http://www.ncbi.nlm.nih.gov/sra) for comparisons when applicable. The reads were aligned against the Ensembl GRCh37.62 B (hg19) reference genome using TopHat42. The reference genome-guided transcript assembly of the aligned reads was performed using cufflinks, a well-established transcript assembler43. Furthermore, all subsequent analyses were conducted using only the transcripts that were identical to the reference identified by cuffcompare. The transcripts were binned according to their abundance, and the overall relative abundances of the transcripts that were expressed in all three breast cancer types on each chromosome were compared.
From the assembled transcripts and their abundances, the following statistical and secondary analyses were performed. First, the correlation between the samples was determined using Principle Component Analysis (PCA), unsupervised hierarchical clustering and Spearman's correlation. Second, the transcripts that were differentially expressed between the three breast cancer groups and the pairwise comparison of all 17 individual samples were identified using univariate Fisher's exact test (F-test) and cuffdiff, respectively. The cuffdiff analysis employs various parametric approaches, such as the normal and negative binomial distributions. The F-test was conducted using the statistical package, R. Third, the transcriptional and post-transcriptional changes were compared in each breast cancer type. In this analysis, we focused on two critical transcriptional elements, which included the highly abundant primary transcripts and the genomic loci that comprised the highest number of primary transcripts, i.e., the genomic hotspots. Finally, we also investigated the significance of the predominant intronic reads found in all three types of breast cancer. Together, these analyses allowed us to compare the transcriptomic and post-transcriptomic profiles of TNBC, Non-TNBC and HER2-positive breast cancer.
Generation of the TNBC, Non-TNBC and HER2-positive breast cancer digital transcriptome
Among the 1.2 billion high quality raw reads, 71.03% of the reads were mapped to the human genome (Figure 1B) (Supplementary Tables 1 and 2). The distribution of these mRNA reads indicated that most reads (58%) mapped to exons. However, intriguingly, 34.6% of the reads mapped to intronic regions while 6.4% and 2.8% mapped to intergenic and junction regions, respectively (Figure 1C) (Supplementary Table 3). In each sample, the aligned reads supported an average of 80,279 transcripts (p value and FDR less than 0.05) that were identical to previously annotated transcripts (Supplementary Table 4, the transcript reassembly of HSP901B is shown as an example in Supplementary Figure 1).
Interestingly, an average of 16,245 distinct genes per sample were detected, indicating the diversity introduced by isoforms of various genes (Supplementary Table 4). From the total of 17 sequenced breast cancer samples, 1,364,752 known transcripts were identified, comprising 5,226,535 and 11,761,893 unique and total exons, respectively (Figure 1D, Figure 2A, 2B and 2C, Supplementary Table 4, Supplementary Figure 2). The transcript abundance was calculated by estimating the fragments per kilobase of exon per million mapped fragments (FPKM)43, and all expressed transcripts were binned on the basis of their abundance (FPKM). This analysis revealed that the majority of the assembled transcripts were low in abundance, i.e., below 1 FPKM (Supplementary Method, Supplement Table 5, and Supplementary Figure 3–8). In parallel, the transcripts from all the 17 breast cancer samples were also reassembled using another reference genome (UCSC hg19), and similar transcript expression trends as described above were observed (Supplementary Table 4B).
The overall abundances (FPKM) of commonly expressed transcripts on each chromosome were compared between the three groups. The TNBC group exhibited higher transcript expression levels on chromosome 6 compared with the Non-TNBC and HER2-positive breast cancer types (Figure 2B, Supplementary Figure 9). Frequent gains in chromosome 6p have been shown to be associated with poor prognosis in several cancers44, including invasive ductal carcinoma45,46. In general, the transcript abundance was higher in TNBC than in the other two groups. However, the overall number of transcripts and their expression profile in all three cancers followed a similar trend (Supplementary Figure 2).
Clustering based on transcript expression profile
We examined the correlation between the samples based on transcript and gene expression using Principle Component Analysis. Similar to initial immunohistochemistry and q-PCR classification, most of the samples from each breast cancer group clustered together in our analysis, except for a few outliers (sample A3 in TNBC, sample B4 in Non-TNBC, and samples C1 and C5 in HER2 group) (Figure 2C). To further explore the pairwise relationship, we calculated the Spearman's correlation, which ranks and quantifies the degree of similarity between each pair of samples (Supplementary Table 6). The results indicated variation between individual samples but no distinct outliers (Figure 2D). However, the transcript expression-based correlation showed more variation within the groups than the gene expression-based correlation (Supplementary Table 6 and Supplementary Figure 10A). In agreement with the PCA analysis, the average linkage distance matrix analyses and pairwise cuffdiff analysis also highlighted three distinct groups and similar outliers, i.e., A3, B4 (and also B6) and C5 (Supplementary Table 7, Supplementary Figure 10B and 11, Supplementary Method43). Together, these analyses illustrate the variation in individual transcript expression levels, which is possibly due to a large number of low abundance transcripts in all of the samples, a phenomenon generally observed in RNA sequencing data studies47.
Defining the transcriptomic signatures of TNBC, Non-TNBC and HER2-positive breast cancer
To determine the differentially expressed genes and transcripts between the three breast cancer groups, univariate F-tests were performed (Supplementary Method, Figures 3, 3A and, Supplementary Figures 12 to 15). We identified 2617 transcripts that were differentially expressed between the TNBC and Non-TNBC groups (Figure 3B) (Supporting Files 1 and 2). Among the identified transcripts, 962 transcripts exhibited higher abundance (upregulated) in the TNBC group and 1655 manifested lower relative abundance (downregulated) in the non-TNBC group. When the TNBC group was compared with the HER2-positive group (Figure 3C), 3087 transcripts were identified as differentially expressed (Supporting Files 3 and 4). Among these transcripts, 2416 were upregulated in TNBC. The Non-TNBC and the HER2-positive group comparison revealed 2971 transcripts (Figure 3C, Supporting Files 5 and 6) that were differentially expressed, and 2509 transcripts were upregulated in Non-TNBC. The comparative analyses of these three breast cancer subtypes at the isoform and gene level (Supplementary Figure 15, Supporting Files 7 to 9) provide a complete view of the global transcriptomic changes between the three groups. The distribution and abundance of the top one hundred (determined by p value) differentially expressed transcripts from the aforementioned three comparisons are shown as stacked histograms using Circos plots48 (Figures 3D, 3E and 3F).
Furthermore, the differential transcript expressions that occurred in more than one comparison were separated from the “bona-fide” transcripts that were specific to one comparison, i.e., the transcripts specific for TNBC vs. Non-TNBC (1000 transcripts, 945 genes), TNBC vs. HER2-positive (1316 transcripts, 885 genes) and Non-TNBC vs. HER2-positive (1011 transcripts, 763 genes) (Supporting Files 10–12). Interestingly, 78 transcripts, including oncogenic Ras family member, RAB21, and TMEM219, were expressed differentially in all three comparisons and consequently qualified to be common breast cancer modulators (Supporting File 13).
We next compared the new transcript atlases between the three groups with the previously reported breast cancer-associated targets derived from Online Mendelian Inheritance in Man (OMIM) database. About 3 to 6% of the known targets were represented in our dataset, highlighting new isoforms that were identified in this study (Supporting Excel File 14). Notably, the reduced expression of all three receptors in the TNBC group and high abundance of the three isoforms in the HER2 group were observed in all five samples of HER2-positive cancer samples (Figures 4A, 4B and 4C). These observations indicate the accuracy of the mRNA sequencing-based transcript reconstruction and abundance calculation.
Common transcriptional regulatory elements in TNBC, Non-TNBC and HER2-positive breast cancer
Several critical regulatory elements, such as splicing, and different promoter usage have been established to be the major components that contribute to transcriptional dynamics49. From the assembled transcripts and their abundances, the transcriptional and post-transcriptional dynamics of the genes can be deciphered. The difference in the expression of various isoforms of a given gene reflects the post-transcriptional regulation, whereas the relative abundances of the primary transcripts and the genomic loci comprising the highest number of primary transcripts indicate transcriptional regulatory elements. The comprehensive transcriptome analysis using cuffdiff allowed us to estimate the number of primary transcripts in each breast cancer group. The primary transcript group in this study was defined as the group of transcripts that share the same Transcription Start Site (TSS). Typically, each TSS group comprised several specific isoforms that originated from a particular TSS. The TSS group relative abundances were estimated from the sum of the FPKMs of all of the isoforms belonging to an individual primary transcript. We first investigated the common high confidence primary transcripts and their abundances and associated genes in each breast cancer group (Supporting Files 19 to 21).
Surprisingly, the six most abundant primary transcripts were common among all three breast cancer groups (Figure 4D). These primary transcript pools appear highly active in generating abundant isoforms in breast cancer. These transcripts encode genes including secreted protein acidic and rich in cysteine (osteonectin, SPARC), guanine nucleotide binding protein beta polypeptide 2-like 1 (RACK1 or GNB2L), calnexin (CANX), ferritin L subunit (FTL), calreticulin (CALR) and beta-2 microglobulin (B2M). In each primary transcript group, only selected isoforms were preferentially expressed at high abundance, possibly because of their functional role. The top six primary transcripts have been reported to be critical players in breast cancer because of their roles in extracellular matrix remodelling and cell motility. For instance, the cell surface proteins SPARC, RACK1, B2M, and FTL are well associated with breast cancer50,51,52,53. In addition, B2M, FTL, and RACK1 are reported to be tumour prognostic markers52,53,54. Calnexin and calreticulin, which promote the correct folding of proteins that enter the secretory pathway, have also been linked to the breast cancer metastatic phenotype55. More interestingly, the highly abundant primary transcripts identified in normal breast tissue (DNAJB1, SCGB2A2, MUCL1, SCGB1D2 and ACTG1) were different from those in the three breast cancer groups (Supporting File 22). Thus, our study identifies a primary transcript group that is breast cancer-specific and could therefore provide potential common markers upon validation in a larger sample set.
Furthermore, the variation in abundances of isoforms originating from the same primary transcript is exemplified by the primary SPARC transcripts. All of the eight known SPARC isoforms that were expressed in the breast cancer samples (Figure 4E) originate from five different primary transcripts (Figure 4F shown as TSSI to TSSV). Interestingly, the highly abundant transcript from the primary transcript group I of SPARC, ENST00000520687, is defined as a non-expressed protein in the hg19 reference genome, but the abundance of that specific isoform was high in our samples of all three breast cancer types (Figures 4E and 4F). In primary transcript group III, the SPARC splice variant 5 (ENST00000521569) expresses 79% to 86% whereas SPARC splice variant 8 (ENST00000538026) expresses 13.8 to 20% out of total SPARC gene expression in all three breast cancers. These results reveal that the ENST00000521569 isoform of SPARC is most highly expressed in all three breast cancers.
Furthermore, some highly abundant primary transcript pools were unique to each of the breast cancer subtypes (Figure 4D). For example, the primary transcript abundance of apolipoprotein E (APOE) was high in the TNBC and Non-TNBC groups but not in the HER2-positive group. In the case of the HER2-positive group, primary transcripts for protein phosphatase 1B (PP1B) and ornithine decarboxylase antizyme 1 (OAZ1) were highly abundant and unique to this group compared with the other two groups. From this primary transcript analysis, we could unravel the transcriptional dynamics to the finer details, i.e., the expression and significance of the specific primary transcript pools and the precise isoforms that are abundant in specific breast cancer type.
“Genomic Hotspots”: the genomic loci enriched in primary transcripts are identical in all three groups
Next, we searched for the genomic loci that are enriched in primary transcripts, i.e. the “genomic hotspots”, in each type of breast cancer analysed in this study. These genomic regions containing highly abundant primary transcripts could steer the cellular machinery towards oncogenic processes through predominant splicing. Surprisingly, the top eight “genomic hotspots” that produced the highest number of primary transcripts are common in TNBC, Non-TNBC and HER2-positive breast cancers. They are presented in Figure 5A, along with the number of protein-coding genes that are encoded from one of these loci (Supporting Files 23, 24 and 25 comprising the top 20 genomic hotspots). For example, the genomic locus that has the highest primary transcript enrichment, Chr5: 139781398–140099052, encodes 12 genes (Figure 5B), including HRAS, HRAS2, SLC35A4 and NDUFA2. The expression levels of these genes, calculated as the sum of all of the isoforms associated with the genes in all 17 samples, are presented in Figure 5C. These findings clearly illustrate that these loci are highly transcriptionally active and participate in robust splicing to produce several isoforms. These loci are strikingly different from the normal breast tissue (Supporting File 26).
Novel intronic predominance highlights common new transcripts in TNBC, Non-TNBC and HER2-positive breast cancer
Apart from the known transcripts, we also investigated the significance of intronic reads found in all three types of breast cancer. We compared the read distribution in publicly available single normal tissue against all 17 breast cancer samples. This comparison allowed us to investigate whether the phenomena were found in the breast tissue or specific to the cancer samples. In the normal breast tissue sample, 80% of the reads mapped to exons while 16% and 0.1% of reads corresponded to introns and intergenic regions, respectively. These results indicate that the intronic predominance was specific to the cancer samples, possibly because of intron retention and splicing modifications. Further, close inspection of this phenomenon revealed the expression of intronic regions in several annotated transcripts. The normal breast tissue comprises 10,337 defined exons in the place of introns in known transcripts (high confident transcripts with FPKM above 0.1). In contrast, an average of 96,215 exons, i.e., ∼10-fold more novel exons, were found within introns in the cancer tissue samples (Supplementary Table 8). Furthermore, the number of annotated transcripts comprising these exons in normal tissue was 1211. In contrast, an average of 15,142 known transcripts were found to encompass exons in the place of introns in cancer tissues. When we compared these novel transcripts against the EST (Expressed Sequence Tags) database, we found that about 93–97% of the transcripts mapped to similar locations as reported for ESTs. Therefore, these exons may be expressed in the three breast cancer types. Moreover, within each breast cancer group, there were a few hundred common transcripts (Supplementary Table 8 and Supporting Files 15 to 17) that underwent such changes as well as 78 common transcripts in all three breast cancer groups (Supporting File 18). Given these described predominant modifications in several transcripts, it is not surprising that we found a high intronic read count in these samples when compared with hg19.
Novel intergenic unannotated transcripts in the breast cancer transcriptome
In addition to introns, the intergenic region also appears to comprise an average of 24,540 exons that could assemble into high confidence, previously unknown transcripts in all of the breast cancer samples. Several (average 723) unannotated transcripts (with FPKM above 0.1) could be assembled from reads that were recovered from the intergenic region of each sample, which indicates the existence of possible new transcripts in breast cancer conditions (Supplementary Table 8). Notably, some transcripts comprised more than one exon, but the singletons were the predominantly identified transcripts (Supplementary Table 9). A similar phenomenon was found in normal breast tissue. Therefore, these previously unknown transcriptional changes identified in this study reveal several new elements that occur specifically in breast cancer. However, future mRNA sequencing of several normal breast and other tissue samples are needed to establish the breast tissue specificity of the alterations of these newly discovered transcripts.
Identifying common critical pathways in breast cancer
The pathway analysis was performed using the identified differentially expressed transcripts obtained from the TNBC vs. Non-TNBC, TNBC vs. HER2-positive and Non-TNBC vs. HER2-positive comparisons. In all three pairwise comparisons, cell motility, development, apoptosis and cancer-related signalling pathways are influenced, that highlights the common critical functions that are “hijacked” in breast cancer (Supplementary Figures 16 to 21). Moreover, the differentially expressed transcripts identified in the TNBC vs. Non-TNBC comparison are also involved in small GTPase-mediated signal transduction, proteosomal ubiquitin-dependent protein catabolism, negative regulation of apoptosis, cytoskeleton protein binding and oxidoreductase activity (Supplementary Figures 16 and 17). The transcripts that were differentially expressed in the Non-TNBC vs. HER2-positive comparison appear to affect the regulation of mammary gland and epithelial proliferation, apoptotic mitochondrial changes, mitotic cell cycle, cell division and enzymatic activity (Supplementary Figure 18). Furthermore, canonical pathways, such as oncostatin M signalling, PI3K/MAPK signalling and functional pathways involving cancer, cell cycle and cell death, are influenced by all three groups of differentially expressed transcripts (TNBC vs. Non-TNBC, TNBC vs. HER2-positive and Non-TNBC vs. HER2-positive) (Supplementary Figures 19, 20 and 21). In contrast, oestrogen receptor signalling, cellular growth and modification are specifically modulated by transcripts that arise from the Non-TNBC vs. HER2-positive differential expression comparison (Supplementary Figure 21). In summary, this analysis highlights the prevalence of breast cancer signalling-associated pathways that are commonly influenced in all three subtypes analysed here.
qRT-PCR validation of the mRNA sequencing analysis
The mRNA sequence analysis was validated by experimental confirmation of the levels of differentially expressed transcripts in TNBC, Non-TNBC and HER2-positive samples. We used qRT-PCR to validate the expression levels of 28 transcripts (randomly selected across all the three bins shown in Supplementary Figures 3 to 8). The transcripts were selected from pairwise differential expression comparisons of all of the samples of the relevant group (Supplementary Methods). We observed a high correlation and a similar expression trend between the mRNA-sequence based abundance estimation (FPKM) and qRT-PCR assays revealing the accuracy of the analysis (Figure 5C and Supporting File with primers 27, Supporting File 28 shows individual RT qPCR data of all the transcripts that were validated).
For the first time, we report the transcriptional and post-transcriptional profiling of TNBC, Non-TNBC and HER2-positive breast cancers using a deep mRNA sequencing approach (35, 36). In breast cancer, microarrays have been extensively used for molecular subtyping and identification of breast cancer-specific gene signatures. Although mRNA sequencing covers the total transcript expression compared with the preselected microarray approach, which focuses on specific gene sets, we compared our mRNA sequencing-based differentially expressed transcripts against the previously reported gene signatures of TNBC, Non-TNBC and HER2-positive breast cancer. When the signature gene lists from Perou et al., 200019, Sorlie et al., 200120,56, Hu et al., 200657 and Parker et al., 200926 were compared, only few overlapping genes were identified (Supplementary Method and Supplementary Table 12), because these studies classify the samples on the basis of five intrinsic subtypes and identify genes that are modulated in all the five subtypes from a preselected gene set. In contrast, our mRNA sequencing captures all the expressed transcripts which are ∼75,000 transcripts per sample compared to the preselected gene sets of microarray platform. Moreover, this study classifies the samples on the basis of receptor status. Therefore the differentially expressing transcripts from this study will be different to the previously identified gene signatures of microarray. However, we performed parallel comparative microarray analyses using publicly available datasets for which the receptor status can be derived (Supplementary Method and Supplementary Table 10, 11 and Supplementary Figure 22). The results (Supplementary Table 11 and Supporting file 29 to 31) show several genes overlapping and 90 to 100% agreement in the expression trend of genes that were identified in common in the abovementioned pairwise comparisons (e.g., TNBC vs. Non-TNBC) in the microarray and mRNA sequencing. Although mRNA sequencing identifies previously reported breast cancer modulators as well as several new players, further many larger scale mRNA sequencing studies will be needed to draw firm conclusions on mRNA-seq based gene signatures.
Compared with the microarray approach, the emerging mRNA sequencing-centred technology41,58,59 allows us to de-convolute the transcriptional and post-transcriptional elements of breast cancer at a higher resolution. The fundamental theme that emerges from this detailed mRNA-Seq analysis is the similarity observed at the transcriptional level (i.e., primary transcripts and abundantly splicing genomic hotspots), which differs from the heterogeneity in post-transcriptional isoform levels. This heterogeneity appears to be mainly due to post-transcriptional regulation that fine-tunes the splicing and abundances of various isoforms specific for each cancer. This notion is supported by the identified primary transcript pool of SPARC, GNB2L1, CANX, CALR and B2M, which have been proven to be significant players in breast cancer onset and progression by modulating cell migration, motility, anti-apoptosis and cellular stress management50,60,61,62. In fact, two of the highly abundant primary transcripts, B2M and FTL, have also been confirmed to be markers of cancer progression and thus raise new therapeutic possibilities that could arise from this study. To achieve the final frontier of using the transcriptomic data in clinical settings, it will also be important to design a prospective analysis involving a larger series of patient cohorts.
Human Patient Samples
Dr. Suzanne Fuqua (Baylor College of Medicine) provided the human breast cancer tissue RNA samples. All of the human samples were used in accordance with the IRB procedures of Baylor College of Medicine. The breast tumour types, TNBC, Non-TNBC and HER2-positive, were classified on the basis of immunohistochemical and RT-qPCR classification (data not shown).
Illumina Genome sequencing RNA sequencing library preparation
Enrichment of rRNA-free transcriptome RNA: Whole transcriptome RNA was extracted from total RNA by removing large and small ribosomal RNA (rRNA) using RiboMinus Eukaryote Kit (Invitrogen, Carlsbad, CA). Five micrograms of total RNA was hybridised to rRNA-specific biotin labelled probes at 70°C for 5 minutes. The rRNA-probe complexes were then removed by streptavidin-coated magnetic beads. The rRNA-free transcriptome RNA was concentrated by ethanol precipitation.
cDNA synthesis and DNA library construction from transcriptome RNA
cDNA was synthesised from the isolated RNAas described in the supplementary method. Double-stranded cDNA was treated with a mix of T4 DNA polymerase, Klenow large fragment and T4 polynucleotide kinase to create blunt-ended DNA, to which a single A base was subsequently added at the 3′ end using Klenow fragment (3′ to 5′ exo-) and dATP. The A-tailed DNA was ligated with paired end adaptors using T4 DNA ligase, provided by the Illumina RNA-Seq kit (Illumina, San Diego, CA). Size selection (200 base pair) of the adaptor ligated DNA was performed by cutting the target fragment out of a 4–12% acrylamide gel. The amplified DNA library containing ideal fragment sizes was obtained by in-gel PCR using the Phusion High-Fidelity system (New England Biolabs). Each library that was prepared was sequenced, and image analysis and base calling were performed with Illumina pipeline version 1.3.2.
Read alignment and transcript assembly
We have aligned the pair end reads using TopHat version 1.1.442, allowing two mismatches in the alignment. The aligned reads were assembled into transcripts using cufflinks43. The alignment quality and distribution of the reads were estimated using SAMtools63, and the PCA plot and sample clustering were performed using (imported genes and transcript expression profiles generated from cufflinks assembled transcripts that were identical to the reference) the Avadis NGS program.
The cuffcompare program was used to identify the transcripts that were identical to the reference human genome for further analysis. To examine differential expression of genes/transcripts between the three groups of samples belonging to the three different subtypes of breast cancer, we first performed cuffcompare and cuffdiff43.
The high confidence transcripts associated with FPKMs above 0.01 in all of the samples were isolated. Next, we calculated the average expression of an individual gene/transcript from the FPKMs obtained using cufflinks. The samples were arranged in groups without preference to the order within the group. The geometric averages of the FPKMs were calculated for each group of samples. The values were not calculated using a reference. Next, univariate F-tests were performed (after log2 conversion of the transcript expression abundance, FPKM) between the three groups, using a p value cut off of 0.05 a false discovery rate below 0.05, with statistical package R.
We acknowledge Krishna S Ghanta for technical assistance and Dr. Andriy Morgun for fruitful discussions and critical reading of the manuscript.
About this article
Nature Communications (2019)