xMD-miRNA-seq to generate near in vivo miRNA expression estimates in colon epithelial cells

Accurate, RNA-seq based, microRNA (miRNA) expression estimates from primary cells have recently been described. However, this in vitro data is mainly obtained from cell culture, which is known to alter cell maturity/differentiation status, significantly changing miRNA levels. What is needed is a robust method to obtain in vivo miRNA expression values directly from cells. We introduce expression microdissection miRNA small RNA sequencing (xMD-miRNA-seq), a method to isolate cells directly from formalin fixed paraffin-embedded (FFPE) tissues. xMD-miRNA-seq is a low-cost, high-throughput, immunohistochemistry-based method to capture any cell type of interest. As a proof-of-concept, we isolated colon epithelial cells from two specimens and performed low-input small RNA-seq. We generated up to 600,000 miRNA reads from the samples. Isolated epithelial cells, had abundant epithelial-enriched miRNA expression (miR-192; miR-194; miR-200b; miR-200c; miR-215; miR-375) and overall similar miRNA expression patterns to other epithelial cell populations (colonic enteroids and flow-isolated colon epithelium). xMD-derived epithelial cells were generally not contaminated by other adjacent cells of the colon as noted by t-SNE analysis. xMD-miRNA-seq allows for simple, economical, and efficient identification of cell-specific miRNA expression estimates. Further development will enhance rapid identification of cell-specific miRNA expression estimates in health and disease for nearly any cell type using archival FFPE material.

SCIenTIfIC RePORTS | (2018) 8:9783 | DOI: 10.1038/s41598-018-28198-z Cellular composition of tissues is highly variable between samples, even when all samples share the same phenotype 12 . A robust way to deconvolute a tissue is to utilize an expression matrix of each composite cell type to computationally separate the tissue into each individual cell type 13,14 . For that purpose, expression estimates must closely hew to in vivo data. We have noted that often cell-culture based expression estimates fail in this capacity. For example, the reads per million miRNA reads (RPM) value of miR-200c, an epithelial cell specific miRNA, was ~60,000 RPM in multiple human bladder samples. In the bladder, the only native epithelial cell type, representing ~20-80% of a bladder biopsy, is the urothelial cell. However, urothelial cells grown in culture demonstrate a miR-200c value of only 5,000 RPM. It is difficult to reconcile this difference other than to acknowledge that this miRNA, associated with a mature cell phenotype, is greatly reduced in a cell-culture sample 15 .
To overcome this problem, there is a need for methods to capture in vivo cell expression miRNA estimates in a robust and cost-effective manner. Excellent methods to obtain cells directly from tissues exist, but each has limitations. Laser-capture microdissection is expensive, tedious, and can only capture sufficient numbers of a particular cell type if they form large structures (ex. glands); otherwise the background contamination of neighboring cells is rate-limiting 16 . Flow capture and magnetic bead separation are useful for tissues that easily dissociate (ex. blood, but not heart), but these methods are also limited by the widely variable miRNA expression that can occur as a result of methodologic manipulation 4,15,17 . Single-cell sequencing has great promise, however current methodologies are limited for miRNAs due to cost, and depth of sequencing per cell 18 .
We have previously utilized expression microdissection (xMD) to isolate prostate stroma and epithelium and assay miRNA by droplet digital PCR (ddPCR) 19 . That study led us to hypothesize we could obtain adequate RNA yields for a global survey of miRNA levels by small RNA-sequencing (RNA-seq). We now introduce xMD-miRNA-seq, a method to obtain nearly in vivo miRNA expression estimates from any cell type directly from formalin-fixed paraffin-embedded (FFPE) tissues by utilizing expression microdissection 20 . We demonstrate this technique as an efficient, robust and cost-effective method for generating deep sequencing data to provide accurate miRNA expression data from cells.

Successful isolation of epithelial cells by xMD.
We performed xMD on six slides from two normal colon samples, performing immunohistochemistry (IHC) for the epithelial-specific cytokeratin marker AE1/ AE3. The xMD method yielded a highly enriched population of colonic epithelial cells (Fig. 1), with an overall capture rate of approximately 20% of the stained cells onto the ethylene vinyl acetate (EVA) membrane. Total RNA extraction and purification was performed on the cells annealed to the EVA membranes yielding 266 (13.3 ng/μl) and 374 (18.7 ng/μl) ng of total RNA respectively for the two samples. These RNA samples were split into two technical replicates for small RNA-seq.
xMD-miRNA-seq results. Small RNA-seq libraries were generated from 50 ng of total RNA in duplicate for both samples using the BIOO Scientific NEXTflex small RNA library preparation kit V3 for Illumina and the Perkin Elmer Sciclone G3. Final library concentrations measured by a ThermoFisher Qubit Fluorometer 2.0 were 0.26 ng/μl and 0.24 ng/μl for the replicates of the first sample and 0.22 ng/μl and 0.25 ng/μl for the replicates of the second sample. Given these low concentrations, the libraries were sequenced with high depth on an Illumina HiSeq 3000. Adapters and random base linkers were removed with Cutadapt version 1.14 and trimmed FASTQ files were processed using both miRge and a miRBase v21 miRNA library and miRge 2.0 with a MirGeneDB library of miRNAs [21][22][23][24] . Using the more restrictive MirGeneDB dataset, we identified between 260 and 321 miR-NAs for each sample. Setting the minimum threshold to 10 RPM for a given miRNA, the range was 236-285 miRNAs per sample, with 203 miRNAs consistently above 10 RPM. The number of miRNA reads varied between 172,678 and 616,921 (Table 1). As a percent of all reads, these yields were low (0.54-0.65% of total reads); requiring deep sequencing of each sample with the current method. Within the sequencing reads, both rRNA sequences and tRNA halves and fragments were abundant accounting for approximately 30-45% and 10-15% of reads respectively.
A number of technical factors were investigated to better understand the variability in xMD-miRNA-Seq. Technical duplicates of the four samples showed similar expression patterns with Pearson correlation r values of log 2 adjusted data of 0.83 and 0.88 respectively (Supplementary Figure 1A). The length of xMD-derived miRNAs were similar to endothelial (SRR5127228) and fibroblast (SRR5127231) primary cells grown in culture although with more shorter reads; whereas epithelial cells that were flow-collected (SRR5127219) had a much higher abundance of shorter miRNA reads (Supplementary Figure 1B). We then assessed which miRNAs varied the most between technical replicates. Using only miRNAs with a minimum average RPM >1000 (N = 90) we noted only sixteen miRNAs had >4 fold change difference. Not surprisingly the fold change differences were highest in samples with the lowest RPM values, with two miRNAs, miR-532-5p and miR-425-5p being the most variable (Supplementary Figure 1C). Finally, we investigated the coefficient of variation across these 90 miRNAs finding a trend for a higher coefficient of variation for the lower abundance miRNAs. However, no miRNAs were particular outliers (Supplementary Figure 1D). Due to the lack of significant outliers, no structural features such as %GC or positional nucleotide usage that may vary and affect miRNA levels were noted 25 .
A highly enriched epithelial signal. Each sample was investigated for its most abundant miRNAs (Table 2). On average, the most abundant miRNA in each sample was the combined let-7a-5p/let-7c-5p miRNA. Following those were five well-known epithelial specific/enriched miRNAs (miR-215-5p/192-5p; miR-375; miR-200c-3p; miR-200b-3p; miR-194) in the top twenty most abundant miRNAs. Five of these miRNAs are generic epithelial markers, but miR-375 is specifically enriched in colonic epithelium in addition to being the most highly expressed miRNA in pancreatic islet cells 26 . Two miRNAs likely represent different sources of contamination. miR-143-3p, which is a highly-abundant mesenchyme-enriched miRNA, has generally been shown to have low 4 SCIenTIfIC RePORTS | (2018) 8:9783 | DOI:10.1038/s41598-018-28198-z to absent epithelial expression 27 . We surmise this is a contaminant resulting from capturing some fibroblasts or smooth muscle cells on the EVA membranes. miR-9-5p is a neural enriched miRNA with trivial expression in colon 4 . This signal may have been the result of differences in NEXTflex and Illumina TruSeq library preparation kits that favor higher counting of this miRNA in the NEXTflex system or contamination from library preparation with brain samples that were concurrently sequenced, as has been described 28 . The remaining miRNAs are well-known and ubiquitously expressed across many cell types.

Similar expression patterns among epithelial cells of different origins.
Having established that a general enrichment for epithelial miRNAs occurred in the xMD-derived data, we then evaluated how similar/ dissimilar our xMD-miRNA-seq derived epithelial cell miRNA expression profiles were to a colonic enteroid (colonoid) and a flow-sorted epithelial cell small RNA-seq sample. In addition, these three epithelial cell types were compared to a fibroblast, endothelial cell, lymphocyte, red blood cell and macrophage miRNA RNA-seq expression set that served as an outgroup. All of these cell types were present in the colon samples we performed xMD on and could represent potential contaminants. We selected seven known epithelial cell-enriched miRNAs and eight miRNAs that are both abundant and enriched in these non-epithelial cell types. As seen, the epithelial cell data generated from all three different cell selection methods show similar estimates for the selected miRNAs, with abundant expression of known epithelial-cell-enriched miRNAs and a paucity of non-epithelial cell-enriched miRNAs (Fig. 2).

xMD-miRNA-Seq derived epithelial cells cluster with other epithelial cells.
We then explored how these epithelial xMD-miRNA-seq cells would cluster among many other primary cell types, based on the small RNA-seq data. We utilized 165 primary cell small RNA-seq datasets and these four xMD-miRNA-seq samples. All samples were processed similarly in miRge and RUV normalized to correct for batch effects. Using t-SNE, we observed that cells clustered by general cell class (epithelial, mesenchymal, neural, inflammatory, red blood cell and platelet) with the xMD-miRNA-Seq derived epithelial cells clustered adjacent to other epithelial cells (Fig. 3).

Discussion
We introduce xMD-miRNA-seq as a method to obtain robust, cell-specific miRNA expression data from FFPE tissues. Using this method, we were able to capture a near in vivo signal of colonic epithelial cells from two samples. Their expression levels were consistent with miRNA expression obtained from two additional primary epithelial cell sources. Our xMD strategy successfully collected epithelial cells from the mixed colon cell population resulting in a high enrichment of epithelial cell miRNA signals with only minimal contamination from other cells of the colon. This method has significant advantages over current options. xMD is cost-effective primarily due to extremely low upfront machine costs with mainly routine immunohistochemistry costs to isolate the cells of interest. It requires little expertise and is easily set up. Although IHC can be challenging, particularly for exotic proteins, that is not necessarily true for xMD. Well-established antibodies to perform IHC exist for most cell types, as noted at the Human Protein Atlas (proteinatlas.org) 29 . These antibodies are all that is needed to isolate specific cell types, although antibodies to more exotic proteins could further subdivide cell types (ex. FOXP3). Beyond the xMD isolation steps, the other aspects of the protocol employ routine commercially-available reagents. Thus, a high level of consistency should exist between users of the method.
While an exciting proof-of-concept has been achieved, there is room for more development and improvement across the technique. We continue to maximize the enrichment of cellular material from the FFPE slide and minimize contamination during the xMD step. Whereas six slides were used to generate the starting material for miRNA-seq, from a cost-perspective, it would be ideal to reduce that to a single slide. We also note that RNA was degraded during the IHC steps due to the introduction of RNAses in the antibody reagents and high temperature antigen retrieval (HTAR). To overcome this, we are developing methods to reduce this loss. Finally, we utilized a new NEXTflex RNA-seq library preparation method, which worked well in obtaining appropriate relative miRNA levels. However, the levels of rRNA and tRNA fragments dominated the specimens, likely due to RNA degradation and a known challenge of FFPE tissues 30 . The percent miRNAs in any RNA-seq method should be 10% or higher, or else very deep sequencing, as performed here, will be required to obtain sufficient coverage. Additional steps to deplete rRNAs and tRNAs may be necessary to increase the miRNA yield for each sample, and thus allow for the use of this method with lower sequencing depth to reduce costs. We do note that generating 31 million+ reads/sample indicates that the amount of RNA obtained was not rate limiting, despite current inefficiencies of   the method. This data highlights one of the main challenges in small RNA-seq: cross methodologic differences in sequencing libraries 31 . The Illumina TruSeq small RNA library preparation is the most commonly used method and the source of most primary cell miRNA RNA-seq data in the literature [3][4][5] . However, the method requires larger RNA inputs than are available for xMD-miRNA-Seq and has known biases, particularly in regards to adapter ligation 32 . The NEXTflex method utilizes randomized adapters, which have reduced adapter ligation bias, and shows a different distribution of miRNA abundance with potential other biases 33,34 . More specifically, we have found Illumina libraries tend to have a couple of dominant miRNAs (with RPM values >100,000) representing a major percentage of the reads. Conversely, in our NEXTflex data, no single miRNA was >100,000 RPM (>10% of all reads) ( Table 2). There was a more even distribution of abundance across the miRNAs. Thus, a comparison across these methods may not be as robust as expected 35 . Since this project began on the premise of finding the   most accurate in vivo cellular miRNA expression estimate for deconvoluting tissues, this will remain a challenge if tissue data is generated in a library format that is not fully compatible with the library preparation method of xMD-miRNA-seq. In fact, while the levels of miR-192/215 were 294,000 and 184,000 in two colon samples sequenced using the Illumina system (SRR837836, SRR837842) the RPM value for miR-192/215 was lower, at 44,000-70,000 in the xMD-Epi samples 4 . This RPM value remains too low to explain the high signal from the colon and would challenge deconvolution algorithms. Thus, having cell and tissue data matched by library preparation method, may be necessary for optimal deconvolution studies. Using FFPE tissues for xMD-miRNA-seq has its advantages and disadvantages. Although RNAs degrade in fixed tissues over time, unlike mRNAs, miRNAs generally maintain their expression 36,37 and correlate well with frozen tissues 38,39 . But, miRNA degradation does appear to vary between specific miRNAs with %GC being implicated, suggesting biases in FFPE-derived data 25 . Additionally, RNA quality is often an issue for FFPE tissue due to cross-linking. As well, devitalization times, times to fixation, or postmortem intervals (for autopsy-derived tissues), diminish the quality of the RNA before formalin fixation and storage has even begun for human tissues. These issues are independent of the cross-linking effect and must be overcome by more proactive efforts to accelerate the fixation of tissues.

Epithelial miRNAs xMD-Epi-1 xMD-Epi-2 xMD-Epi-3 xMD-Epi-4 Colonoid
Nonetheless, the ability to generate robust miRNA RNA-seq data directly from cells obtained from FFPE tissues has broad implications. Now, for the first time, we can access mature primary cell types that do not grow in culture. We can obtain the miRNA expression levels for fully mature cells that are in appropriate environments and maintaining cell-cell interactions. As well, stored FFPE tissues are abundant in pathology laboratories worldwide. Thus, one could mine this resource for just about any expression pattern of any cell type in any disease state envisioned. For example, one could isolate just CD4+ T cells from FFPE tissues of a specific tumor microenvironment to explore their miRNA expression patterns and compare these patterns to that of other T cell subsets or T cells taken from a different tissue. The possibilities to further miRNA expression analysis in a variety of disease states are enormous.

Conclusion
This proof-of-concept study shows xMD-miRNA-seq to capture sufficient and specific quantities of a specific cell type for small RNA-seq. This method is the most cost-effective and widely applicable method yet to obtain accurate near in vivo miRNA expression level estimates from primary cells.

Expression microdissection (xMD). Histologic sections of formalin-fixed paraffin-embedded (FFPE)
sections (4 micron thick) of colon were prepared. Samples were obtained from two anonymous subjects of normal bowel obtained from surgery for a colonic mass lesion. Tissue collection was done in accordance with and approved by the Investigational Review Board of Johns Hopkins Hospital. Sections were immediately placed on DRIERITE TM desiccant in a vacuum desiccator cabinet until IHC was performed. Immunohistochemistry was performed as follows: Slides were baked for 20 minutes at 60 °C. Paraffin was removed from cooled slides by immersion in two changes of xylene followed by graded washes of ethanol and water to rehydrate tissues. Antigen retrieval was performed by immersing the slides in Trilogy (Sigma-Aldrich, St. Louis, MO) in a pressure cooker to 126 °C and 18-23 psi. Slides were stained for cytokeratins AE1/AE3 (Diagnostic Biosystems, Pleasanton, CA) followed by incubation with a polymer HRP IgG (Leica Biosystems, Buffalo Grove, IL.) Protector RNase Inhibitor (Roche Diagnostics, Mannheim, Germany) was added to both primary and secondary antibody incubations at a concentration of 1U/μl. The antibody complex was detected with Deep Space Black (Biocare Medical. Concord California). After dehydration with xylene, slides were air dried and then stored in Drierite (W.A. Hammond Drierite Co., Xenia, OH.). xMD was performed following previously described methods 20 . Briefly, the slides were covered with ethylene vinyl acetate (EVA) polymer film (9702, 3 M) and vacuum sealed using a FoodSaver bag system. Sensepil lamp (HomeSkinovations) was used to irradiate the tissue through a blotting paper diffuser using 5 discharges at intensity level 5 (Supplementary Figure 2). RNA isolation. RNA was extracted from the EVA films using Qiagen miRNeasy FFPE kit. Briefly, the films were completely submerged in PKD buffer: proteinase K (15:1) and heated for 15 min at 56 °C with shaking at 500 RPM. The proteinase K was quenched for 15 min at 80 °C and then immediately placed on ice for 3 minutes. Following 15 minutes of centrifugation at maximum speed, DNAse booster was added (1:10) and incubated at room temperature for 15 minutes. Buffer RBC was added (2:1) and the sample passed over the Qiagen column and washed. RNA was eluted in 20 μL RNase free water. RNA was quantified using the high sensitivity RNA assay on a Qubit 3.0.

RNA-seq for xMD derived cells.
Library preparation and sequencing of the xMD samples was performed at the Lieber Institute for Brain Development. Libraries were created from 50 ng of the total RNA extracted from the xMD samples using the BIOO Scientific NEXTflex library preparation kit V3 for Illumina. Small RNAs were size selected using a PAGE gel. Eighteen cycles were performed for PCR amplification. Libraries were generated using the Perkin Elmer Sciclone G3 machine using the Maestro Software version 6.0.13.152 and following the Sciclone NGS and NGSx workstation automation guide for this kit, which is downloadable from www.biooscientific.com. Libraries were sequenced with 50 base pair single-end reads using the Illumina HiSeq 3000, in a single lane with no other samples. The Illumina Real Time Analysis (RTA) module was used to perform image analysis and base calling and the BCL Converter (CASAVA v1.8.2) was used to generate the sequence reads. A sequencing depth of greater than 31 million reads for each sample was obtained, (Table 1). Adapters and the first and last 4 bases were trimmed from the FASTQ files using Cutadapt version 1.14 40 .

Flow sorted colonic epithelial cells.
Colonic epithelial cells were obtained by the flow sorting (BD FACSAria II) of EpCAM+ cells through a modification of the protocol of Dalerba 42 . RNA was isolated using the miRNAeasy Mini kit. Small RNA-seq data from these cells were previously published in McCall et al. 4 and are in the Sequence Read Archive as SRR5127219.
Illumina TruSeq Small RNA-seq. Small RNA libraries of colonoids and flow sorted epithelium were prepared using the Illumina TruSeq Small RNA Library Preparation kit according to the manufacturer's protocol, as described 4 . Multiplexed sequencing was performed as single-read 50 bp, using rapid run mode and v2 chemistry on a HiSeq 2000 system (Illumina).
Additional small RNA-seq data. RPM data for additional primary cells was obtained from the NCBI BioProject PRJNA358331 (SRA accession numbers SRR5127200-36 and SRR5139121) of McCall et al. 4 . This includes the flow-sorted epithelial cell described in further detail above. Other specific comparison samples used below were SRR5127200 (lymphocyte), SRR5127232 (red blood cell), SRR5127228 (endothelial), and SRR5127231 (fibroblast).
Small RNA-seq data processing. All samples were processed using miRge 22 or miRge 2.0 21 , an updated version of miRge. Both miRge and miRge 2.0 utilized custom expression libraries to which collapsed reads are aligned using Bowtie 43 . To match BioProject PRJNA358331 data for the purposes of RUV normalization, we processed the four new xMD-miRNA-seq samples using the original miRge. miRNAs that were parts of repeat elements (ex. miR-5100) or coding regions of genes (ex. miR-7704), which were over-represented in this analysis, were removed. We also processed those four samples along with individual colonoid, macrophage, fibroblast, endothelial cell, red blood cell and flow-sorted epithelial cells with miRge 2.0. Using miRge 2.0, we utilized a MirGeneDB 2.0 human library containing 586 miRNA genes 23 , instead of a miRBase v21 library containing 2,588 miRNAs 24 , to focus on the most accurate list of miRNAs. Reads for each sample are reported as absolute read counts and RPM. Somewhat unique to both miRge versions, when two miRNAs are highly similar in that isomiR reads could align to either, the two (or more) miRNAs are clustered together (ex. let-7a-5p/7c-5p). Additionally, after miRge 2.0 processing, the data was hand-curated to look for anomalous miRNAs and five miRNAs (miR-328-3p, miR-34c-5p, miR-330-3p, miR-34a-5p, and miR-34c-5p) were removed from the xMD-miRNA-seq samples due to extreme skewing towards non-canonical miRNAs. As an example, in one specimen (xMD-Epi-2), miR-328-3p had only 25 canonical sequences, but had 6,495 non-canonical isomiR sequences indicating only 0.4% of reads were canonical and likely alignment of non-miR-328-3p sequences to this miRNA. A typical miRNA has >60% of reads being canonical 4 .
Remove Unwanted Variation (RUV) normalization. We utilized the remove unwanted variation algorithm 44 (RUVSeq R/BioC package version 1.8.0) to normalize 165 primary cell small RNA-seq runs from McCall et al. 4 and 4 xMD-miRNA-seq samples.

Assessing technical variation.
To compare miRNA length distribution differences, the length of each canonical and non-canonical (isomiR) sequence was determined for 5 samples: xMD-Epi-2, xMD-Epi-3, a flow collected epithelial cell sample (SRR5127219), a cultured primary endothelial cell sample (SRR5127228) and a cultured primary fibroblast cell sample (SRR5127231). These values were converted to the percent of each length using a histogram function in Excel.
To assess which miRNAs varied the most between technical replicates we took the minimal and maximal RPM values for each miRNA across the two technical replicate xMD-miRNA-seq samples (1 vs. 2 and 3 vs. 4). Then we compared the fold change between these two variables among the 90 miRNAs with a minimum average value of 1,000 RPM. To calculate a coefficient of variation, the same 90 miRNAs were used. These were plotted as coefficient of variation vs. log 2 RPM. There is again a trend for a higher coefficient of variation for the lower abundance miRNAs. However, there are no miRNAs that are particular outliers. This figure has been added to Supplementary Figure 1  Data Availability. The four small RNA-seq datasets on the xMD-Epithelial data and the one colonoid RNA-seq dataset have been submitted to the Sequence Read Archive as BioProject PRJNA445477 samples SRR6895202-5 and SRR6898464.