Extracellular circular RNA profiles in plasma and urine of healthy, male college athletes

Circular RNA (circRNA) are a recently discovered class of RNA characterized by a covalently-bonded back-splice junction. As circRNAs are inherently more stable than other RNA species, they may be detected extracellularly in peripheral biofluids and provide novel biomarkers. While circRNA have been identified previously in peripheral biofluids, there are few datasets for circRNA junctions from healthy controls. We collected 134 plasma and 114 urine samples from 54 healthy, male college athlete volunteers, and used RNASeq to determine circRNA content. The intersection of six bioinformatic tools identified 965 high-confidence, characteristic circRNA junctions in plasma and 72 in urine. Highly-expressed circRNA junctions were validated by qRT-PCR. Longitudinal samples were collected from a subset, demonstrating circRNA expression was stable over time. Lastly, the ratio of circular to linear transcripts was higher in plasma than urine. This study provides a valuable resource for characterization of circRNA in plasma and urine from healthy volunteers, one that can be developed and reassessed as researchers probe the circRNA contents of biofluids across physiological changes and disease states.


Background and Summary
The advent of next-generation sequencing has spurred the discovery of a growing list of RNA biotypes, many of which are detectable across species, detected in numerous biofluids, and have biological function. While many studies have focused on microRNAs (miRNA), several other small RNA species (e.g. piwi-interacting RNAs (piRNA), tRNA fragments, and Y RNA fragments have been detected across a range of biofluids and are being developed as clinical biomarkers [1][2][3][4] . In addition to these linear RNAs, the discovery and detection of circular RNAs (circRNA), those with a covalently closed loop structure, have gained attention.
CircRNAs were initially discovered by electron microscopy, in the 1970s, as viroid molecules 5 . Nearly two decades later, circRNA were identified for a handful of mammalian genes [6][7][8] . Though initially thought to be rare splicing events, circRNAs have recently been identified as an abundant, endogenous RNA species in a number of organisms from Archaea to yeast, plants, worms, flies, fish, and mammals [9][10][11] . Additionally, circRNAs are abundantly expressed in a number of human tissues and cell types, and circRNA expression changes during development, and as a response to extrinsic factors such as stress, immune response, and hormonal stimuli [12][13][14][15][16][17] . These endogenous RNAs are characterized by their circular structures, which are formed by a back-splicing event that covalently links the 3′ "tail" splice donor with the upstream 5′ splice acceptor "head" of the transcript, forming a back-spliced, or "head-to-tail" junction. While circRNA function is still being elucidated, there are examples of circRNA inhibiting microRNA, regulating alternative splicing, and modulating the expression of parental genes [18][19][20][21][22] .
In comparison to their linear counterparts, circRNA transcripts can be more abundant and have greater stability as they are resistant to linear decay mechanisms and do not contain 5′-3′ polarity nor polyadenylated tails 14,21,23,24 , suggesting feasibility as stable biomarkers. CircRNA stability and detection in biofluids, saliva 25 , blood 24,[26][27][28][29][30] , and urine [31][32][33] , comes, in part, from their being protected in extracellular vesicles 28,[34][35][36] . Changes in circRNA expression is altered in multiple diseases, including preeclampsia, glioblastoma and colorectal RNA isolation, library preparation, and sequencing. For 51,52 . Reads were mapped to the genome with the recommended aligner and alignment parameters for each program: STAR v2.4.0j for DCC and CIRCexplorer, bowtie2 v2.2.1 for find_circ and KNIFE, bowtie v0.12.9 for MapSplice2, and bwa v0.7.13 for CIRI2. CircRNA prediction was then completed with the suggested parameters for each program, with the exception of incorporating a minimum 18nt overlap on either side of the junction. CircRNAs were kept for downstream analysis if they 1) had 2 or more junction counts and 2) were identified in at least 5 samples for each respective program. www.nature.com/scientificdata www.nature.com/scientificdata/ for comparison. BED12 GRCh37 RefSeq gene annotation files were obtained from UCSC (http://genome.ucsc. edu/cgi-bin/hgTables), and bedtools v2.26.0 was used to infer genes from reported backsplice junction genome locations 53 . Data were analyzed using the R v3.3.2 statistical package (https://cran.r-project.org). UpSet plots were generated using the UpSetR v1.3.3 package 54 . www.nature.com/scientificdata www.nature.com/scientificdata/ Quantification of circRNA expression. CircRNA count expression data was obtained from each respective bioinformatic program. Junction reads per million (JRPM) were calculated according to the total number of junction reads found in each sample as identified by STAR (both canonical and chimeric); therefore, JRPM = (cir-cRNA count/junction reads) * 1,000,000. The circular-to-linear ratio (CLR) for each circRNA was calculated as described previously 13,27 , by counting the linear spliced reads identified by STAR on the 5′ and 3′ flanks of  Table 2. CircRNA program characteristics. *The latest version of CIRCexplorer now supports paired-end reads. **CIRCexplorer and DCC use the STAR chimeric junctions output, so the junction overlap for these tools is set by the splice junction parameters during STAR alignment. ***The default minimum seed length (k) for bwa mem is 19 nucleotides. www.nature.com/scientificdata www.nature.com/scientificdata/ each circRNA junction, and dividing the back-spliced read count by the flank with the highest count; therefore, CLR = circRNA count/max (5′ linear junction count, 3′ linear junction count). In order to avoid division by zero, if no linearly spliced reads were detected, a pseudo count of 1 was added to the denominator. The number of reads assigned to the transcriptome was calculated using featureCounts (subread v1.5.1) with the Ensembl75 gene annotation 55 . Differential expression analysis was performed using DESeq. 2 v1.14.1 56 , after filtering to select samples which had detected at least 300 circRNA/sample as well as exclusion of circRNA that were expressed in less than 50% of samples.
DNA isolation and qRT-PCR. After centrifugation of blood samples, DNA was isolated from the buffy coat using the DNeasy Kit (Qiagen, Cat. No.: 69504). Previously isolated RNA from samples matching those used for library prep were selected for cDNA synthesis. cDNA was synthesized with random hexamers using the SuperScript III First-Strand Synthesis System for RT-PCR following manufacturer's protocols (Invitrogen,  Detected in 40% of samples 538 28 24 Detected in 50% of samples 395 16 16 Detected in 60% of samples 273 14 11 Detected in 70% of samples 177 10 10 Detected in 80% of samples 68 4 2 Detected in 90% of samples 15 2 1 Detected in 100% of samples 0 0 0 www.nature.com/scientificdata www.nature.com/scientificdata/

Technical Validation
CircRNA set size and genomic alignment. The set size (all circRNA in any sample by one tool) ranges from 1,835 to 7,462 and 163 to 1,349 in plasma and urine, respectively (Table 3). 965 and 72 circRNA were detected across all six tools in plasma and urine, respectively (Fig. 2a,c, red bars; Table 4; full list in figshare File 1 and 2 61 ). KNIFE predicted the most circRNA per sample in plasma and urine, while MapSplice predicted the fewest (Table 3). Table 5 displays the correlations between all of the tools, CIRCexplorer and DCC had the highest correlation. 85% (61 of the 72) of the circRNAs found in urine were also detected in plasma (Table 4). Figure 2b (plasma) and 2d (urine) display the number of detected circRNAs and the number that span introns, exons, and UTRs for both plasma and urine. The majority of circRNA identified in plasma and urine contain at least two exons and span an intron; 671 in plasma and 52 in urine; green bars (Fig. 2b, plasma and 2d, urine). A small number of circRNA are transcribed from a single exon (15 in plasma and 2 in urine).  www.nature.com/scientificdata www.nature.com/scientificdata/ Highly expressed, back-spliced junctions were validated by qRT-PCR. In order to validate predicted back-spliced junctions by qRT-PCR, we designed inward-facing primers for the 15 most highly expressed circRNA in each biofluid and tested each primer pair in samples from 10 different individuals, using the same source RNA for cDNA synthesis that was used for RNAseq (Fig. 3a,b). Figure 3a shows that the 15 circRNAs are detected in most of the 10 plasma samples. The numbers of samples are described in Table 6, and compared with the RNASeq detection for those circRNAs in the same samples. 13 primer pairs were validated in urine. Detection in urine samples was sparse, with fewer samples positive for each circRNA than for plasma ( Fig. 3b and Table 6). For the two back-spliced junctions detected in RNASeq data, but not validated by qRT-PCR in urine (circMYO5B and circPHC3), it is possible that the circRNA primers did not work, or there were qPCR inhibitors in the sample, or the circRNA was not present. Two of the samples did not have enough assigned reads via RNASeq to be included, so the total number of samples was 8. In order to rule out chimeric junctions that might be present in DNA or resemble artifacts introduced during library preparation, we also used genomic DNA (gDNA) from each individual as a negative control. All 15 primer pairs used in the plasma and urine samples were not detected in gDNA (data not shown). Table 7 describes the rank from highest to lowest expression for each of the circRNA validated by qRT-PCR, and compares it with the expression detected with sequencing. Their ranks do not correlate well between the two platforms.
Circular-to-linear RNA ratios. While the overall expression of most circRNAs is low compared to their linear counterparts, there are a number of circular RNA transcripts that have been described as more abundant than their linear host, cellularly as well as extracellularly 23,27,62,63 . We examined the circular-to-linear ratio (CLR) of circRNA transcripts found in plasma and urine as described previously; by taking the ratio of the circular, back-spliced junction counts compared to the linear count of the nearest 5′ or 3′ splice junction 13,24,27 . On average, 28.5% of circRNA transcripts in plasma and 21.5% of circRNA transcripts in urine have higher expression than their linear host gene (Fig. 3c, plasma and 3d, urine). Extracellular RNA is often fragmented and may have a 3′ www.nature.com/scientificdata www.nature.com/scientificdata/ bias 64 . Before examining the expression of circular RNA in relation to their host genes, we calculated the overall 5′ to 3′ coverage of linear transcripts and did not find a bias in our samples.
Participants sequenced 5 or more times have less inter-sample variation. A notable feature of this dataset is that many participants were sampled longitudinally, allowing for analysis of circRNA stability in individuals versus the entire dataset. Figure 4a,b show longitudinal circRNA expression in the same participants in plasma and urine, respectively. Broadly speaking, the heatmaps demonstrate similar expression patterners in the same participant over time. In order to assess variability within individuals, we calculated the coefficient of variation (CV) of circRNA expression, normalized to junction reads per million (JRPM). Here, we focus on participants sampled on 5 or more occasions over approximately one year. In both plasma and urine, the CV for each individual participant is displayed along with the CV for all participant samples. The data indicate that individuals have a statistically-significant consistency in circRNA expression pattern over time (Fig. 4c, plasma and 4d, urine).

Usage Notes
As the approach to detecting circRNA from RNA-Seq data differ with available tools, we employed 6 different bioinformatic tools: CIRCexplorer, CIRI2, DCC, KNIFE, find_circ, and MapSplice, in two clinically relevant biofluids, plasma and urine, using 134 and 114 samples, respectively. Most of these circRNA pipelines use an external aligner, such as bowtie, STAR, or bwa, to align reads to the genome and/or transcriptome (Table 2). After alignment, reads that contiguously align to the genome and/or transcriptome are filtered out, and the remaining unmapped reads are further filtered to identify back-spliced junctions. Differences in circRNA identification algorithms include: 1) how paired-end reads and gene annotations are used, if at all, 2) the amount of overlap www.nature.com/scientificdata www.nature.com/scientificdata/ over the junction that a read must contain, 3) the types of junctions considered, and 4) various filtering steps (Table 1) 65 . We sought to generate a high confidence set of circRNA expressed in plasma and urine with the following requirements for each circRNA: 1) detection in at least 5 samples for each respective biofluid, 2) a minimum 18 nt overlap on either side of the junction, 3) at least two reads spanning the back-spliced junction, and 4) identification by all 6 tested bioinformatic tools as identification can vary widely between tools [66][67][68] .
We tested alignment parameters and their influence on the detection rate of circRNA and found that the number of input reads, genome mapped reads, and junction reads did not correlate well with the number of circRNA detected per sample; rather the number of reads assigned to the transcriptome had the greatest correlation with the number of circRNA (R 2 = 0.805; data not shown).

Code availability
Code used for circRNA identification is available in the Supplemental data. Software versions used for analysis are as follows: STAR v2. 4