Comprehensive RNA dataset of tissue and plasma from patients with esophageal cancer or precursor lesions

In the past decades, the incidence of esophageal adenocarcinoma has increased dramatically in Western populations. Better understanding of disease etiology along with the identification of novel prognostic and predictive biomarkers are urgently needed to improve the dismal survival probabilities. Here, we performed comprehensive RNA (coding and non-coding) profiling in various samples from 17 patients diagnosed with esophageal adenocarcinoma, high-grade dysplastic or non-dysplastic Barrett’s esophagus. Per patient, a blood plasma sample, and a healthy and disease esophageal tissue sample were included. In total, this comprehensive dataset consists of 102 sequenced libraries from 51 samples. Based on this data, 119 expression profiles are available for three biotypes, including miRNA (51), mRNA (51) and circRNA (17). This unique resource allows for discovery of novel biomarkers and disease mechanisms, comparison of tissue and liquid biopsy profiles, integration of coding and non-coding RNA patterns, and can serve as a validation dataset in other RNA landscaping studies. Moreover, structural RNA differences can be identified in this dataset, including protein coding mutations, fusion genes, and circular RNAs. Measurement(s) mRNA Sequencing • MicroRNA Sequencing Technology Type(s) sequencer Sample Characteristic - Organism Homo sapiens Sample Characteristic - Location Belgium Measurement(s) mRNA Sequencing • MicroRNA Sequencing Technology Type(s) sequencer Sample Characteristic - Organism Homo sapiens Sample Characteristic - Location Belgium


Background & Summary
Esophageal cancer is the sixth most common cause of cancer-related death worldwide 1 . The incidence of esophageal adenocarcinoma (EAC), a histological subtype of esophageal cancer, has rapidly increased in the Western world in the last decades 2 . Despite improved treatment strategies, the five-year survival rate remains unacceptably low (10-25%) 3,4 . The main risk factors to develop EAC are gastro-esophageal reflux disease (GERD), Barrett's esophagus, smoking and age above 50 years 5 . Barrett's esophagus is a known precursor lesion for EAC where the normal squamous mucosa of the esophagus is replaced by columnar intestinal epithelium triggered by chronical acid stress due to GERD. Specifically, GERD can cause progression from non-dysplastic Barrett's esophagus (NDB) through the stages of low-grade dysplasia (LGD) to high-grade dysplasia (HGD), and eventually to invasive EAC 6 .
Upper endoscopy is not the ideal screening method due to its invasiveness, relatively high cost and above all large incidence of aforementioned risk factors in the general population. Despite high resolution endoscopy and virtual imaging techniques, detecting dysplasia in a long segment of Barrett's esophagus remains challenging. Additionally, there is a low inter-observer agreement among pathologists in grading both low-and high-grade dysplasia, leading to over-and under-diagnosis 7,8 .
Mechanisms that drive EAC development remain poorly understood. The analysis of the transcriptomic landscape of EAC, HGD and NDB can provide further insights into molecular mechanisms involved in the development and progression of EAC. The study of RNA abundance profiles has proven its value to aid in the identification of new biomarkers to improve disease detection, therapeutic decision making, therapy response monitoring, and early relapse detection 9 .
Over the last decade, numerous studies have explored various types of RNA species in tissue biopsies from esophageal cancer patients. For instance, microRNAs (miRNAs) have been identified in tissue biopsies as potential biomarkers for EAC, HGD and NDB 10 . These miRNAs seem to have great potential as a diagnostic marker for Barrett's esophagus in a population at risk (patients with GERD), but further research is required to identify miRNAs for risk stratification. To a lesser extent, messenger RNA (mRNA) expression has been studied in EAC, HGD and NDB as well 11,12 . EAC is characterized by high mutation rates (including TP53 as a driver mutation that is most often found in tumor tissue 13 ). Moreover, EAC as well as Barrett's esophagus tissues are characterized by a large heterogeneity 14,15 . By gaining a deeper understanding in the different molecular subtypes, a more targeted treatment approach can be explored.
Besides gene dysregulation, chromosomal rearrangements can result in fusion proteins. Fusion genes have been reported to be involved in cancer 16 , including EAC [17][18][19] . Identification of fusion genes provides valuable insights in the development of EAC and can potentially be used as biomarkers for detection or therapeutic targeting.
Classically, these molecular profiling studies require the availability of (tumor) tissue that is not always readily available. The past decade, profiling of nucleic acids isolated from liquid biopsies (e.g. blood) for cancer biomarkers has gained increased interest, because this procedure is minimally invasive compared to tissue biopsies. For EAC, a number of studies have identified several miRNAs as putative biomarkers in serum or plasma 20,21 , but further clinical validation studies are needed prior to assessment of clinical utility. Circular RNA (circRNA) is an emerging new type of RNA that has gained interest in the field of cancer biomarker research. Due to their circular covalent structure, circRNAs are more resistant to degradation by exonucleases in the blood. Although the potential as cancer biomarker has been shown in several studies 22,23 , this has not yet been reported in either plasma or tissue from EAC patients.
Quantification of circulating mRNAs as a biomarker are much more challenging, due to their low concentration and fragmentation status in the blood. However, with the refinement of RNA sequencing methods, the detection of circulating mRNA is improving as well. In EAC these circulating mRNAs have not been identified yet, but have shown great potential in other cancer studies 24 .
In this study, we generated a comprehensive dataset that allows exploration of the complex transcriptome landscape of EAC and precursor lesions (HGD, NDB) in 17 patients. It includes polyA+ RNA (tissue samples), mRNA capture-based (plasma) and miRNA expression profiling (tissue and plasma). Exploratory data analysis was done to study protein coding gene mutations, fusion genes, and circRNAs.

Methods
Patient sample collection. Matching tissue and blood samples were obtained from four patients with esophageal adenocarcinoma (EAC), five patients with high-grade dysplasia (HGD) and eight patients with non-dysplastic Barrett's esophagus (NDB) ( Table 1). All samples were collected before treatment with informed consent (EC/2016-0495 and EC/2016-0496, Ghent University Hospital Ethics Committee). Tissue samples were obtained during endoscopy (NDB and HGD) or after surgical resection of the tumor (EAC). At least one of the tissue samples that was collected from the diseased tissue zone was sent for pathological investigation. The other disease tissue samples and healthy esopgahus tissue samples (collected from each patient) were preserved in RNAlater (Qiagen) at 4 °C and transferred to −80 °C the following day for long-term storage. Blood samples were collected in a 6 ml EDTA waste tube followed by a 9 ml sodium citrate (3.2%) VACUETTE blood tube (Greiner Bio-One). Plasma was prepared by centrifugation at 1,800 g for 10 min (full break and acceleration). The clear toplayer (leaving 0.5 cm above the buffy coat) was transferred to cryovials and stored at −80 °C. Time between blood collection and plasma preparation was less than 4 h, except for sample ID2 (6 h) and ID20 (7 h). Hemolysis was determined spectrophotometrically (absorbance at 414 nm) for all plasma samples using Nanodrop (ND1000, Thermo Scientific) (see Supplementary Table 1). RNA extraction, library preparation and sequencing of all samples was performed by Biogazelle (Zwijnaarde, Belgium) as discussed in the next sections. An experimental overview is shown in Fig. 1.

RNa isolation from tissue and plasma samples.
For all tissue samples, total RNA was isolated using the miRNeasy mini kit (Qiagen) with on-column DNase digestion, according to the manufacturer's protocol. RNA concentration was measured with the Qubit 2.0 fluorometer (Thermo Fisher Scientific). The concentration ranged from 16.3 to 2,210 ng/µl, with sample ID43_EAC (disease tissue) having the lowest concentration (Supplementary Table 2). RNA integrity was determined using the Fragment Analyzer (Advanced Analytical Technologies). Most samples (70.6%) had quality scores above 7, the lowest score was 3.4 (disease tissue of sample ID43_EAC) (Supplementary Table 2). RNA was used for polyA+ RNA sequencing and small RNA sequencing.
For all plasma samples, RNA was isolated from 200 µl plasma using the miRNeasy Serum/Plasma Kit (Qiagen) according to the manufacturer's instructions. For RNA used for mRNA capture sequencing, RNA isolation was followed by gDNA removal using the Heat&Run gDNA removal kit (ArcticZymes). Since extra-cellular RNA from plasma is highly fragmented and typically below the detection limit, the RNA concentration or integrity was not estimated. Instead, a volume based RNA input was used for library preparation.
www.nature.com/scientificdata www.nature.com/scientificdata/ PolyA+ RNA sequencing for tissue samples. Libraries were prepared with the TruSeq Stranded mRNA Library Prep kit (Illumina), using 100 ng of RNA as input material. The quality and the size distribution of the libraries was validated on the Fragment Analyzer (Advanced Analytical Technologies) and quantification was done using the Qubit fluorometer (Life Technologies). Libraries were normalized and samples were pooled accordingly. Samples were paired-end sequenced with a read length of 2 × 75 base pairs (bp) on a NextSeq 500 (Illumina) instrument according to the manufacturer's instructions. mRNA capture sequencing for plasma samples. Libraries     www.nature.com/scientificdata www.nature.com/scientificdata/ (Advanced Analytical Technologies) was used to validate size distribution and quality of the libraries and quantification was done using Qubit fluorometer (Life Technologies). Libraries were normalized and samples were pooled accordingly. Samples were paired-end sequenced with a read length of 2 × 75 bp on a NextSeq 500 (Illumina) instrument according to the manufacturer's instructions. Sequencing was done in two runs for all samples to obtain sufficient sequencing depth. For sample ID37_NDB, reads from only one run have been included, since the first run contained an insufficient number of reads (less than 2,000) for this sample.
Small RNA sequencing for tissue and plasma samples. Libraries were prepared using the NEBNext small RNA library prep kit (New England Biolabs) for both tissue and plasma samples. For tissue and for plasma, 100 ng and 6 µl of total RNA was used as input, respectively. Library size selection was done with the Pippin Prep system (Sage Science) to select the ~147-157 nt fragments containing mature miRNAs. Libraries were normalized based on qPCR quantification and pooled accordingly. Pools were concentrated with ethanol precipitation and quantification with the Qubit 2.0 fluorometer (Thermo Fisher Scientific). Tissue and plasma samples were single-end sequenced with a 75 bp read length on a NextSeq 500 (Illumina) instrument according to the manufacturer's instructions.
Data processing of mRNA sequencing data. Pre-processing of mRNA sequencing data of plasma and tissue samples included 3′-end trimming, adapter removal and filtering (discard reads smaller than 20 nt) using Cutadapt (v1.18). Low quality read pairs were removed using Biopython (v1.72) by keeping pairs with minimal 80% of their length having a Phred score greater or equal than 19. Clumpify (BBMap v38.26) was used for read duplicate removal for plasma samples only, due to the low RNA input. STAR (v2.6.0) was used for mapping (GRCh38 v91) and quantification was done with HTSeq (v0.11.0). Individual QC reports were generated with FastQC (v0.11.8) and multiQC (v1.8) was used to combine these reports for tissue and plasma samples. Annotation was based on GRCh38, UCSC Genome Browser (reference genome) and GENCODE v20, Ensembl 84 (reference transcriptome). The number of mapped reads remaining after the different pre-processing steps in tissue and plasma samples is shown in Table 2. The R packages edgeR (v3.28.1) and limma (v3.42.2) were used for normalization (Trimmed Mean of M-values) differential gene expression (tissue)/ abundance (plasma) analysis, respectively. Prior to these analyses, genes were filtered based on more than four counts in at least half of the samples per group (EAC, HGD, NDB). The Gene Set Enrichment Analysis (GSEA) tool (v4.1.0) was used to identify sets of genes that are significantly different between two groups 25 . As input for the analysis, a ranked list based on log2 fold change of all genes was used. For the purpose of this study, two collections of the Molecular Signatures Database (MSigDB) were used: the hallmark 26 and the C2 chemical and genetic perturbations gene sets.
Data processing of small RNA sequencing data. Adapter trimming was applied to all small RNA sequencing reads of tissue and plasma samples, followed by mapping to the GRCh38 reference genome with Bowtie (v1.2.2). No mismatches were allowed for mapping reads smaller than 25 nucleotides, while for the longer reads a maximum of two mismatches were allowed. Annotation was based on Ensembl (v84), UCSC (hg38) and miRBase (v21). Mapped reads were annotated to mature miRNAs as well as other small RNAs, including tRNA, rRNA, sn(o)RNAs and piRNAs. Here, we only present the miRNA results. The number of mapped reads remaining after the different pre-processing steps in tissue and plasma samples is shown in Table 2. The R packages edgeR (v3.28.1) and limma (v3.42.2) were used for normalization (Trimmed Mean of M-values) and differential miRNA expression (tissue)/abundance (plasma) analysis, respectively. Prior to these analyses, genes were filtered based on more than four counts in at least half of the samples per group (EAC, HGD, NDB). www.nature.com/scientificdata www.nature.com/scientificdata/ both Bowtie2 (v2.3.4.1) and Bowtie (v1.1.2) respectively. First, reads are aligned onto the genome and transcriptome using TopHat2 in order to reduce false positive reads aligned in the TopHat-Fusion alignment. BEDTools (v2.26.0) was used to convert BAM files to fastq files. The "parse", "annotate", "assemble" and "denovo" modules in CIRCexplorer2 were used according to the user's manual 27 . Variant analysis of mRNA capture sequencing data. RNA sequencing data can be used for variant analysis, as previously demonstrated 28 . Using the RNA sequencing data from tissue and plasma samples, variants were identified using the following pipeline (based on Piskol et al. 29 ): the first ten bases of all paired-end reads of each sample were trimmed due to possible false positives that can occur here as a result of random priming. The remaining sequence was aligned against the human reference genome build GRCh38 using STAR (v2.6.0c, two-step mode). Next, Mutect2 was used to call variants using default settings following the GATK (v3.8.0) best practices workflow, which included base-recalibration and duplicate removal with Picard (v.2.21.6) 30 . Variants located within four nucleotides of splice-junctions, in homopolymeric regions or regions overlapping other repeat types were removed. For each of the remaining variants, a BLAT (v3.5) analysis was performed to assess the quality of the reads contributing to the variant call 31 . This helped identify and filter out variants introduced by misaligned reads. Afterwards, variants were filtered differently depending on the tissue of origin. For healthy and tumor tissue samples, variants supported by at least 20 reads in total (DP > 20) and four reads for the alternative allele (AD > 4) were retained. In addition, variants found in more than one gnomAD 32 (v3.1) sample or having allele frequencies below 20 or above 80 percent were removed in the tissue data. Next, variants identified in the healthy tissue were subtracted from the tumor variant list to obtain a list of tumor-specific variants. In a last phase, the disease-specific variant list was intersected with a list of variants in plasma. These results were filtered to only keep variants that have a coverage of at least two reads.

Analysis of circRNAs in
Fusion gene analysis in polyA+ and mRNA capture sequencing data. Fusion gene analysis was done on all tissue (polyA+ sequencing data) and plasma samples (mRNA capture sequencing data). Adapter clipping and quality trimming from all sequencing reads was done using Trimmomatic (v0. 35). After 3′ quality trimming, fusion genes were detected using a pipeline based on the FusionCatcher methodology (v0.99.7c). Mapping to the reference genome (Ensembl release 84) was performed with STAR (v2.5.1b) using the 2-pass mode and duplicates were removed with Picard tools (v2.7). This analysis results in a list of candidate fusion genes with the presumed breakpoint ("fusion junction").

Data Records
This dataset includes mRNA and small RNA sequencing data from four patients with EAC, five patients with HGD and eight patients with NDB. For each patient, RNA from matching tissue (healthy esophagus and disease) and plasma was sequenced, resulting in 102 sequenced libraries from 51 samples. Clinical information of the 17 patients is available in Table 1, including age at diagnosis, tumor stage and/or Barrett's segment and follow-up information (if known). An overview of all available data and access information is provided in Table 3. For this publication, raw data was pre-processed using in-house optimized pipelines (Biogazelle and Ghent University), resulting in 119 expression profiles: 34 mRNA and 34 miRNA expression profiles from healthy and disease tissue samples, 17 mRNA and 17 miRNA expression profiles from plasma, and 17 circRNA expression profiles (based on mRNA sequencing data) from plasma. Count tables have been deposited in the ArrayExpress 33 database at EMBL-EBI. In addition, results from variant-and fusion gene analysis are available as supplementary tables (Supplementary Tables 4, 5).
All pre-processed mRNA, miRNA and circRNA expression data for tissue and plasma samples was also uploaded to the R2 Genomics Analysis and Visualization Platform (http://r2.amc.nl), an online genomics data visualization tool. The user-friendly web application allows rapid and easy visualization of the data, including gene expression analysis, gene correlation analysis and visualization of one or multiple genes.
All raw sequencing data (polyA+, mRNA capture, small RNA) is available through the European genome-phenome archive (EGA) under accession number EGAS00001004939 34 . Data requests can be made by contacting the Data Access Committee, as stated on the EGA information page of the study (https://ega-archive. org/studies/EGAS00001004939). A Data Transfer Agreement (DTA) and Data Access Agreement (DAA) will have to be signed in order for the data to be transferred (a template can be found in Supplementary File 1). The raw sequencing data available at EGA were not part of the peer-reviewed content of this manuscript.

technical Validation
Assessment of RNA sequencing quality. mRNA sequencing quality. The mean sequencing quality per base (raw data) for mRNA tissue and plasma is higher than 28 for all samples (Fig. 2a), reflecting the very good quality of the data. The average number of reads for mRNA tissue and plasma samples throughout the pre-processing steps is shown in Table 2. For all tissue samples, 19-25 million reads per sample remain after trimming and filtering, except for sample ID40_NDB (disease tissue) that has a slightly lower number of reads (14.5 million). For the plasma samples, on average 3.2 million reads remain after filtering, trimming and deduplication.
For further downstream analyses, sample ID40_NDB was excluded due to the lower library yield (measured as described above) of the disease tissue sample (data not shown) and the lower percentage (68%) of reads with a quality score higher than 30, compared to all other tissue samples (85% on average). Sample ID43_EAC was also excluded for downstream analyses, due to the lower library quality of the disease tissue sample. This was likely due to the low concentration (16.3 ng/µl) and low RNA quality score (3.4) (Supplementary Table 2). (2022) 9:86 | https://doi.org/10.1038/s41597-022-01176-x www.nature.com/scientificdata www.nature.com/scientificdata/ small RNA sequencing quality. The mean sequencing quality per base (raw data) of the small RNA sequencing data (tissue and plasma) is higher than 28 for the first 60 bp in all samples (Fig. 2a), reflecting the very good quality of the data. The sequencing quality for samples ID26_HGD (healthy tissue) and ID19_NDB (disease tissue) decreases slightly towards the end of the reads (>60 bp). However, as most small RNAs are typically around 20-30 nucleotides in length, a good quality measure for the first 30 nucleotides of the 5′-end of the read is more relevant in the context of small RNA expression analysis. The number of remaining miRNA reads per sample after pre-processing is 5-10 million reads for tissue samples and 1-3 million for plasma samples ( Table 2).

Successful detection of thousands of RNa genes in tissue and plasma. Expressed mRNAs, miR-
NAs and circRNAs have been identified in all tissue and/or plasma samples ( Table 4). As expected, fewer unique mRNAs and miRNAs were found in plasma compared to tissue samples. In EAC samples, fewer unique circRNAs were found (353-1,165) compared to HGD (858-3,624) and NDB (1,237-3,683).

Validation of mRNa abundance data. mRNA in tissue.
Several studies have reported lists of differentially expressed genes in EAC, HGD and NDB compared to healthy tissue samples 11,12,35 . However, the overlap among these reported genes is limited. Tables 5 and 6 show the overlap of differentially expressed genes (adjusted p-value < 0.05) between EAC and healthy tissue from three large studies 11,12,35 and our own dataset. A significant overlap (Fisher's exact test; Benjamini-Hochberg adjusted p-value < 0.05) was observed between the differentially expressed genes reported in this study and the three published gene sets.
GSEA in tissue revealed several interesting gene sets that are enriched in disease tissue (EAC or NDB) compared to healthy tissue, and EAC compared to NDB tissue (Supplementary Table 3). For example, comparing EAC tissue with healthy tissue the following relevant gene sets were significantly (FDR < 1%) positively enriched in EAC: HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION, HALLMARK_KRAS_SIGNALING_ UP and WANG_ESOPHAGUS_CANCER_VS_NORMAL_UP 35 . Comparing EAC with NDB tissue samples, the WANG_BARRETTS_ESOPHAGUS_UP 35 gene set was significantly negatively enriched in EAC (FDR < 1%). These GSEA results (FDR < 25%) are available in Supplementary Table 3.

mRNA in plasma.
There are currently no studies reporting on mRNAs in plasma of patients with EAC, HGD or NDB. Using the sample clustering option in R2 for the plasma mRNA expression level data, a clear clustering of the samples according to sample identity, i.e. EAC samples versus HGD and NDB samples (Fig. 2b) is observed. If we look into more detail we observe that some of the differentially expressed mRNAs in tissue of patients with EAC compared to NDB are also differentially abundant in the plasma samples (in the same direction). More specifically, there is an overlap of 11 up-and 24 downregulated genes, as shown in the heatmap in Fig. 2c.
When comparing EAC with NDB plasma, several relevant gene sets showed positive enrichment in EAC, including HALLMARK_MYC_TARGETS V1 and V2 (FDR < 1%). Deregulation of MYC is known to play a key role in the development of EAC 36,37 , indicating that tumor signal may be present in plasma. These GSEA results (FDR < 25%) are available in Supplementary Table 3.
Markers for epithelial mesenchymal transition (EMT) are of clinical relevance for a more targeted treatment 38 . The process of EMT enables cancer cells to enter the blood stream and form local and distant metastasis 39 . Several EMT markers have been identified in EAC as well as in precursor lesions (NDB) 40,41 , suggesting that this process could be an early event for progression to EAC. Importantly, ZEB1 is a gene involved in EMT 42,43   www.nature.com/scientificdata www.nature.com/scientificdata/ and in this data it was found to be significantly higher in EAC compared to NDB in both tissue and plasma (Benjamini-Hochberg adjusted p-values are 2.62 × 10 −2 and 3.01 × 10 −2 , respectively).
Maag et al. 11 Lv et al. 12 Wang et al. 35 tissue data from this study (including all 34 samples) Maag et al. 11   Lv et al. 12 Wang et al. 35 tissue data from this manuscript (including all 34 samples) Lv et al. 12 57 Wang et al. 35   www.nature.com/scientificdata www.nature.com/scientificdata/ blood fractions, including serum 21,55-59 , plasma 54 and extracellular vesicles 53 were studied. With our analysis pipeline, no differentially abundant miRNAs between the plasma samples of the different groups were identified (Table 7).

Usage Notes
Gene expression and abundance analysis. Differential gene expression and abundance analyses were performed for mRNAs, miRNAs and circRNAs in tissue and plasma. The number of differentially expressed genes are depicted in Table 7. The pre-processed data is also uploaded in R2, allowing further exploration and visualization of the dataset. In this study, we have identified several circRNAs in plasma of patients with EAC, HGD and NDB. This type of RNA has great potential as circulating biomarker because they are more resistant to RNA degradation by exonucleases due to their circular structure. While we focused on miRNA expression and abundance analyses using the small RNA sequencing data, other small RNAs such as tRNA (fragments), and piRNAs could be analyzed using our data as well.
Expression of related miRNAs and mRNAs. One of the unique features of our dataset is the inclusion of both miRNA and mRNA data of matching disease and healthy tissue samples. The relationship between miRNA and mRNA expression can thus be studied in our data. As an example, the hedgehog (HH) signaling pathway is known to play an important role in EAC and NDB 60 . In NDB, increased expression of hsa-miR-194 results in a loss of SUFU, which leads to an upregulation of the Sonic Hegdgehog (SHH) gene. The upregulation of hsa-miR-194 and SHH, and downregulation of SUFU compared to healthy tissue is also observed in our NDB tissue data as well as in the EAC and HGD tissue samples (Figs. 2e and 3). These unique matched disease and healthy fractions dataset allows further exploration of potentially relevant pathways, i.e. by using both miRNA and mRNA data, as demonstrated by this example.   www.nature.com/scientificdata www.nature.com/scientificdata/ Mutation analysis. Based on the polyA+ sequencing data (tissue) and mRNA capture sequencing data (plasma), mutation analysis was performed. For each patient, disease specific variants were identified using strict filtering as described in the methods section. Subsequently, these variants were intersected with variants in plasma. In total, 24 variants were identified in the plasma of two EAC patients, five HGD patients and four NDB patients (Supplementary Table 4). Per patient, 1-7 variants were found, but no overlap was observed within a disease group or between groups. Three variants are known tumor mutations according to the COSMIC database in prostate cancer (COSM5564582), cervix or biliary tract cancer (COSM5493837), or large intestine cancer (COSM5756079). These results are a proof-of-concept to demonstrate the ability to identify likely somatic mutations or disease-specific RNA-editing events in plasma RNA sequencing data.
Fusion gene analysis. Fusion gene analysis in EAC tissue has been reported in only a few studies [17][18][19] . Here, we demonstrate the potential of detecting fusion genes for EAC, HGD and NDB tissue and plasma samples. Results obtained from these analyses are provided in Supplementary Table 5. Results in this table are unfiltered, but in red are the fusion genes that have a high probability of being a false positive. In tissue samples, potential fusion genes were identified in all samples. By excluding (on a per sample basis) fusion genes also found in the healthy tissue, disease-specific fusion genes were identified. As a result, for all samples 2-14 fusion genes remain (excluding the potential false positives). For the plasma samples, potential fusion genes are identified in one HGD patient sample and in two NDB patient samples, with two overlapping fusion genes (ID5_HGD and ID19_NDB). No overlapping fusion gene between disease tissue and plasma samples was observed. Further validation of these potentially relevant fusion genes is required.

Code availability
All code used for pre-processing mRNA and miRNA sequencing data is publicly available on GitHub (https:// github.com/OncoRNALab/exRNAQC/blob/main/Preprocessing) 61 . For circRNA detection, the CircExplorer2 manual was followed as described in the Methods section. Further downstream analyses (differential expression, GSEA, fusion gene detection, and variant analysis) was done following the guidelines of the different R packages and software tools as described (with the used versions) in the Methods section.