Cross-sectional study of human coding- and non-coding RNAs in progressive stages of Helicobacter pylori infection

Helicobacter pylori infects 4.4 billion individuals worldwide and is considered the most important etiologic agent for peptic ulcers and gastric cancer. Individual response to H. pylori infection is complex and depends on interactions between host and environmental factors. The pathway towards gastric cancer is a sequence of events known as Correa’s model of gastric carcinogenesis, a stepwise inflammatory process from normal mucosa to chronic-active gastritis, atrophy, metaplasia and gastric adenocarcinoma. This study examines gastric clinical specimens representing different steps of the Correa pathway with the aim of identifying the expression profiles of coding and non-coding RNAs that may have a role in Correa’s model of gastric carcinogenesis. We screened for differentially expressed genes in gastric biopsies by employing RNAseq, microarrays and qRT-PCR. Here we provide a detailed description of the experiments, methods and results generated. The datasets may help other scientists and clinicians to find new clues to the pathogenesis of H. pylori and the mechanisms of progression of the infection to more severe gastric diseases. The data are available via ArrayExpress.


Background & Summary
Helicobacter pylori is one of the most successful human bacterial pathogens, infecting 4.4 billion individuals worldwide 1 . Infection can induce gastric pathologies ranging from chronic gastritis in all infected individuals to peptic ulcers (in 15-20% of patients) and gastric cancer (0.5-1% of patients) 2 .
Individual response to H. pylori infection is complex and depends on a combination of environmental factors, genetic background, host response and strain virulence 3 . The pathway towards gastric cancer is a sequence of events known as Correa's model of gastric carcinogenesis, a stepwise inflammatory process from normal mucosa through chronic-active gastritis (CAG), atrophy (AT) and intestinal metaplasia (IM) to gastric adenocarcinoma 4 .
This study examines gastric clinical specimens representing different steps of the Correa pathway with the aim of identifying the expression profiles of coding and non-coding RNAs (microRNAs and small RNAs) that may have a role in Correa's model of gastric carcinogenesis and, potentially, of developing novel clinical biomarkers.
RNAseq (for microRNAs and non-coding RNAs) and microarrays (for coding RNAs) were used to screen for differentially expressed genes in gastric biopsies (antrum/corpus). The expression of a selection of genes was confirmed in a validation cohort of patients using quantitative real-time PCR (RT-qPCR). The general study design is illustrated in Fig. 1. Here we provide a detailed description of the experiments conducted, methods used and results generated. The datasets may help other scientists and clinicians to find new clues to the pathogenesis of H. pylori and the mechanisms of progression to severe disease states. The transcriptomics data is available in the ArrayExpress database 5 .
Scientific Data | (2020) 7:296 | https://doi.org/10.1038/s41597-020-00636-6

70 °C. Then the 5′-adapter was added alongside, using a truncated T4-RNA ligase 2 (Cat. No. M0351S, New England Biolabs, MA, USA) in an incubation at 28 °C for 1 hour. Half of the ligation product was used for reverse transcription, performed with SuperScript II reverse transcriptase (Cat. No. 18064-014, ThermoFisher, MA, USA) in a thermocycler for 1 hour at 50 °C. Next, the cDNA was enriched by PCR cycling: 98 °C for 30 secs; 11 cycles of 98 °C for 10 secs, 60 °C for 30 secs and 72 °C for 15 secs; a final elongation at 72 °C for 10 mins; and a pause at 4 °C. PCR products were resolved on 6% Novex TBE PAGE gels (Cat. No. EC6265BOX, ThermoFisher, MA, USA). microRNA and small non-coding RNA fragments of 145-160 bp and 200-300 bp, respectively, were cut from the gel. The microRNA and small non-coding RNA libraries were extracted from the polyacrylamide gel with the MinElute gel extraction kit (Cat. No. 28604, Qiagen, Germany) using an adapted protocol, in which gel slices were dissolved in a diffusion buffer (0.5 M ammonium acetate; 10 mM magnesium acetate; 1 mM EDTA, pH 8.0; 0.1% SDS) overnight at room temperature, followed by 3 hours and 30 min at 50 °C. The libraries were visualized on an Agilent 2100 Bioanalyzer with the Agilent High Sensitivity DNA kit (Cat. No. G2938-90320, Agilent Technologies, Santa Clara, CA) and quantified by quantitative PCR with the KAPA Library Quantification Kit (Master Mix and DNA Standards, Cat. No. KK4824, Roche-KAPA, Basel, Switzerland).
NGS Data analysis. Base calling was performed with the Illumina Real Time Analysis software (RTA, version 1.13.48) and the FASTQ files were generated with CASAVA (version 1.8.1). Secondary data analysis was done using the sRNAbench package 8,11 . Briefly, after adapter trimming and unique read grouping, reads were aligned to the human genome (UCSC hg19) using Bowtie 1.1.2 9 , allowing for one mismatch. To annotate RNA elements that mapped to the human genome, mature and pre-miRNA sequences from miRBase 10 (version 21) were used, and a matrix of counts was created. To process the counts and identify differentially expressed miRNAs we used the edgeR package 12 . Transcripts were considered differentially expressed if their edgeR FDR-adjusted P value was < 0.05.
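As an illustration of the FDR criterion above, the following minimal Python sketch applies a Benjamini-Hochberg adjustment to a list of raw p-values and keeps the features whose adjusted value is below 0.05. It is not the edgeR implementation: edgeR is an R/Bioconductor package, and the function names and example labels used here are hypothetical. The sketch assumes the FDR adjustment is Benjamini-Hochberg, which is the default method in edgeR's topTags.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def differentially_expressed(names, p_values, fdr_cutoff=0.05):
    """Names whose BH-adjusted p-value is below the cutoff (0.05 in the text)."""
    adjusted = benjamini_hochberg(p_values)
    return [name for name, q in zip(names, adjusted) if q < fdr_cutoff]


if __name__ == "__main__":
    # Hypothetical labels and p-values, for illustration only.
    labels = ["feature_A", "feature_B", "feature_C"]
    raw_p = [0.001, 0.5, 0.04]
    print(differentially_expressed(labels, raw_p))
```

Note that a raw p-value of 0.04 does not pass in this toy input: after adjustment for three tests it exceeds the 0.05 cutoff, which is the point of filtering on the adjusted rather than the raw value.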

Methods
Quantitative PCR validation. Twenty-five RNA samples were reanalyzed to validate 24 messenger RNAs and 12 miRNAs. These samples were a subset (n = 25) of the aliquots of the same RNA samples used for sequencing and microarray analysis. Studied genes are summarized in Table 1.
Quantitative PCR. Coding RNAs were amplified using predesigned PrimeTime 5' Nuclease Assays (IDT, Iowa, USA) (assay catalog numbers are in Table 1) and PremixExTaq Probe qPCR mastermix (Cat. No. RR390W; Takara, Japan). miRNAs were quantified using predesigned microRNA LNA PCR Primer sets (Exiqon, Denmark) and SensiMix SYBR Low-ROX Kit (Cat. No. QT625-05, Bioline, UK). Amplification was performed in duplicate on a QuantStudio 7 Flex Real-Time PCR System (Applied Biosystems, Foster City, CA, USA) using 384-well plates.
qPCR data analysis. The raw PCR data was exported from QuantStudio Real-Time PCR Software v1.2 (Applied Biosystems) to an RDML 13 file and imported into LinRegPCR (v2016.1) 14 . LinRegPCR was used to determine PCR efficiencies ($E$) and to calculate the starting concentration per sample ($N_0$). First, the program determines the baseline fluorescence and performs baseline subtraction. Then a Window-of-Linearity is set for all PCR samples per amplicon, and the algorithm determines the mean PCR efficiency per amplicon ($E_{\mathrm{mean}}$), the quantification cycle ($C_q$) value per sample and the fluorescence threshold set to determine the $C_q$ ($N_q$). With these data, $N_0$ is calculated as $N_0 = N_q / (E_{\mathrm{mean}})^{C_q}$.
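The calculation of the starting concentration can be sketched in a few lines of Python. This is only a worked instance of the formula given above, not LinRegPCR code; the function name, the parameter names and the numeric inputs are hypothetical.

```python
def starting_concentration(n_q, e_mean, c_q):
    """Starting concentration N0 = Nq / E_mean ** Cq (arbitrary fluorescence units).

    n_q    -- fluorescence threshold used to determine Cq
    e_mean -- mean PCR efficiency of the amplicon (2.0 = perfect doubling)
    c_q    -- quantification cycle of the sample
    """
    if e_mean <= 1.0:
        # An efficiency of 1 or less would mean no amplification per cycle.
        raise ValueError("PCR efficiency must be greater than 1")
    return n_q / e_mean ** c_q


if __name__ == "__main__":
    # Hypothetical values: threshold 1.0, perfect doubling, Cq of 10.
    print(starting_concentration(1.0, 2.0, 10))  # 1 / 2**10 = 0.0009765625
```

Because the efficiency is raised to the power of $C_q$, small differences in efficiency have a large effect: at a $C_q$ of 25, using an efficiency of 1.8 instead of 2.0 changes the estimate about 14-fold, which is why the efficiency is estimated per amplicon rather than assumed.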

Data Records
Individual miRNA and small-RNA FASTQ files and a tab-delimited file for the processed microarray data have been deposited in the ArrayExpress public repository 5 .

Technical Validation
Quality control. Sample collection. To ensure that the collected biopsy tissue samples would provide high-quality results for microbiology, molecular analysis and histology, a two-round biopsy protocol was followed. During the endoscopy, a first set of biopsy samples was collected for microbiological analysis (in sterile saline) and molecular analysis (in RNAlater), and a second set was fixed in formalin for histopathological examination. By doing this, we ensured that formalin contamination of the biopsy forceps did not interfere with the rapid urease test (RUT) and H. pylori culture. Histological examination was performed by a pathologist specialized in digestive diseases. To increase the total RNA yield, and because intestinal metaplasia is typically present as small mucosal patches, we isolated RNA from two gastric biopsies per anatomical location. The biopsy cores examined by the pathologist are different from the biopsy specimens used for molecular analysis, so using two biopsies made us more confident that, if the pathologist reported intestinal metaplasia in the histology specimens, intestinal metaplasia would also be present in the molecular biology cores. Additionally, two biopsies are the minimum recommended by the Updated Sydney System 18 .
RNA processing. Figure 2 shows the quality control procedures used in this study for RNA integrity, library preparation and sequencing.

Table 1. qPCR primer assays used for mRNA and miRNA validation. (a) Primer assays targeting all splicing variants were chosen for validation purposes and, when possible, in the same exon where the Illumina probe was positioned. (b) PCR efficiency was calculated by LinRegPCR software. Using the raw qPCR data, the algorithm iteratively computes a Window-of-Linearity for a specific amplicon and calculates the $C_q$ and PCR efficiency for each individual reaction and amplicon. NA: not applicable.

GENE Symbol Refseq accession Detects all variants (a) Exon location (a) Mean PCR efficiency (b) Company and catalogue number
Gene expression validation by qPCR. We used LinRegPCR 14 to calculate individual and mean PCR efficiencies. Amplicons showed high PCR efficiencies, ranging from 1.78 to 1.91. PCR inhibition can be detected using individual PCR efficiency values: samples whose PCR efficiency deviated by more than 5% from the mean PCR efficiency per amplicon were excluded. The algorithm also calculates $N_0$, the starting quantity of mRNA or miRNA (expressed in arbitrary fluorescence units). Quantitative $N_0$ values have been used in previous publications 19-24 . Determining $N_0$ has several advantages over relative quantification. First, the selection of a housekeeping gene is often controversial since the expression of all genes is regulated. Second, the expression of a housekeeping gene varies to a greater or lesser extent under experimental conditions 25 . Third, to solve this issue, a quantitative PCR approach with a correction factor according to the starting amount of RNA used in the reverse transcription (i.e. µg of RNA) has been recommended 26 instead of relative quantification.

[Figure panels, labeled by gene symbol and Illumina probe: ACTB ILMN_2152131, AGPAT2 ILMN_1732176, ANXA13 ILMN_2412490, APOB ILMN_1664024, C3 ILMN_1762260, CDX1 ILMN_1815619, CFH ILMN_1810910, CPS1 ILMN_1792748, CREB1 ILMN_2334243, CXCR5 ILMN_2337928, EIF4G2 ILMN_2380946, FUT9 ILMN_1878007, HIPK3 ILMN_1746941, HNF4G ILMN_1743394, IL8 ILMN_1666733, KRT20 ILMN_2219867, MEG3 ILMN_2061435, MTTP ILMN_1774742, MUC2 ILMN_2205622, POFUT1 ILMN_2276758, ILMN_1716651 (gene symbol missing), SDHA ILMN_1744210, TFF3 ILMN_1811387, WDR1 ILMN_1675844.]
... were selected for validation of 25 RNA samples. Aliquots of the same RNA samples were used for sequencing, microarray and qPCR measurements. Raw qPCR data was exported to LinRegPCR software. $N_0$ (an estimate of the target starting concentration per reaction) was calculated using the formula $N_0 = N_q / E^{C_q}$, where $E$ is the amplicon PCR efficiency and $N_q$ is the fluorescence threshold set to determine $C_q$. The Pearson correlation coefficient (R), the p-value and 95% confidence interval are indicated. Additional correlations to genes having multiple probes can be found in ref. 29 .
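The efficiency-based exclusion of samples described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the LinRegPCR procedure: it assumes the criterion is a deviation of more than 5% from the plain arithmetic mean efficiency of all reactions for one amplicon, and the function name and sample labels are hypothetical.

```python
def filter_by_efficiency(efficiencies, tolerance=0.05):
    """Split the reactions of one amplicon by distance from the mean efficiency.

    efficiencies -- dict mapping a sample label to its individual PCR efficiency
    tolerance    -- allowed relative deviation from the mean (0.05 = 5%)
    Returns (mean efficiency, kept reactions, excluded reactions).
    """
    mean_e = sum(efficiencies.values()) / len(efficiencies)
    kept, excluded = {}, {}
    for sample, e in efficiencies.items():
        # Assumption: the deviation is taken relative to the mean of all reactions.
        within = abs(e - mean_e) <= tolerance * mean_e
        (kept if within else excluded)[sample] = e
    return mean_e, kept, excluded


if __name__ == "__main__":
    # Hypothetical efficiencies; "s5" mimics an inhibited reaction.
    per_sample = {"s1": 1.85, "s2": 1.86, "s3": 1.84, "s4": 1.84, "s5": 1.60}
    mean_e, kept, excluded = filter_by_efficiency(per_sample)
    print(sorted(excluded))
```

In this sketch the outlier itself contributes to the mean, so with very few reactions per amplicon a single inhibited sample can shift the reference value; a median, or a mean recomputed after exclusion, would be more robust alternatives.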
To evaluate the concordance in gene expression between microarray or RNA-seq and qPCR, we calculated the correlation between normalized microarray/RNA-seq values and log-transformed qPCR $N_0$ values (Fig. 4). Overall, high R and low p-values (R > 0.8, p < 0.001) were observed between microarray and qPCR measurements. Some correlations were probe dependent (e.g. C3 probe ILMN_1762260: R = 0.79, p < 0.001, whereas C3 probe ILMN_1662523 was not correlated). Five miRNAs showed high correlation (R > 0.7, p < 0.001), 4 were poorly correlated (R ∼ 0.4, p < 0.05) and 3 were not correlated.
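A concordance check of this kind can be sketched with the Python standard library. The sketch computes the Pearson correlation coefficient between two series that are assumed to be already normalized and log-transformed, together with a 95% confidence interval obtained through the Fisher z-transformation. The interval method is an assumption (the text does not state how the intervals were computed), the p-value is omitted, and the function name and input values are hypothetical.

```python
import math
from statistics import NormalDist


def pearson_with_ci(x, y, confidence=0.95):
    """Pearson r between x and y, plus a Fisher z confidence interval."""
    n = len(x)
    if n != len(y) or n < 4:
        raise ValueError("need two equal-length series with at least 4 points")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Clamp to [-1, 1] to absorb floating-point rounding.
    r = max(-1.0, min(1.0, sxy / math.sqrt(sxx * syy)))
    if abs(r) == 1.0:
        return r, r, r  # the z-transform is undefined for a perfect correlation
    z = math.atanh(r)               # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)     # standard error of z
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return r, math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


if __name__ == "__main__":
    # Hypothetical log-scale values for one gene across five samples.
    array_values = [1.0, 2.0, 3.0, 4.0, 5.0]
    qpcr_log_n0 = [1.1, 1.9, 3.2, 3.9, 5.1]
    r, low, high = pearson_with_ci(array_values, qpcr_log_n0)
    print(f"R = {r:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

With only a handful of samples the interval is wide even when R is high, which is worth keeping in mind when comparing probes that differ modestly in R.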

Usage Notes
The raw miRNA and small-RNA sequencing data (FASTQ) and the normalized microarray data can be analysed with a variety of freely accessible packages and platforms, such as R/Bioconductor 27 . Some R/Bioconductor packages can be used without prior programming knowledge through the Galaxy platform 28 .