Multi-year whole-blood transcriptome data for the study of onset and progression of Parkinson’s Disease

Parkinson’s disease (PD) is an age-related, chronic and progressive neurodegenerative disorder characterized by a loss of multifocal neurons, resulting in both non-motor and motor symptoms. While several genetic and environmental contributory risk factors have been identified, more exact methods for diagnosing and assessing prognosis of PD have yet to be established. Here we describe the generation and validation of a dataset comprising whole-blood transcriptomes originally intended for use in detection of blood biomarkers and transcriptomic network changes indicative of PD. Whole-blood samples extracted from both early-stage PD patients and healthy controls were sequenced using no-amplification non-tagging cap analysis of gene expression (nAnT-iCAGE) to analyse differences in global RNA expression patterns across the conditions. Subsequent sampling of a subset of PD patients one year later provides the opportunity to study changes in transcriptomes arising due to disease progression.

biomarkers by LC-MS/MS opens up the possibility of diagnosis from blood samples, even in the early stages of PD. Further to this, we decided to investigate whether yet more PD biomarkers could be found within the blood transcriptome. In fact, Infante et al. previously reported differences in transcriptomic expression between LRRK2 G2019S mutated patients and idiopathic PD patients by RNA-seq 17,18 , indicating the potential of such an analysis for highlighting differences between PD subtypes and between PD and healthy controls.
For this study, we collected whole blood samples from PD patients at an early stage of disease progression and healthy controls, with an aim to identify potent transcriptomic biomarkers at high resolution using an unbiased analysis method. Specifically, we utilized the no-amplification non-tagging cap analysis of gene expression (nAnT-iCAGE) protocol 19 to capitalize on the strengths of CAGE-sequencing, namely the ability to determine the RNA expression level of both known and unknown transcripts and the transcription start site (TSS) utilized, as well as prediction of promoter regions 20 . CAGE-sequencing also limits potential bias-generating steps introduced in the sample preparation of other sequencing methodologies. With the nAnT-iCAGE sequencing protocol in particular there is no need for PCR amplification, which is commonly carried out prior to sequencing and requires post-sequencing computational cleanup to mitigate bias introduction. Further, nAnT-iCAGE avoids the poly-A based enrichment that was carried out in previous transcriptomic analyses of blood in Parkinson's disease 17,18 . As a result, with this dataset it is possible to quantify and analyse important non-polyadenylated transcripts, such as bidirectionally transcribed enhancer RNAs 21 .
The samples collected and described in this paper include 39 PD and 20 control whole blood transcriptome samples 22 . These samples focus only on early stage PD, but encompass a range of ages and genders of participants, as well as differences in clinical scores (Table 1), and thus may account for some of the heterogeneity seen across PD patients. Additionally, the samples described here were collected over two years, thus allowing for some analysis of disease progression within early PD, in addition to highlighting differences between control and disease conditions.

Blood sample collection. PD was diagnosed according to the Movement Disorders Society Clinical Diagnostic Criteria for Parkinson's disease 23 . Blood samples were collected from 87 PD and 10 control patients in the first year of the study (Y1) and from 67 PD (continuing from Y1) and 10 control patients in the second year (Y2; see Fig. 1a for flow diagram). All blood was collected and immediately stored at −80 °C in PAXgene Blood RNA tubes (PreAnalytiX). From the initial set of PD samples, 30 were pre-selected for RNA extraction (Fig. 1a, step 2) on the basis of the following criteria: non-smokers, no significant previous disease, early stage of disease progression (one or two on the Hoehn & Yahr (H&Y) 24 scale) and a duration since disease onset of 1-3 years. Going into Y2, 12 of the sequenced Y1 patients remained in the study. Five replacement patient samples were chosen for sequencing from the remaining pool of 67 samples collected in Y2. The initial criteria were relaxed to allow a duration since onset of up to four years, though these new samples were still required to be low on the H&Y scale. Both the stored blood samples from Y1 and the newly collected Y2 samples for these five patients were sequenced along with the other 12 Y2 samples. The use of human blood was approved by the ethics evaluation committee of Juntendo University (Approval Number: 15-104) and the ethics review committee of RIKEN (H26-27). Informed consent was obtained from each participant.
CAGE library preparation. RNA was extracted from blood samples using the PAXgene Blood miRNA kit (PreAnalytiX). Following RNA extraction, samples with low quality scores (<7 RIN) or low concentrations of RNA (<4.5 μg) were removed (see Technical Validation), leaving 22 PD and 10 control samples in Y1, and 17 PD and 10 control samples in Y2 (Fig. 1a, step 3). It is well documented that the blood transcriptome is highly saturated by globin RNAs 25 (predominantly alpha and beta haemoglobins), which has a masking effect on the remaining lower abundance transcripts. To limit this effect, samples remaining after the RNA isolation step were depleted of haemoglobins using the GLOBINclear™ kit (Thermo Fisher Scientific). nAnT-iCAGE libraries were prepared following the protocol described in Murata et al. 19 . Briefly, 3 ng of total RNAs were used for the synthesis of cDNA with random primers. The cDNAs with an intact 5′-end were captured by streptavidin-coated magnetic beads, ligated to a 5′ linker containing a barcode sequence and further ligated to a 3′ linker. A second strand was synthesized to generate the final dsDNA product used for sequencing. CAGE libraries were sequenced with the 50 bp single-end mode on the Illumina HiSeq 2500 platform.
Read alignment and processing steps. The raw sequencing files 22 require processing before data analysis; what follows is a brief description of the steps involved. Multiplexed reads should be split by barcode, and ribosomal RNAs removed using rRNAdust v1.06 (in-house scripts, see Code availability). General quality of the FASTQ files can be assessed per sample using FastQC 26 (see Technical Validation). The extracted CAGE tags can then be aligned to the current human reference genome (hg38) using any of a number of aligners (here STAR 27 version 2.5.0a was used; see Technical Validation). A genome-wide transcription start site (TSS) map at single-nucleotide resolution can be generated from the 5′ coordinates of the CAGE tags, which can then be used to define distinct TSS peaks (for instance using Paraclu 28 ). Note that the CAGE protocol is known to introduce an additional G nucleotide at the 5′ end of the CAGE tag, so a transformation algorithm must be used to correct for this systematic G-addition bias (see Code availability).
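As an illustration of the TSS-map step, the following minimal Python sketch counts CAGE tags per 5′ coordinate (a CTSS table) from already-aligned reads. It is not the in-house pipeline: the `Alignment` record, the function names and the example coordinates are hypothetical, the G-addition correction is deliberately omitted, and the MAPQ cut-off of 10 is an assumption chosen to mirror the MAPQ10 counts reported in Technical Validation.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Alignment(NamedTuple):
    """Hypothetical minimal record for one aligned CAGE tag."""
    chrom: str
    start: int   # 0-based leftmost aligned position on the forward strand
    end: int     # 0-based, exclusive rightmost aligned position
    strand: str  # "+" or "-"
    mapq: int    # mapping quality reported by the aligner


def five_prime_position(aln: Alignment) -> int:
    """Return the 0-based genomic coordinate of the tag's 5' end."""
    # On the minus strand the 5' end is the rightmost aligned base.
    return aln.start if aln.strand == "+" else aln.end - 1


def count_ctss(alignments: Iterable[Alignment], min_mapq: int = 10) -> Counter:
    """Count tags per (chrom, 5' position, strand), skipping low-MAPQ tags."""
    counts: Counter = Counter()
    for aln in alignments:
        if aln.mapq < min_mapq:
            continue  # ambiguous alignments (e.g. multimappers) are dropped
        counts[(aln.chrom, five_prime_position(aln), aln.strand)] += 1
    return counts


# Toy input: three confidently mapped tags and one low-MAPQ tag.
tags = [
    Alignment("chr1", 100, 148, "+", 255),
    Alignment("chr1", 100, 147, "+", 255),
    Alignment("chr1", 60, 108, "-", 255),
    Alignment("chr1", 100, 148, "+", 3),
]
ctss = count_ctss(tags)
for (chrom, pos, strand), n in sorted(ctss.items()):
    print(chrom, pos, strand, n)
```

The resulting per-position counts are the kind of input that a peak-calling step such as Paraclu would then cluster into TSS peaks.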

Data Records
All raw nAnT-iCAGE sequencing data (FASTQ files, samples 1-64 corresponding to Y1 and Y2 data) as well as sample metadata are available through the NBDC human database 30 .

Technical Validation
RNA quality control. Extracted RNA was analysed using the Agilent 2100 bioanalyzer, assessing quality and concentration of intact RNA to determine suitability for subsequent sequencing. Example high quality outputs for Y1 control (Fig. 1b) and PD (Fig. 1c) samples are shown. Only samples with a concentration in excess of 4.5 μg and RNA integrity number of 7 or higher were selected for CAGE sequencing.
Read quality and accurate base-calling. FastQC was used to assess the quality of the sequenced reads on a per sample basis, with a focus on the per base sequence quality. FastQC looks at the Phred quality score, calculated by comparing read signals to the probability of accurate base-reading. Phred scores are related to base-calling error probabilities in a logarithmic manner (Q = −10 log₁₀ P), such that scores of 50, 40 and 30 indicate base call accuracies of 99.999%, 99.99% and 99.9% respectively. An example Y1 control sample is shown in Fig. 2a, with an aggregated plot of all Y1 samples generated using MultiQC 31 shown in Fig. 2b. Though the Phred scores at the majority of the base positions were high, indicating high accuracy in the assigned base at the given nucleotides, the final base at position 48 had a very low average score of 2. Trimming the reads to exceed a mean Phred score of 30 can be easily carried out (for instance using the FASTQ Quality Trimmer from FASTX 32 ), creating a set of sequences that are of high quality and unambiguous in nature. Trimming in this manner introduces variation in the sequence length, though the majority of reads are over 45 bases in length (Fig. 2c,d) indicating minimal loss from the original 48 base length.
CAGE quality control. The GLOBINclear™ kit successfully depleted the Y1 samples of haemoglobins, with the remaining globin mRNAs accounting for around 5.1% of the total sequenced tags (Fig. 3a). The proportion of globin tags in the Y2 sequenced samples was higher, averaging 29.1% of total tag counts, indicating the globin depletion was not as efficient (Fig. 3a). Many of these tags cannot be unambiguously aligned to the genome (so-called multimappers), and thus can be easily removed before downstream analyses. In general, we obtained a high rate of CAGE tags mapping unambiguously to the hg38 human genome using STAR, with an average MAPQ10 count of 5.9 million tags across the two sequencing batches (Fig. 3b). Coupled with the depletion of globin RNAs, this indicates a high-quality set of sequencing samples that can be used for blood transcriptomic analysis of early PD. Furthermore, the samples show a high degree of consensus with FANTOM5 promoters, with an average of 77.8% of all tags overlapping promoter regions (Fig. 3c). One important caveat to make note of is that one of the Y1 control samples had a lower sequencing depth, with total MAPQ10 counts of less than 1 million (Fig. 3b). Despite the fact that this sample clusters separately from the remainder of the control samples, it is still highly similar. For instance, the number of detected promoters for this sample is only slightly reduced compared with the overall promoter mapping rate for control samples (75.85% of FANTOM5 promoters versus 79.79 ± 2.8%; Fig. 3c), showing that the majority of promoters expressed in other control samples are also expressed here.
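To make the Phred relationship used in the read-quality assessment concrete, the short Python sketch below converts scores to error probabilities and trims low-quality bases from the 3′ end of a read. It is an illustrative assumption, not the FASTX implementation: the function names are hypothetical, and the trimming rule (drop trailing bases below a per-base threshold) is a simplification of trimming to a target mean quality.

```python
from typing import List, Optional, Tuple


def phred_to_error(q: float) -> float:
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def phred_to_accuracy(q: float) -> float:
    """Base-call accuracy (1 - P) for a Phred score Q."""
    return 1.0 - phred_to_error(q)


def trim_3prime(
    seq: str, quals: List[int], threshold: int = 30, min_len: int = 1
) -> Optional[Tuple[str, List[int]]]:
    """Remove trailing bases whose quality is below `threshold`.

    Returns the trimmed (sequence, qualities), or None when fewer than
    `min_len` bases remain.
    """
    end = len(quals)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    if end < min_len:
        return None
    return seq[:end], quals[:end]


# A score of 30 corresponds to a 1-in-1000 error rate, whereas a score
# of 2 means the base call is more likely wrong than right.
print(round(phred_to_accuracy(30), 4))
print(round(phred_to_accuracy(2), 4))

# Toy 48-base read whose final base has a quality of 2.
read_seq = "A" * 47 + "C"
read_quals = [38] * 47 + [2]
trimmed = trim_3prime(read_seq, read_quals)
print(len(trimmed[0]))
```

Under this rule a read with a single poor final base loses only that base, which is in line with most trimmed reads remaining longer than 45 bases.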

Code availability
A number of in-house scripts are commonly used for the processing of the raw FASTQ files before alignment as well as for correction of the CAGE specific sequencing bias mentioned above (and described in more detail in supplementary note 3-e of Carninci et al. 20 ). A brief description of these scripts follows: splitByBarcode is used to split multiplexed sequences into constituent sample FASTQ files and can be found in the MOIRAI system 29 ; rRNAdust removes all sequences that match to known rRNA sequences with two or fewer errors and is freely available through the FANTOM5 website (http://fantom.gsc.riken.jp/5/suppl/rRNAdust/); starbam2gcorrectedctss is a shell script used to convert BAM files to CTSS bed files, correcting for any additional Gs at the 5′ end, and is available upon request.
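For readers without access to the in-house script, the idea behind the G correction can be sketched in a few lines of Python. This is a deliberately simplified heuristic and not the method implemented in starbam2gcorrectedctss or the probabilistic approach described by Carninci et al.: it assumes end-to-end alignments (no soft-clipping), it shifts the 5′ position only when the first read base is a G that mismatches the reference, and it leaves the ambiguous case of a genome-matching G untouched. The function name and arguments are hypothetical.

```python
def corrected_five_prime(
    start: int,
    end: int,
    strand: str,
    read_first_base: str,
    ref_base_at_five_prime: str,
) -> int:
    """Return a 0-based 5' coordinate, skipping a non-templated leading G.

    start, end: 0-based half-open aligned interval on the forward strand.
    read_first_base: first base of the tag as sequenced (read orientation).
    ref_base_at_five_prime: reference base under that first read base,
        given in the same orientation as the read (assumption: the caller
        has already reverse-complemented it for minus-strand tags).
    """
    pos = start if strand == "+" else end - 1
    extra_g = read_first_base.upper() == "G" and ref_base_at_five_prime.upper() != "G"
    if extra_g:
        # Move one base downstream in the direction of transcription.
        pos += 1 if strand == "+" else -1
    return pos


# Plus-strand tag with a mismatching leading G: the TSS moves from 100 to 101.
print(corrected_five_prime(100, 148, "+", "G", "A"))
# Minus-strand tag with a mismatching leading G: the TSS moves from 199 to 198.
print(corrected_five_prime(152, 200, "-", "G", "T"))
# Leading G that matches the genome: left unchanged by this heuristic.
print(corrected_five_prime(100, 148, "+", "G", "G"))
```

Because a genome-matching G may still be non-templated, a real correction needs to estimate how often this happens, which is why this sketch should be read only as an explanation of the bias.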

Acknowledgements
We would like to thank the Genome Network Analysis Support (GeNAS) Facility at the former RIKEN Center for Life Science Technologies (now RIKEN-IMS) for preparing and sequencing the nAnT-iCAGE libraries. This work was supported by an AMED-CREST grant awarded to N.H. by the Japan Agency for Medical Research and Development.

Author Contributions
S.S., N.H. and P.C. were responsible for the concept and design of the study. M.N.Z.V. performed data processing and sequencing analysis, and wrote the paper. K.H. aided in analyses and helped revise the manuscript. S.S. and K.I. collected and characterized blood samples. All authors read and commented on drafts of the manuscript and approved the final submitted manuscript.