Data Descriptor | Open | Published:

Multi-year whole-blood transcriptome data for the study of onset and progression of Parkinson’s Disease


Parkinson’s disease (PD) is an age-related, chronic and progressive neurodegenerative disorder characterized by a loss of multifocal neurons, resulting in both non-motor and motor symptoms. While several genetic and environmental contributory risk factors have been identified, more exact methods for diagnosing and assessing prognosis of PD have yet to be established. Here we describe the generation and validation of a dataset comprising whole-blood transcriptomes originally intended for use in detection of blood biomarkers and transcriptomic network changes indicative of PD. Whole-blood samples extracted from both early-stage PD patients and healthy controls were sequenced using no-amplification non-tagging cap analysis of gene expression (nAnT-iCAGE) to analyse differences in global RNA expression patterns across the conditions. Subsequent sampling of a subset of PD patients one-year later provides the opportunity to study changes in transcriptomes arising due to disease progression.

Design Type(s) disease state design • transcription profiling design • parallel group design
Measurement Type(s) transcription profiling assay
Technology Type(s) RNA sequencing
Factor Type(s) age • duration of disease • biological sex • disease severity measurement • experimental condition
Sample Characteristic(s) Homo sapiens • blood

Machine-accessible metadata file describing the reported data (ISA-Tab format)

Background & Summary

Parkinson’s disease (PD) is the second most common neurodegenerative disorder with an average age of onset of 60 years and a prevalence of about 1–2% in industrialized countries1. The overall incidence of the disease is increasing, and projections indicate that there will be three times as many individuals affected by PD2 by 2030. PD is characterized by the loss of dopaminergic neurons of the substantia nigra3, as well as formation of intracellular Lewy bodies consisting primarily of α-synuclein4. The resulting depletion of dopamine (DA) manifests symptoms broadly relating to movement and coordination: resting tremors, bradykinesia, rigidity and postural instability3. Additional non-motor symptoms often precede the more overt motor features by several years, including anosmia, sleep disorders and constipation5. Most PD cases are classified as sporadic, with inherited familial forms of the disease accounting for a mere 5% of all cases3. Though the exact cause is unknown, a combination of genetic predisposition (including mutations in the leucine-rich repeat kinase 2 (LRRK2) gene6,7, α-synuclein8 (SNCA), parkin9,10 (PARK2), PTEN-induced putative kinase 111 (PINK1) and DJ-112 (PARK7)) and environmental factors are thought to be the primary events in disease induction.

With no definitive test for PD, current diagnosis is dependent on clinical observation of overt symptoms. However, overlap with other neuropathological disorders can make accurate diagnosis difficult, leading to misdiagnosis and incorrect treatment plans13,14. Additionally, the presentation of symptoms, especially in early PD, is highly heterogeneous in nature15, further muddying the waters with regards to confidence in individual diagnoses. There is a high demand for diagnostic procedures utilizing clinically-relevant biomarkers of PD: the ability to routinely test for biomarkers through a minimally invasive approach would make for a powerful diagnostic tool. This is especially true for biomarkers indicating the earliest stages of PD, as early diagnosis and intervention will likely lead to better prognostic outcomes, as well as limiting misdiagnoses.

In previous work, we showed how a series of acylcarnitine metabolites could be detected in the blood metabolome profiles of PD patients, serving as a biomarker of PD at its earliest stage16. The sensitive detection of these biomarkers by LC-MS/MS opens up the possibility of diagnosis from blood samples, even in the early stages of PD. Further to this, we decided to investigate whether yet more PD biomarkers could be found within the blood transcriptome. In fact, Infante et al. previously reported differences in transcriptomic expression between LRRK2 G2019S mutated patients and idiopathic PD patients by RNA-seq17,18, indicating the potential of such an analysis for highlighting differences between PD subtypes and between PD and healthy controls.

For this study, we collected whole blood samples from PD patients at an early stage of disease progression and healthy controls, with an aim to identify potent transcriptomic biomarkers at high resolution using an unbiased analysis method. Specifically, we utilized the no-amplification non-tagging cap analysis of gene expression (nAnT-iCAGE) protocol19 to capitalize on the strengths of CAGE-sequencing, namely the ability to determine the RNA expression level of both known and unknown transcripts and the transcription start site (TSS) utilized, as well as prediction of promoter regions20. CAGE-sequencing also limits potential bias-generating steps introduced in the sample preparation of other sequencing methodologies. With the nAnT-iCAGE sequencing protocol in particular there is no need for PCR amplification, which is commonly carried out prior to sequencing and requires post-sequencing computational cleanup to mitigate bias introduction. Further, nAnT-iCAGE avoids the poly-A based enrichment that was carried out in previous transcriptomic analyses of blood in Parkinson’s disease17,18. As a result, with this dataset it is possible to quantify and analyse important non-polyadenylated transcripts, such as bidirectionally transcribed enhancer RNAs21.

The samples collected and described in this paper include 39 PD and 20 control whole blood transcriptome samples22. These samples focus only on early stage PD, but encompass a range of ages and genders of participants, as well as differences in clinical scores (Table 1), and thus may account for some of the heterogeneity seen across PD patients. Additionally, the samples described here were collected over two years, thus allowing for some analysis of disease progression within early PD, in addition to highlighting differences between control and disease conditions.

Table 1 Metadata of all sequenced samples available through the NBDC human database.


Blood sample collection

PD was diagnosed according to the Movement Disorders Society Clinical Diagnostic Criteria for Parkinson’s disease23. Blood samples were collected from 87 PD and 10 control patients in the first year of the study (Y1) and from 67 PD (continuing from Y1) and 10 control patients in the second year (Y2; see Fig. 1a for flow diagram). All blood was collected and immediately stored at −80 °C in PAXgene Blood RNA tubes (PreAnalytiX). From the initial set of PD samples, 30 were pre-selected for RNA extraction (Fig. 1, step 2) on the basis of the following criteria: non-smokers, no significant previous disease, early stage of disease progression (one or two on the Hoehn & Yahr (H&Y)24 scale) and a duration since disease onset of 1–3 years. Going into Y2, 12 of the sequenced Y1 patients remained in the study. Five replacement patient samples were chosen for sequencing from the remaining pool of 67 samples collected in Y2. The initial criteria were relaxed to allow a duration since onset of up to four years, though these new samples were still required to be low on the H&Y scale. Both the stored blood samples from Y1 and the newly collected Y2 samples for these five patients were sequenced along with the other 12 Y2 samples. The use of human blood was approved by the ethics evaluation committee of Juntendo University (Approval Number: 15–104) and the ethics review committee of RIKEN (H26-27). Informed consent was obtained from each participant.

Fig. 1

Study work flow from sample preparation through to sequence processing. (a) Flow chart showing the key stages of the study, and the number of participants going through to final sequencing. Pre-sequencing RNA quality control check used BioAnalyzer, and example results for (b) Ct and (c) PD samples show good quality RNA for library preparation.

CAGE library preparation

RNA was extracted from blood samples using the PAXgene Blood miRNA kit (PreAnalytiX). Following RNA extraction, samples with low quality scores (<6.5 RIN) or low concentrations of RNA (<4.5 μg) were removed (see Technical Validation), leaving 22 PD and 10 control samples in Y1, and 17 PD and 10 control samples in Y2 (Fig. 1a, step 3). It is well documented that the blood transcriptome is highly saturated by globin RNAs25 (predominantly alpha and beta haemoglobins), which has a masking effect on the remaining lower abundance transcripts. To limit this effect, samples remaining after the RNA isolation step were depleted of haemoglobins using the GLOBINclearTM kit (Thermo Fisher Scientific). nAnT-iCAGE libraries were prepared following the protocol described in Murata et al.19. Briefly, 3 ng of total RNAs were used for the synthesis of cDNA with random primers. The cDNAs with an intact 5′-end were captured by streptavidin-coated magnetic beads, ligated to a 5′ linker containing a barcode sequence and further ligated to a 3′ linker. A second strand was synthesized to generate the final dsDNA product used for sequencing. CAGE libraries were sequenced with the 50 bp single-end mode on the Illumina HiSeq 2500 platform.

Read alignment and processing steps

Raw sequencing files available from above22 require processing before data analysis, and what follows is a brief description of the steps involved. Multiplexed reads should be split by barcodes, and ribosomal RNAs removed using rRNAdust v1.06 (in-house scripts, see Code availability). General quality of the FastQ files can be assessed per sample using FastQC26 (see Technical validation). The extracted CAGE tags can then be aligned to the current human reference genome (hg38) using a number of aligners (here STAR27 version 2.5.0a was used; see Technical validation). A genome-wide transcription start site (TSS) map of single-nucleotide resolution can be generated from the 5′ coordinates of the CAGE tags, which can then be used to define distinct TSS peaks (for instance using Paraclu28). Note, the CAGE protocol is known to introduce an additional G nucleotide to the 5′ end of the CAGE tag, so a transformation algorithm must be used to correct for this systematic G-addition bias (see Code availability).

Data Records

All raw nAnT-iCAGE sequencing data (FASTQ files, samples 1–64 corresponding to Y1 and Y2 data) as well as sample metadata are available through the NBDC human database30 ( under accession number JGAS00000000119 (controlled access)22.

Technical Validation

RNA quality control

Extracted RNA was analysed using the Agilent 2100 bioanalyzer, assessing quality and concentration of intact RNA to determine suitability for subsequent sequencing. Example high quality outputs for Y1 control (Fig. 1b) and PD (Fig. 1c) samples are shown. Only samples with a concentration in excess of 4.5 ug and RNA integrity number of 7 or higher were selected for CAGE sequencing.

Read quality and accurate base-calling

FastQC was used to assess the quality of the sequenced reads on a per sample basis, with a focus on the per base sequence quality. FastQC looks at the Phred quality score, calculated by comparing read signals to the probability of accurate base-reading. Phred scores are related to base-calling error probabilities in a logarithmic manner (Q = −10 log10 P), such that scores of 50, 40 and 30 indicate base call accuracies of 99.999%, 99.99% and 99.9% respectively. An example Y1 control sample is shown in Fig. 2a, with an aggregated plot of all Y1 samples generated using MultiQC31 shown in Fig. 2b. Though the Phred scores at the majority of the base positions were high, indicating high accuracy in the assigned base at the given nucleotides, the final base at position 48 had a very low average score of 2. Trimming the reads to exceed a mean Phred score of 30 can be easily carried out (for instance using the FASTQ Quality Trimmer from FASTX32), creating a set of sequences that are of high quality and unambiguous in nature. Trimming in this manner introduces variation in the sequence length, though the majority of reads are over 45 bases in length (Fig. 2c,d) indicating minimal loss from the original 48 base length.

Fig. 2

Post-sequencing quality control of FASTQ files using FastQC. (a) Example FastQC plot for a control sample showing a drop in per base quality scores towards the end of the 50 bp read length. (b) Aggregated FastQC plots reveal this is a widespread phenomenon affecting all of the samples. (c) Trimming sequenced reads based on quality score introduces variety in sequence length distribution, though the majority are still greater than 45 bp in length. (d) After trimming, all samples pass the mean quality score test in FastQC.

CAGE quality control

The GLOBINclearTM kit successfully depleted the Y1 samples of haemoglobins, with the remaining globin mRNAs accounting for around 5.1% of the total sequenced tags (Fig. 3a). The proportion of globin tags in the Y2 sequenced samples was higher, averaging 29.1% of total tag counts, indicating the globin depletion was not as efficient (Fig. 3a). Many of these tags cannot be unambiguously aligned to the genome (so-called multimappers), and thus can be easily removed before downstream analyses. In general, we obtained a high rate of CAGE tags mapping unambiguously to the hg38 human genome using STAR, with an average MAPQ10 count of 5.9 million tags across the two sequencing batches (Fig. 3b). Coupled with the depletion of globin RNAs, this indicates a high-quality set of sequencing samples that can be used for blood transcriptomic analysis of early PD. Furthermore, the samples show a high degree of consensus with FANTOM5 promoters, with an average of 77.8% of all tags overlapping promoter regions (Fig. 3c). One important caveat to make note of is that one of the Y1 control samples had a lower sequencing depth, with total MAPQ10 counts of less than 1 million (Fig. 3b). Despite the fact that this sample clusters separately from the remainder of the control samples, it is still highly similar. For instance, the number of detected promoters for this sample is only slightly reduced compared with the overall promoter mapping rate for control samples (75.85% of FANTOM5 promoters versus 79.79 ± 2.8%; Fig. 3c), showing that the majority of promoters expressed in other control samples are also expressed here.

Fig. 3

Mapping statistics and quality control of CAGE data. (a) Percentage of all sequenced CAGE tags (including multimapping tags) originating from haemoglobin genes. (b) Number of high quality, unambiguously mapping tags across all the samples. (c) Percentage of the MAPQ10 tags that overlap with the FANTOM533 promoter regions.

Code Availability

A number of in-house scripts are commonly used for the processing of the raw FASTQ files before alignment as well as for correction of the CAGE specific sequencing bias mentioned above (and described in more detail in supplementary note 3-e of Carninci et al.20). A brief description of these scripts follows: splitByBarcode is used to split multiplexed sequences into constituent sample FASTQ files and can be found in the MOIRAI system29; rRNAdust removes all sequences that match to known rRNA sequences with two or fewer errors and is freely available through the FANTOM5 website (; starbam2gcorrectedctss is a shell script used to convert BAM files to CTSS bed files, correcting for any additional Gs at the 5′ end, and is available upon request.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


  1. 1.

    de Lau, L. M. L. & Breteler, M. M. B. Epidemiology of Parkinson’s disease. Lancet. Neurol. 5, 525–35 (2006).

  2. 2.

    Savica, R., Grossardt, B. R., Bower, J. H., Eric Ahlskog, J. & Rocca, W. A. Time trends in the incidence of parkinson disease. JAMA Neurol 73, 981–989 (2016).

  3. 3.

    Dauer, W. & Przedborski, S. Parkinson’s disease: Mechanisms and models. Neuron 39, 889–909 (2003).

  4. 4.

    Spillantini, M. G. et al. alpha-Synuclein in Lewy bodies. Nature 388, 839–840 (1997).

  5. 5.

    Chaudhuri, K. R., Healy, D. G. & Schapira, A. H. Non-motor symptoms of Parkinson’s disease: diagnosis and management. Lancet Neurol. 5, 235–245 (2006).

  6. 6.

    Di Fonzo, A. et al. A frequent LRRK2 gene mutation associated with autosomal dominant Parkinson’s disease. Lancet 365, 412–415 (2005).

  7. 7.

    Gilks, W. P. et al. A common LRRK2 mutation in idiopathic Parkinson’s disease. Lancet 365, 415–416 (2005).

  8. 8.

    Golbe, L. I., Di Iorio, G., Bonavita, V., Miller, D. C. & Duvoisin, R. C. A large kindred with autosomal dominant Parkinson’s disease. Ann. Neurol. 27, 276–282 (1990).

  9. 9.

    Kitada, T. et al. Mutations in the parkin gene cause autosomal recessive juvenile parkinsonism. Nature 392, 605–608 (1998).

  10. 10.

    Hattori, N. et al. Molecular genetic analysis of a novel Parkin gene in Japanese families with autosomal recessive juvenile parkinsonism: evidence for variable homozygous deletions in the Parkin gene in affected individuals. Ann. Neurol. 44, 935–41 (1998).

  11. 11.

    Valente, E. M. et al. Hereditary early-onset Parkinson’s disease caused by mutations in PINK1. Science 304, 1158–1160 (2004).

  12. 12.

    Bonifati, V. et al. Mutations in the DJ-1 gene associated with autosomal recessive early-onset parkinsonism. Science 299, 256–259 (2003).

  13. 13.

    Rajput, A. H., Rozdilsky, B. & Rajput, A. Accuracy of clinical diagnosis in parkinsonism–a prospective study. Can. J. Neurol. Sci. 18, 275–8 (1991).

  14. 14.

    Hughes, A. J., Daniel, S. E. & Lees, A. J. Improved accuracy of clinical diagnosis of Lewy body Parkinson’s disease. Neurology 57, 1497–1499 (2001).

  15. 15.

    Lewis, S. J. G. et al. Heterogeneity of Parkinson’s disease in the early clinical stages using a data driven approach. J. Neurol. Neurosurg. Psychiatry, (2005).

  16. 16.

    Saiki, S. et al. Decreased long-chain acylcarnitines from insufficient β-oxidation as potential early diagnostic markers for Parkinson’s disease. Sci. Rep. 7, 1–15 (2017).

  17. 17.

    Infante, J. et al. Identification of candidate genes for Parkinson’s disease through blood transcriptome analysis in LRRK2-G2019S carriers, idiopathic cases, and controls. Neurobiol. Aging 36, 1105–1109 (2015).

  18. 18.

    Infante, J. et al. Comparative blood transcriptome analysis in idiopathic and LRRK2 G2019S-associated Parkinson’s disease. Neurobiol. Aging 38, 214.e1-214.e5 (2016).

  19. 19.

    Murata, M. et al. Detecting expressed genes using CAGE. Methods Mol. Biol. 1164, 67–85 (2014).

  20. 20.

    Carninci, P. et al. Genome-wide analysis of mammalian promoter architecture and evolution. Nat. Genet. 38, 626–635 (2006).

  21. 21.

    Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461 (2014).

  22. 22.

    Identification of RNA biomarkers in Parkinson’s disease patients. DNA DataBank of Japan, (2018).

  23. 23.

    Postuma, R. B. et al. MDS clinical diagnostic criteria for Parkinson’s disease. Movement Disorders 30, 1591–1601 (2015).

  24. 24.

    Hoehn, M. M. & Yahr, M. D. Parkinsonism: onset, progression, and mortality Parkinsonism: onset, progression, and mortality. Neurology 17, 427–442 (1967).

  25. 25.

    Shin, H. et al. Variation in RNA-Seq transcriptome profiles of peripheral whole blood from healthy individuals with and without globin depletion. PLoS One 9, 1–11 (2014).

  26. 26.

    Andrews, S. FastQC, version 0.11.8. Babraham Bioinformatics, (2010).

  27. 27.

    Dobin, A. et al. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).

  28. 28.

    Frith, M. C. et al. A code for transcription initiation in mammalian genomes. Genome Res. 18, 1–12 (2008).

  29. 29.

    Hasegawa, A., Daub, C., Carninci, P., Hayashizaki, Y. & Lassmann, T. MOIRAI: a compact workflow system for CAGE analysis. BMC Bioinformatics 15, 144 (2014).

  30. 30.

    Kodama, Y. et al. The DDBJ Japanese genotype-phenotype archive for genetic and phenotypic human data. Nucleic Acids Res. 43, D18–D22 (2015).

  31. 31.

    Ewels, P., Magnusson, M., Lundin, S. & Käller, M. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32, 3047–3048 (2016).

  32. 32.

    Gordon, A., Hannon, G. J. & Gordon. FASTX-Toolkit. Hannon Lab, (2014).

  33. 33.

    Forrest, A. R. R. et al. A promoter-level mammalian expression atlas. Nature 507, 462–470 (2014).

Download references


We would like to thank the Genome Network Analysis Support (GeNAS) Facility at the former RIKEN Center for Life Science Technologies (now RIKEN-IMS) for preparing and sequencing the nAnT-iCAGE libraries. This work was supported by an AMED-CREST grant awarded to N.H. by the Japan Agency for Medical Research and Development.

Author information

S.S., N.H. and P.C. were responsible for the concept and design of the study. M.N.Z.V. performed data processing and sequencing analysis, and wrote the paper. K.H. aided in analyses and helped revise the manuscript. S.S. and K.I. collected and characterized blood samples. All authors read and commented on drafts of the manuscript and approved the final submitted manuscript.

Competing Interests

The authors declare no competing interests.

Correspondence to Piero Carninci.

ISA-Tab metadata file

  1. Download metadata file

Rights and permissions

Creative Commons BY

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit

The Creative Commons Public Domain Dedication waiver applies to the metadata files associated with this article.

About this article