Background & Summary

Parkinson’s disease (PD) is the second most common neurodegenerative disorder, with an average age of onset of 60 years and a prevalence of about 1–2% in industrialized countries1. The overall incidence of the disease is increasing, and projections indicate that there will be three times as many individuals affected by PD2 by 2030. PD is characterized by the loss of dopaminergic neurons of the substantia nigra3, as well as the formation of intracellular Lewy bodies consisting primarily of α-synuclein4. The resulting depletion of dopamine (DA) manifests as symptoms broadly relating to movement and coordination: resting tremors, bradykinesia, rigidity and postural instability3. Additional non-motor symptoms, including anosmia, sleep disorders and constipation, often precede the more overt motor features by several years5. Most PD cases are classified as sporadic, with inherited familial forms of the disease accounting for a mere 5% of all cases3. Though the exact cause is unknown, a combination of genetic predisposition (including mutations in the leucine-rich repeat kinase 2 (LRRK2) gene6,7, α-synuclein (SNCA)8, parkin (PARK2)9,10, PTEN-induced putative kinase 1 (PINK1)11 and DJ-1 (PARK7)12) and environmental factors is thought to be the primary driver of disease induction.

With no definitive test for PD, current diagnosis depends on clinical observation of overt symptoms. However, overlap with other neuropathological disorders can make accurate diagnosis difficult, leading to misdiagnosis and incorrect treatment plans13,14. Additionally, the presentation of symptoms, especially in early PD, is highly heterogeneous15, further reducing confidence in individual diagnoses. There is a high demand for diagnostic procedures that use clinically relevant biomarkers of PD: the ability to routinely test for biomarkers through a minimally invasive approach would make for a powerful diagnostic tool. This is especially true for biomarkers indicating the earliest stages of PD, as early diagnosis and intervention will likely lead to better prognostic outcomes and fewer misdiagnoses.

In previous work, we showed that a series of acylcarnitine metabolites could be detected in the blood metabolome profiles of PD patients, serving as a biomarker of PD at its earliest stage16. The sensitive detection of these biomarkers by LC-MS/MS opens up the possibility of diagnosis from blood samples, even in the early stages of PD. Building on this, we decided to investigate whether further PD biomarkers could be found within the blood transcriptome. Indeed, Infante et al. previously used RNA-seq to report differences in transcriptomic expression between patients carrying the LRRK2 G2019S mutation and idiopathic PD patients17,18, indicating the potential of such an analysis for highlighting differences between PD subtypes and between PD and healthy controls.

For this study, we collected whole blood samples from PD patients at an early stage of disease progression and from healthy controls, with the aim of identifying potent transcriptomic biomarkers at high resolution using an unbiased analysis method. Specifically, we utilized the no-amplification non-tagging cap analysis of gene expression (nAnT-iCAGE) protocol19 to capitalize on the strengths of CAGE-sequencing, namely the ability to determine the RNA expression level of both known and unknown transcripts and the transcription start site (TSS) utilized, as well as prediction of promoter regions20. CAGE-sequencing also limits the potential bias-generating steps introduced in the sample preparation of other sequencing methodologies. With the nAnT-iCAGE sequencing protocol in particular, there is no need for PCR amplification, which is commonly carried out prior to sequencing and requires post-sequencing computational cleanup to mitigate the bias it introduces. Further, nAnT-iCAGE avoids the poly-A based enrichment that was carried out in previous transcriptomic analyses of blood in Parkinson’s disease17,18. As a result, with this dataset it is possible to quantify and analyse important non-polyadenylated transcripts, such as bidirectionally transcribed enhancer RNAs21.

The samples collected and described in this paper include 39 PD and 20 control whole blood transcriptome samples22. These samples focus only on early stage PD, but encompass a range of ages and genders of participants, as well as differences in clinical scores (Table 1), and thus may account for some of the heterogeneity seen across PD patients. Additionally, the samples described here were collected over two years, thus allowing for some analysis of disease progression within early PD, in addition to highlighting differences between control and disease conditions.

Table 1 Metadata of all sequenced samples available through the NBDC human database.

Methods

Blood sample collection

PD was diagnosed according to the Movement Disorders Society Clinical Diagnostic Criteria for Parkinson’s disease23. Blood samples were collected from 87 PD patients and 10 control participants in the first year of the study (Y1) and from 67 PD patients (continuing from Y1) and 10 control participants in the second year (Y2; see Fig. 1a for flow diagram). All blood was collected and immediately stored at −80 °C in PAXgene Blood RNA tubes (PreAnalytiX). From the initial set of PD samples, 30 were pre-selected for RNA extraction (Fig. 1, step 2) on the basis of the following criteria: non-smokers, no significant previous disease, early stage of disease progression (one or two on the Hoehn & Yahr (H&Y)24 scale) and a duration since disease onset of 1–3 years. Going into Y2, 12 of the sequenced Y1 patients remained in the study. Five replacement patient samples were chosen for sequencing from the remaining pool of 67 samples collected in Y2. The initial criteria were relaxed to allow a duration since onset of up to four years, though these new samples were still required to be low on the H&Y scale. For these five patients, both the stored blood samples from Y1 and the newly collected Y2 samples were sequenced, along with the other 12 Y2 samples. The use of human blood was approved by the ethics evaluation committee of Juntendo University (Approval Number: 15–104) and the ethics review committee of RIKEN (H26-27). Informed consent was obtained from each participant.

Fig. 1

Study workflow from sample preparation through to sequence processing. (a) Flow chart showing the key stages of the study, and the number of participants going through to final sequencing. The pre-sequencing RNA quality control check used the Bioanalyzer, and example results for (b) control (Ct) and (c) PD samples show good quality RNA for library preparation.

CAGE library preparation

RNA was extracted from blood samples using the PAXgene Blood miRNA kit (PreAnalytiX). Following RNA extraction, samples with low quality scores (<7 RIN) or low amounts of RNA (<4.5 μg) were removed (see Technical Validation), leaving 22 PD and 10 control samples in Y1, and 17 PD and 10 control samples in Y2 (Fig. 1a, step 3). It is well documented that the blood transcriptome is highly saturated by globin RNAs25 (predominantly alpha and beta haemoglobins), which has a masking effect on the remaining lower abundance transcripts. To limit this effect, samples remaining after the RNA isolation step were depleted of haemoglobins using the GLOBINclear™ kit (Thermo Fisher Scientific). nAnT-iCAGE libraries were prepared following the protocol described in Murata et al.19. Briefly, 3 ng of total RNA was used for the synthesis of cDNA with random primers. The cDNAs with an intact 5′-end were captured by streptavidin-coated magnetic beads, ligated to a 5′ linker containing a barcode sequence and further ligated to a 3′ linker. A second strand was synthesized to generate the final dsDNA product used for sequencing. CAGE libraries were sequenced in 50 bp single-end mode on the Illumina HiSeq 2500 platform.

Read alignment and processing steps

The raw sequencing files22 require processing before data analysis, and what follows is a brief description of the steps involved. Multiplexed reads should be split by barcode, and ribosomal RNAs removed using rRNAdust v1.06 (in-house scripts, see Code availability). The general quality of the FASTQ files can be assessed per sample using FastQC26 (see Technical Validation). The extracted CAGE tags can then be aligned to the current human reference genome (hg38) using any of a number of aligners (here STAR27 version 2.5.0a was used; see Technical Validation). A genome-wide transcription start site (TSS) map of single-nucleotide resolution can be generated from the 5′ coordinates of the CAGE tags, which can then be used to define distinct TSS peaks (for instance using Paraclu28). Note that the CAGE protocol is known to introduce an additional G nucleotide at the 5′ end of the CAGE tag, so a transformation algorithm must be used to correct for this systematic G-addition bias (see Code availability).
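As an illustration of how a single-nucleotide TSS map follows from the 5′ coordinates of aligned tags, the step can be sketched in Python as below. This is a minimal sketch and not the in-house pipeline: the `Alignment` record, the function names and the example coordinates are hypothetical, alignments are assumed to be already available as BED-like 0-based half-open intervals, and the G-addition correction and peak calling (e.g. Paraclu) are deliberately left out.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Alignment(NamedTuple):
    """One aligned CAGE tag in BED-like, 0-based half-open coordinates (hypothetical record)."""
    chrom: str
    start: int   # leftmost aligned base (inclusive)
    end: int     # one past the rightmost aligned base (exclusive)
    strand: str  # "+" or "-"


def five_prime_position(aln: Alignment) -> int:
    """Return the 0-based genomic position of the tag's 5' end."""
    if aln.strand == "+":
        return aln.start          # 5' end is the leftmost base on the plus strand
    if aln.strand == "-":
        return aln.end - 1        # 5' end is the rightmost base on the minus strand
    raise ValueError(f"unexpected strand: {aln.strand!r}")


def count_ctss(alignments: Iterable[Alignment]) -> Counter:
    """Count tags per (chrom, 5' position, strand), i.e. per candidate TSS."""
    counts: Counter = Counter()
    for aln in alignments:
        counts[(aln.chrom, five_prime_position(aln), aln.strand)] += 1
    return counts


# Made-up example tags: two plus-strand tags sharing a 5' end, one minus-strand tag.
tags = [
    Alignment("chr1", 100, 148, "+"),
    Alignment("chr1", 100, 147, "+"),
    Alignment("chr1", 60, 108, "-"),
]
ctss = count_ctss(tags)
```

The resulting counter maps each strand-specific position to its tag count, which is the kind of input a clustering tool would then group into TSS peaks.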

Data Records

All raw nAnT-iCAGE sequencing data (FASTQ files, samples 1–64 corresponding to Y1 and Y2 data) as well as sample metadata are available through the NBDC human database30 (https://humandbs.biosciencedbc.jp/en/) under accession number JGAS00000000119 (controlled access)22.

Technical Validation

RNA quality control

Extracted RNA was analysed using the Agilent 2100 Bioanalyzer, assessing the quality and amount of intact RNA to determine suitability for subsequent sequencing. Example high quality outputs for Y1 control (Fig. 1b) and PD (Fig. 1c) samples are shown. Only samples with more than 4.5 μg of RNA and an RNA integrity number (RIN) of 7 or higher were selected for CAGE sequencing.
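The selection rule stated above can be written as a simple filter. The following Python sketch is illustrative only: the `RnaSample` record, the sample identifiers and the measured values are invented for the example, and the thresholds are taken from the sentence above (RIN of 7 or higher, more than 4.5 μg of RNA).

```python
from typing import List, NamedTuple


class RnaSample(NamedTuple):
    """Bioanalyzer summary for one extracted sample (hypothetical record)."""
    sample_id: str
    rin: float      # RNA integrity number
    rna_ug: float   # amount of RNA in micrograms


def passes_rna_qc(sample: RnaSample, min_rin: float = 7.0, min_rna_ug: float = 4.5) -> bool:
    """True if RIN is at least min_rin and the RNA amount exceeds min_rna_ug."""
    return sample.rin >= min_rin and sample.rna_ug > min_rna_ug


def select_for_sequencing(samples: List[RnaSample]) -> List[RnaSample]:
    """Keep only the samples that satisfy both QC thresholds."""
    return [s for s in samples if passes_rna_qc(s)]


# Invented example values, not taken from the dataset.
candidates = [
    RnaSample("example_A", rin=8.2, rna_ug=6.0),   # passes both thresholds
    RnaSample("example_B", rin=6.4, rna_ug=7.5),   # fails on RIN
    RnaSample("example_C", rin=7.0, rna_ug=4.5),   # fails: amount is not above 4.5
]
selected = select_for_sequencing(candidates)
```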

Read quality and accurate base-calling

FastQC was used to assess the quality of the sequenced reads on a per sample basis, with a focus on the per base sequence quality. FastQC reports the Phred quality score, which encodes the estimated probability that a base was called incorrectly. Phred scores are related to base-calling error probabilities in a logarithmic manner (Q = −10 log10 P, where P is the error probability), such that scores of 50, 40 and 30 indicate base call accuracies of 99.999%, 99.99% and 99.9% respectively. An example Y1 control sample is shown in Fig. 2a, with an aggregated plot of all Y1 samples generated using MultiQC31 shown in Fig. 2b. Though the Phred scores at the majority of the base positions were high, indicating high accuracy in the assigned base at the given nucleotides, the final base at position 48 had a very low average score of 2. Trimming the reads to exceed a mean Phred score of 30 can be easily carried out (for instance using the FASTQ Quality Trimmer from FASTX32), creating a set of sequences that are of high quality and unambiguous in nature. Trimming in this manner introduces variation in the sequence length, though the majority of reads are over 45 bases in length (Fig. 2c,d), indicating minimal loss from the original 48 base length.
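The relationship between Phred scores and accuracy, and the effect of trimming a low-quality final base, can be illustrated with a short Python sketch. It is a simplified stand-in and not the FASTX implementation: the function names are hypothetical, qualities are assumed to be Phred+33 encoded, and trimming here simply removes trailing bases whose individual score falls below the threshold.

```python
from typing import List, Tuple


def phred_to_error_probability(q: float) -> float:
    """Invert Q = -10 * log10(P) to obtain the error probability P."""
    return 10 ** (-q / 10)


def decode_quality(quality_string: str, offset: int = 33) -> List[int]:
    """Convert an ASCII-encoded FASTQ quality string to Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


def trim_3prime(sequence: str, quality_string: str,
                threshold: int = 30, offset: int = 33) -> Tuple[str, str]:
    """Drop trailing bases whose Phred score is below the threshold."""
    scores = decode_quality(quality_string, offset)
    keep = len(scores)
    while keep > 0 and scores[keep - 1] < threshold:
        keep -= 1
    return sequence[:keep], quality_string[:keep]


# Toy read: seven bases at Q40 ('I') followed by one base at Q2 ('#').
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIIIII#")
accuracy_q30 = 1 - phred_to_error_probability(30)  # approximately 0.999
```

In this toy read only the final Q2 base is removed, mirroring the single low-scoring terminal position described above.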

Fig. 2

Post-sequencing quality control of FASTQ files using FastQC. (a) Example FastQC plot for a control sample showing a drop in per base quality scores towards the end of the 50 bp read length. (b) Aggregated FastQC plots reveal this is a widespread phenomenon affecting all of the samples. (c) Trimming sequenced reads based on quality score introduces variety in sequence length distribution, though the majority are still greater than 45 bp in length. (d) After trimming, all samples pass the mean quality score test in FastQC.

CAGE quality control

The GLOBINclear™ kit successfully depleted the Y1 samples of haemoglobins, with the remaining globin mRNAs accounting for around 5.1% of the total sequenced tags (Fig. 3a). The proportion of globin tags in the Y2 sequenced samples was higher, averaging 29.1% of total tag counts, indicating that the globin depletion was not as efficient (Fig. 3a). Many of these tags cannot be unambiguously aligned to the genome (so-called multimappers), and thus can be easily removed before downstream analyses. In general, we obtained a high rate of CAGE tags mapping unambiguously to the hg38 human genome using STAR, with an average MAPQ10 count of 5.9 million tags across the two sequencing batches (Fig. 3b). Coupled with the depletion of globin RNAs, this indicates a high-quality set of sequencing samples that can be used for blood transcriptomic analysis of early PD. Furthermore, the samples show a high degree of consensus with FANTOM5 promoters, with an average of 77.8% of all tags overlapping promoter regions (Fig. 3c). One important caveat is that one of the Y1 control samples had a lower sequencing depth, with total MAPQ10 counts of less than 1 million (Fig. 3b). Although this sample clusters separately from the remainder of the control samples, it is still highly similar to them. For instance, the number of detected promoters for this sample is only slightly reduced compared with the overall promoter mapping rate for control samples (75.85% of FANTOM5 promoters versus 79.79 ± 2.8%; Fig. 3c), showing that the majority of promoters expressed in other control samples are also expressed here.
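A count of unambiguously mapping tags of the kind reported above can be derived from SAM records by thresholding the MAPQ field. The Python sketch below is a minimal illustration under stated assumptions rather than the procedure used for the dataset: the function names and example records are hypothetical, input is assumed to be plain-text SAM lines, and "MAPQ10" is interpreted as a mapping quality of at least 10. In the SAM format the FLAG is the second column (bit 0x4 marks an unmapped read) and MAPQ is the fifth.

```python
from typing import Iterable


def is_high_quality(sam_line: str, min_mapq: int = 10) -> bool:
    """True if the SAM record is mapped and its MAPQ is at least min_mapq."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: bitwise FLAG
    mapq = int(fields[4])   # column 5: mapping quality
    is_unmapped = bool(flag & 0x4)
    return not is_unmapped and mapq >= min_mapq


def count_high_quality(sam_lines: Iterable[str], min_mapq: int = 10) -> int:
    """Count records passing the MAPQ filter, skipping '@' header lines."""
    return sum(
        1
        for line in sam_lines
        if not line.startswith("@") and is_high_quality(line, min_mapq)
    )


# Made-up records: one confidently mapped tag, one low-MAPQ multimapper, one unmapped tag.
seq, qual = "A" * 48, "I" * 48
example_sam = [
    "@HD\tVN:1.4",
    f"tag1\t0\tchr1\t101\t255\t48M\t*\t0\t0\t{seq}\t{qual}",
    f"tag2\t16\tchr11\t5225500\t3\t48M\t*\t0\t0\t{seq}\t{qual}",
    f"tag3\t4\t*\t0\t0\t*\t*\t0\t0\t{seq}\t{qual}",
]
n_mapq10 = count_high_quality(example_sam)
```

Because multimapping tags receive low MAPQ values, the same threshold removes the ambiguously aligned globin tags mentioned above.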

Fig. 3

Mapping statistics and quality control of CAGE data. (a) Percentage of all sequenced CAGE tags (including multimapping tags) originating from haemoglobin genes. (b) Number of high quality, unambiguously mapping tags across all the samples. (c) Percentage of the MAPQ10 tags that overlap with the FANTOM533 promoter regions.