Introduction

Parkinson disease (PD) is the second most common neurodegenerative disorder in the aged population, with a prevalence of 1% at age 65 that rises to 3% by age 75.1 The cardinal motor features of PD are resting tremor, bradykinesia, rigidity, and postural instability. A definitive diagnosis of PD can only be made at autopsy, and the accuracy of clinical diagnosis varies between 76% and 99%.2 A number of nonmotor problems can also arise during the course of the disease, including cognitive impairment, psychosis, sleep disturbances, depression, and autonomic dysfunction.

Several environmental factors have been postulated to contribute to the development of PD, including long-term exposure to heavy metals and pesticides, although these associations are far from conclusive. Clinical indicators that have been repeatedly and reliably associated with developing PD are advanced age and male sex. Smoking cigarettes, drinking coffee, and using nonsteroidal anti-inflammatory drugs are protective factors against developing PD.1,3 However, unlike diabetes or cardiovascular disease, there are no markers in the blood that can be used to prognosticate risk for PD.

PD was once thought to be completely environmental in etiology. However, mutations in at least six genes, LRRK2, PARK2, PARK7, PINK1, VPS35, and SNCA, are now known to cause monogenic forms of the disease.4,5,6,7 Furthermore, common variants in several genes including MAPT and SNCA have consistently been demonstrated to associate with typical, late-onset PD.8,9,10,11,12,13,14,15,16 Genotype data from these genes and others can be combined to create a genetic risk score that may better predict PD risk than relying on clinical and demographic data alone. Thus, the goal of this project was to compare the impact of adding family history, which reflects shared genetic and environmental factors, and specific genetic markers for new and established candidate genes with a model that includes only established clinical and environmental risk factors for PD.

Materials and Methods

Study sample

The study population was derived from 2,000 patients with PD and 1,986 controls enrolled through the NeuroGenetics Research Consortium (NGRC), which includes movement disorder clinics in Albany, NY; Atlanta, GA; Portland, OR; and Seattle, WA; the data were downloaded from dbGaP (phs000126.v1.p1). All patients met UK PD Society Brain Bank clinical diagnostic criteria for PD as determined by a movement disorder specialist17 and were consecutively recruited, except that patients who had an age at onset <20 years or whose race was not solely classified as “white” (by self-report) were excluded from the sample. Data on smoking behavior were collected at all sites using a standardized questionnaire. Controls had no history of parkinsonism and were either spouses of patients with PD or community volunteers.

Genotypes were derived from a genome-wide association study (GWAS) previously performed on the NGRC case–control sample.11 The NGRC GWAS data set included 811,597 single-nucleotide polymorphisms (SNPs) assayed on the Illumina HumanOmni1-Quad_v1-0_B genotyping array (Illumina, San Diego, CA). Ungenotyped SNPs in our regions of interest were imputed using the software program IMPUTE2 version 2 with the methods described by Howie et al.18 To ensure that rare variants were adequately covered, we used two phased reference panels from HapMap3 and 1000 Genomes pilot data with release dates of February 2009 and June 2010, respectively. A genotype probability of 80% or greater was used to call the most likely genotype for each SNP. LRRK2 G2019S was genotyped separately as previously described.19
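The calling rule described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name, the assumption that the three probabilities are ordered by minor-allele count (0, 1, 2), and the choice to treat sub-threshold genotypes as missing are all assumptions made for the example.

```python
from typing import Optional, Sequence

CALL_THRESHOLD = 0.80  # threshold stated in the text


def call_genotype(probs: Sequence[float],
                  threshold: float = CALL_THRESHOLD) -> Optional[int]:
    """Return the index (assumed minor-allele count 0, 1, or 2) of the most
    likely genotype, or None when no genotype reaches the threshold
    (assumed here to mean the call is treated as missing)."""
    if len(probs) != 3:
        raise ValueError("expected three genotype probabilities")
    best = max(range(3), key=lambda g: probs[g])
    return best if probs[best] >= threshold else None


print(call_genotype([0.05, 0.90, 0.05]))  # 1
print(call_genotype([0.40, 0.55, 0.05]))  # None
```

Under this assumed rule, a stricter threshold trades more missing calls for fewer miscalled genotypes.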

To determine which SNPs to include for risk prediction, we first constructed a list of 46 SNPs that were reported to be associated with PD at a genome-wide level of significance in one or more previous GWA studies.8,9,10,11,12,13,14,15,16 Of these 46 SNPs, 21 were directly genotyped in the original NGRC data set and the remainder (n = 25) were imputed using the HapMap3 and 1000 Genomes reference panels. Seven SNPs with >5% missing data were excluded; all of these were imputed SNPs. We also included the genotype of the LRRK2 G2019S mutation. Multicollinearity was assessed for all pairs of variants. In the case of a pair with strong correlation (r² ≥ 0.80), the variant with more missing data was excluded. Seven SNPs were excluded for collinearity. A total of 33 variants were eligible for model inclusion.
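The two exclusion steps described above (missingness, then pairwise correlation) can be sketched as below. This is an illustrative sketch under stated assumptions: genotypes are coded as minor-allele counts with None for missing, r² is taken as the squared Pearson correlation over people genotyped at both variants, and ties in missingness drop the second variant of the pair. The variant names and toy data are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Optional

Genotypes = Dict[str, List[Optional[int]]]  # variant -> allele counts, None = missing


def missing_rate(calls: List[Optional[int]]) -> float:
    return sum(c is None for c in calls) / len(calls)


def r_squared(a: List[Optional[int]], b: List[Optional[int]]) -> float:
    """Squared Pearson correlation over people genotyped at both variants."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy ** 2 / (sxx * syy)


def select_variants(geno: Genotypes, max_missing: float = 0.05,
                    r2_cutoff: float = 0.80) -> List[str]:
    # Step 1: exclude variants with more than 5% missing data.
    kept = {v: g for v, g in geno.items() if missing_rate(g) <= max_missing}
    # Step 2: within each strongly correlated pair, drop the variant
    # with more missing data (ties drop the second one; an assumption).
    dropped = set()
    for a, b in combinations(kept, 2):
        if a in dropped or b in dropped:
            continue
        if r_squared(kept[a], kept[b]) >= r2_cutoff:
            worse = a if missing_rate(kept[a]) > missing_rate(kept[b]) else b
            dropped.add(worse)
    return [v for v in kept if v not in dropped]


pattern = [0, 1, 2] * 6 + [0, 1]  # 20 hypothetical participants
toy = {
    "snp_a": pattern,
    "snp_b": [None] + pattern[1:],                           # duplicates snp_a, 5% missing
    "snp_c": [0, 0, 0, 1, 1, 1, 2, 2, 2] * 2 + [0, 0],       # weakly correlated with snp_a
    "snp_d": [None] * 3 + pattern[3:],                       # 15% missing
}
print(select_variants(toy))  # ['snp_a', 'snp_c']
```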

Statistical analysis

Family history was missing in 62% of participants from the Oregon site, including 91% of controls. Furthermore, cases from this site reported a positive family history of PD (36%) more often than cases from the other three sites (15–26%); thus, all participants from Oregon, totaling 1,402, were excluded from the analysis. Family history information was not different among cases and controls (Table 1) and appeared to be missing at random for the remaining three sites, although there were some differences in total percentage missing across sites. A total of 617 of the remaining participants were missing data on one or more genetic variants of interest and were excluded from the analysis. Smoking behavior was missing in 354 participants and was imputed in these participants using logistic regression including the covariates age and sex. Family history was coded using a group of four dummy variables: one dummy variable indicated whether family history was missing or unknown, and the remaining three variables indicated family history in a first-, second-, or third-degree relative, noting family history in the closest relative. Known, negative family history was the reference group. Age was coded as age at time of blood draw. The 1,967 participants available for analysis were randomly divided into “training” and “test” data sets. The “training” data set consisted of 543 patients with PD and 435 controls. The “test” data set included 594 patients with PD and 395 controls.
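The family history coding described above can be illustrated with a short sketch. The variable names (fh_missing, fh_first, and so on) and the input representation are hypothetical; only the scheme itself (four dummy variables, closest affected relative, known negative history as the reference) comes from the text.

```python
from typing import Dict, Iterable, Optional

DUMMY_BY_DEGREE = {1: "fh_first", 2: "fh_second", 3: "fh_third"}


def closest_affected_degree(degrees: Iterable[int]) -> Optional[int]:
    """Degree of the closest affected relative, or None if none are affected."""
    degrees = list(degrees)
    return min(degrees) if degrees else None


def code_family_history(closest_degree: Optional[int], known: bool) -> Dict[str, int]:
    """Four dummy variables; all zeros is the reference group
    (known, negative family history)."""
    dummies = {"fh_missing": 0, "fh_first": 0, "fh_second": 0, "fh_third": 0}
    if not known:
        dummies["fh_missing"] = 1
    elif closest_degree is not None:
        if closest_degree not in DUMMY_BY_DEGREE:
            raise ValueError("degree must be 1, 2, or 3")
        dummies[DUMMY_BY_DEGREE[closest_degree]] = 1
    return dummies


# An affected second- and third-degree relative is coded by the closest one.
print(code_family_history(closest_affected_degree([3, 2]), known=True))
```

Coding missing family history as its own category keeps those participants in the model rather than dropping them, which is what the sensitivity analysis later examines.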

Table 1 Characteristics of participants

Characteristics for cases and controls were compared using a two-sample t-test with equal variance for age and Pearson χ² tests for categorical variables in Stata version 12 (Stata, College Station, TX).

Risk-prediction analyses were conducted in R version 2.15.0. Five risk models were constructed using logistic regression. All five models included the following baseline covariates: sex, age, and smoking status (ever vs. never). Model 1 is the baseline model with only the baseline covariates. Model 2 also included whether family history was known or unknown and the degree of family history of PD. Model 3 added to the baseline model a risk allele score constructed from the following SNPs: SNCA rs11931074, SNCA rs356220, MAPT rs1800547, and the LRRK2 G2019S mutation (rs34637584). SNPs in these genes were chosen because SNCA and MAPT are the most consistently replicated PD susceptibility genes.11 LRRK2 G2019S was selected because it accounts for 1–2% of PD in populations of European origin.19 Model 4 included a risk allele score constructed from all 33 SNPs and the baseline variables to evaluate the improvement of risk prediction with additional genetic information. Model 5 included covariates from the baseline model, whether family history was known or unknown and the degree of family history, and the weighted risk allele score constructed from four variants used in model 3. Risk allele scores were calculated as the sum of the minor alleles weighted by the β coefficient of that allele from a multivariate logistic regression of genetic covariates only. Each model’s discriminatory capability was evaluated using the C-statistic, which is the area under the curve (AUC) of receiver operating characteristic analyses; in the receiver operating characteristic, the sensitivity and specificity are both based on the classification of PD cases and controls, given the risk predicted from the logistic model. A C-statistic ranges from 0.5 (no predictive ability) to 1 (perfect predictive ability). 
We used DeLong’s test for two correlated receiver operating characteristic curves from the pROC R package to test for statistically significant differences in AUC obtained from each model.20
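The weighted risk allele score and the C-statistic described above can be sketched as follows. The β weights shown are hypothetical placeholders, not estimates from this study, and the C-statistic is computed here directly on scores for brevity, whereas the analysis in the text applies it to the risk predicted by each logistic model.

```python
from typing import Dict, Sequence


def weighted_risk_score(minor_allele_counts: Dict[str, int],
                        betas: Dict[str, float]) -> float:
    """Sum of minor-allele counts, each weighted by its beta coefficient."""
    return sum(betas[v] * minor_allele_counts[v] for v in betas)


def c_statistic(case_scores: Sequence[float],
                control_scores: Sequence[float]) -> float:
    """Probability that a randomly chosen case scores higher than a randomly
    chosen control, counting ties as one half (equal to the ROC AUC)."""
    wins = 0.0
    for x in case_scores:
        for y in control_scores:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical weights for illustration only (not the study's estimates).
betas = {"rs356220": 0.25, "rs1800547": -0.30, "rs34637584": 2.0}
person = {"rs356220": 2, "rs1800547": 1, "rs34637584": 0}
print(round(weighted_risk_score(person, betas), 3))

# Toy scores: a value of 0.5 would mean no discrimination, 1.0 perfect discrimination.
print(round(c_statistic([0.2, 0.6, 0.9], [0.1, 0.2, 0.5]), 3))
```

A negative weight, as for a protective minor allele, lowers the score, so the score can be read as a position on the log-odds scale of the genetic-only regression.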

Results

Participant characteristics and association with PD

A total of 1,967 participants were included in analyses. Table 1 shows the characteristics of these participants. As compared with controls, cases were more often male, more often had a known family history of PD in first- and second-degree relatives, and were slightly older.

In the model 1 multivariate analysis, men were approximately three times more likely to have PD than women (odds ratio (OR): 3.29, 95% confidence interval (CI): (2.52–4.31)). Neither age nor smoking was significantly associated with PD in this model (Table 2).

Table 2 PD risk prediction regression estimates by model using the training set

Model 2 evaluated family history of PD adjusted for the covariates included in model 1. Those who reported a family history of PD in a first- or second-degree relative were nearly four and three times, respectively, more likely to have PD as compared with those without a family history of PD in first-, second-, or third-degree relatives (OR: 3.59, 95% CI: (1.94–6.64) and OR: 3.25, 95% CI: (1.67–6.32), respectively). Family history in a third-degree relative was not associated with PD in this model (Table 2).

Characteristics of genetic variants

The characteristics of the genetic variants are shown in Supplementary Table S1 online. Minor allele frequencies for SNPs in our control sample were similar to those in the HapMap CEU population, and Hardy–Weinberg equilibrium was not significantly violated in the controls (P > 0.10). Three-quarters of risk alleles were common, with minor allele frequencies >10%. The minor alleles of the SNCA rs11931074, SNCA rs356220, TMEM175 rs6599388, LRRK2 rs1491942, LRRK2 rs34637584, GAK rs11248051, and HLA-DRA rs3129882 variants were associated with a significantly increased risk of PD in univariate analyses (Supplementary Table S1 online). The minor allele of the MAPT rs1800547 variant conferred a decreased risk for PD (Supplementary Table S1 online).

A multivariate analysis was conducted on a subset of four variants from three established PD genes to create the weighted risk allele score used in models 3 and 5. In this analysis, SNCA rs356220, LRRK2 rs34637584, and MAPT rs1800547 remained associated with PD, but SNCA rs11931074 did not (Supplementary Table S2 online).

A fuller multivariate analysis was conducted using all 33 SNPs from 25 genes to create the weighted risk allele score used in model 4. In this analysis, HLA-DRA rs3129882 was associated with increased risk, whereas FAM47E rs6812193, BST1 rs11724635, and MAPT rs1800547 were associated with decreased risk of PD (Supplementary Table S3 online).

Risk allele score and risk for PD

Figure 1 shows the distribution of the weighted risk allele score by case–control status for models 3 and 5 (Figure 1a) and model 4 (Figure 1b). Histograms for both models in controls and cases are normally distributed and overlap each other extensively, although the distribution for cases is shifted slightly to the right.

Figure 1 Distribution of weighted risk allele score by case–control status for (a) model 3/5 and (b) model 4. PD, Parkinson disease.

In model 3, the risk allele score constructed from four variants was included in risk prediction along with age, sex, and smoking. The risk allele score was associated with an approximately threefold increase in risk for PD for every one unit increase in risk allele score (OR: 2.57, 95% CI: (1.72–3.83)). Model 4 included a weighted risk allele score constructed from 33 variants in addition to age, sex, and smoking. The weighted risk allele score was also associated with a nearly threefold increase in risk of PD for every one unit increase in risk allele score (OR: 2.62, 95% CI: (2.07–3.30)) (Table 2). Including either risk allele score did not attenuate the association of sex with PD observed in model 1.
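Because the ORs above are expressed per one unit of the risk allele score, it can help to see what they imply for smaller differences in score. The sketch below uses only the standard property of logistic regression that the log odds are linear in the covariate; the function name and the example score differences are assumptions chosen for illustration.

```python
import math


def odds_ratio_for_change(or_per_unit: float, delta: float) -> float:
    """Odds ratio implied by a score difference of `delta` units when the
    odds ratio per one unit is `or_per_unit` (log odds linear in the score)."""
    return math.exp(math.log(or_per_unit) * delta)


# Model 3 point estimate from the text: OR of 2.57 per one unit of score.
for delta in (0.25, 0.5, 1.0):
    print(delta, round(odds_ratio_for_change(2.57, delta), 2))
```

If most participants differ from one another by much less than one unit of score, the ORs separating them are correspondingly much smaller than threefold.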

Model 5 was the largest model and added the weighted risk allele score created from four variants to model 2 covariates. The weighted risk allele score remained significantly associated with PD after adjusting for family history (Table 2). Similarly, the OR for family history was not attenuated by adding the risk allele score.

Discriminatory ability

The receiver operating characteristic curves for all models are shown in Figure 2. Our first model, which included only age, sex, and smoking, had a discriminatory capacity of 0.6534 (95% CI: (0.6183–0.6885)) in the training set and was replicated in the test set (AUC: 0.6831, 95% CI: (0.6484–0.7177); P = 0.2382 vs. model 1 in the training set). Adding family history, but no genetic markers (model 2), significantly increased discriminatory capacity to 0.6847 (95% CI: (0.6513–0.7181); P < 0.001 vs. model 1) in the training set and was replicated successfully in the test set, with an AUC of 0.7117 (95% CI: (0.6789–0.7446); P = 0.002 vs. model 1 in the test set; P = 0.2581 vs. model 2 in the training set). Alternatively, we added a risk allele score constructed from variants within the SNCA, LRRK2, and MAPT genes in model 3. This model significantly increased the discriminatory capacity as compared with model 1 in both the training set (AUC = 0.6886, 95% CI: (0.6552–0.722); P = 0.001) and the test set (AUC = 0.7047, 95% CI: (0.6712–0.7382); P = 0.044), and it was replicated in the test set (P = 0.5048 vs. model 3 in the training set). Adding the weighted risk allele score created from all 33 variants to age, sex, and smoking (model 4) significantly increased the discriminatory capacity to 0.727 (95% CI: (0.6956–0.7584)) in the training set (P < 0.001 vs. model 1 in the training set). However, the discriminatory capacity of model 4 in the test set was only 0.7047 (95% CI: (0.6718–0.7375)), and although it replicated the training set AUC (P = 0.3359 vs. model 4 in the training set), it was not significantly higher than that of model 1 (P = 0.1193 vs. model 1 in the test set).
Adding the weighted risk allele score constructed from four variants, which increased prediction, to model 2 (model 5) increased the AUC to 0.7112 (95% CI: (0.679–0.7434)) in the training set and was replicated in the test set (AUC: 0.729, 95% CI: (0.6968–0.7611); P = 0.4435 vs. model 5 in the training set). The discriminatory capacity of model 5 was significantly higher than that of model 1 (P < 0.001 in the training set; P < 0.001 in the test set). We then compared model 5 with model 2 to determine whether the genetic risk score improved risk prediction beyond family history. The discriminatory capacity of model 5 was significantly greater than that of model 2 in the training set and the test set (P = 0.003 and P = 0.04, respectively).
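For readers unfamiliar with how two correlated AUCs are compared, the following is a compact sketch of the DeLong test for two models scored on the same participants (for example, model 5 vs. model 2 within one data set). It is a plain-Python illustration of the published method, not the pROC implementation used in the analysis; the function names and toy scores are hypothetical.

```python
import math
from statistics import NormalDist


def _psi(x, y):
    # Kernel of the Mann-Whitney statistic: 1 if case > control, 0.5 if tied.
    return 1.0 if x > y else 0.5 if x == y else 0.0


def _components(cases, controls):
    # Structural components: per-case and per-control placement values.
    v10 = [sum(_psi(x, y) for y in controls) / len(controls) for x in cases]
    v01 = [sum(_psi(x, y) for x in cases) / len(cases) for y in controls]
    return v10, v01


def _cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def delong_paired(cases_a, controls_a, cases_b, controls_b):
    """Return (auc_a, auc_b, z, two-sided p) for models A and B, where the
    i-th case (or control) is the same person under both models."""
    m, n = len(cases_a), len(controls_a)
    v10a, v01a = _components(cases_a, controls_a)
    v10b, v01b = _components(cases_b, controls_b)
    auc_a, auc_b = sum(v10a) / m, sum(v10b) / m
    # Variance of the AUC difference from case and control components.
    var = ((_cov(v10a, v10a) + _cov(v10b, v10b) - 2 * _cov(v10a, v10b)) / m
           + (_cov(v01a, v01a) + _cov(v01b, v01b) - 2 * _cov(v01a, v01b)) / n)
    if var <= 0:
        raise ValueError("zero variance: the AUC difference cannot be tested")
    z = (auc_a - auc_b) / math.sqrt(var)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return auc_a, auc_b, z, p


# Toy predicted risks for three cases and three controls under two models.
auc_a, auc_b, z, p = delong_paired([0.9, 0.8, 0.7], [0.1, 0.2, 0.3],
                                   [0.6, 0.4, 0.7], [0.5, 0.3, 0.2])
print(round(auc_a, 3), round(auc_b, 3), round(z, 3), round(p, 3))
```

The covariance terms are what make the test appropriate for correlated curves: when both models rank the same people similarly, the variance of the AUC difference shrinks and small differences become detectable.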

Figure 2 Receiver operating characteristic curves comparing model 1 with model 2, model 3, model 4, and model 5 in the (a) training and (b) test sets. AUC, area under the curve.

Sensitivity analysis

We performed a sensitivity analysis to evaluate the impact of including subjects with missing or unknown family history information. ORs for all covariates and AUCs in all models remained unchanged when those with missing family history (n = 107) were excluded from the training set (Supplementary Table S4 online).

Discussion

In this study, we compared five models of PD risk prediction. All five models contained the covariates age, sex, and smoking. In model 3, a weighted risk allele score constructed from four variants in the SNCA, LRRK2, and MAPT genes was added to the baseline model, whereas model 4 included a weighted risk allele score constructed from a total of 33 SNPs in 25 genes. A larger risk allele score was associated with greater risk for PD for models 3 and 4. Although each one unit increase in risk allele score was associated with a nearly threefold greater risk of PD in either model, using the risk allele score in a model to predict risk for PD had a fairly low (0.69–0.73) discriminatory ability. Pepe and colleagues21 reported that ORs of this size, which are commonly observed in complex diseases, have little impact on the C-statistic. In addition, they estimate that the contribution of the predictors included in a model requires an OR of about 16 (corresponding to an AUC of 0.84) to achieve reasonable discrimination. Therefore, even though the OR for our combined genetic data was associated with a nearly threefold significant increase in risk, this OR was not strong enough to translate into significant discriminatory capacity for our genetic models. The common variants identified for almost all complex diseases, including PD, have very modest ORs. Identifying additional genetic variants with considerably larger effects will be necessary to achieve any substantial improvement in the AUC.
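The weak link between an OR and classification accuracy can be made concrete with the standard identity for a binary marker, OR = [TPF/(1 − TPF)] × [(1 − FPF)/FPF], where TPF and FPF are the true- and false-positive fractions. The sketch below solves this identity for TPF. It is an illustration under the assumption of a binary marker; the risk allele score in this study is continuous, and the figures quoted from reference 21 may rest on different assumptions.

```python
def tpf_from_odds_ratio(odds_ratio: float, fpf: float) -> float:
    """True-positive fraction of a binary marker with the given odds ratio
    at a given false-positive fraction, from
    OR = [TPF / (1 - TPF)] * [(1 - FPF) / FPF]."""
    if not 0 < fpf < 1:
        raise ValueError("fpf must be strictly between 0 and 1")
    odds_of_detection = odds_ratio * fpf / (1 - fpf)
    return odds_of_detection / (1 + odds_of_detection)


# At a 10% false-positive fraction, compare a threefold OR with an OR of 16.
for odds_ratio in (3, 16):
    print(odds_ratio, round(tpf_from_odds_ratio(odds_ratio, 0.10), 3))
```

Under this identity, a marker with an OR of 3 that flags 10% of controls detects only a quarter of cases, which is why statistically strong associations can still discriminate poorly.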

Several risk prediction studies have been published for various chronic diseases such as cardiovascular disease, diabetes, and breast and prostate cancer. The discriminatory ability of the models including genotype information ranges from 0.53 to 0.61.22,23,24,25,26 Researchers at 23andMe recently published a paper including several models of risk prediction for PD based on between 9 and 803 genetic variants, resulting in discriminatory ability ranging from 0.55 to 0.61,10 which is within the range reported for other complex diseases. In general, better discriminatory ability is seen from genetic variants included in risk prediction models for diseases with autoimmune etiology (e.g., age-related macular degeneration, psoriasis, and Crohn disease), ranging from 0.72 to 0.80.27,28,29

Generally, genetic risk prediction for common diseases has resulted in low discriminatory ability, which may, in part, be due to limitations in this method of risk prediction. Specifically, one limitation with many modeling approaches is that they may not accurately approximate the biological processes underlying the molecular pathogenesis of PD. Another limitation is the common practice of excluding highly correlated SNPs from analysis, as we have done, which might diminish predicted risk given that there is evidence that poorly performing but highly correlated markers can add substantially to model performance.30 Third, logistic regression analysis implicitly assumes a multiplicative risk model. But it is unclear whether a multiplicative model or an additive model best fits genetic risk data. The discriminative power (e.g., AUC) of the multiplicative risk model is greater and risks predicted under a multiplicative model are more extreme than those predicted under an additive risk model.31 Without knowing the mode of biological interaction underlying gene variant contribution to pathogenesis of disease, choosing the wrong model will cause overestimation or underestimation of risk and predictive ability of the risk model.31 Fourth, as mentioned earlier, combined risk from genetic variants must be quite large to observe any significant increase in discriminatory ability. Fifth, the sensitivity and specificity of a model are dependent on the subjects included for risk prediction.32,33 As such, misclassification due to heterogeneity in disease etiology, case–control status, or exposure variables will affect the model’s predictive ability.

The goal of this study was to compare the predictive ability of family history vs. specific genetic risk variants. Adding family history data to standard demographic risk factors for PD resulted in significantly better discriminatory ability than demographic risk factors alone. Adding a weighted risk allele score to family history information also significantly improved prediction, increasing discriminatory capacity from 0.71 to 0.73. Family history is a crude genomic measure, incorporating not only shared genetic variants but also shared environmental risk factors and interactions. Because both family history and the risk allele score were as strongly associated with PD in the model that contained both (model 5) as in the models that analyzed each separately (models 2 and 3), we hypothesize that family history incorporates genetic risk factors, including rare variants with larger effect sizes, that are different from genetic risk factors identified through genome-wide association studies. These differences may reflect different underlying etiology of patients with PD that could be used to identify more homogeneous subgroups for further study. However, self-reported family history is subject to recall bias, whereas assessment of genetic risk factors is an objective measure. Because of this, predicted risks derived from self-reported family history may be overestimated.

Practical genetic risk prediction for PD will require identification of more genetic risk factors, especially those contributing to familial disease. A recent study used genome-wide complex trait analysis to quantify the heritability of PD. Overall, the heritability of PD was estimated at 0.27; known GWAS SNPs in PD regions contributed 0.03 to the heritability estimate,34 indicating that a large proportion of the genetic variance is not yet accounted for. Furthermore, using this method, early onset cases had lower heritability than late onset cases, an unexpected result. The authors hypothesized that rare genetic variants contributing to early onset cases, which are generally understood to have a greater familial and genetic component, were poorly accounted for with standard genotyping platforms. In our study, we observed minimal improvement in risk prediction when we included a weighted genetic risk score. This reflects the observations that known variants account for only ~10% of the heritability of PD and that genes involved in early onset cases, which may have larger effect sizes, are not generally included on standard genotyping platforms and are difficult to impute. In addition to finding more genetic variants, better methods are needed to appropriately model genetic risk for disease, accounting for the molecular biology of associated genetic variants and allowing for interactions between both genes and environmental factors.

Disclosure

The authors declare no conflict of interest.