Introduction

Cognitive-related health issues are increasingly prominent against the backdrop of global aging. As of 2019, there were already more than 50 million people living with dementia worldwide, with an average of one new case of dementia every three seconds. More than 75 million people will suffer from dementia by 20301,2. China is a country with a huge aging population. Currently, the number of older adults aged 60 years and above in China has reached 264.02 million, accounting for 18.70% of the population3.

We have explored some factors (e.g., diet, activity status, etc.) associated with cognitive impairment in the Chinese older adults through our previous studies4,5. In order to better clarify whether there is some causal association, we therefore investigated the issue with the help of Mendelian randomization (MR). It is a method that uses genetic variation as the instrumental variable (IV) to make causal inferences about the effect of exposure on outcome6,7,8. However, this method has limitations when the exposure is ordered multi-categorical variables as such. Briefly, if an observed multi-level categorical exposure (i.e., intensities of exercise) is the manifestation of an underlying continuous exposure (i.e., motivation to exercise) passing certain cut-off points, the observed level of category may be stable even if the latent exposure has changed, because the latter haven’t crossed one of the cut-offs. This may violate the assumption of exclusivity of the instrumental variable9,10,11. In this scenario, disregarding the relationship between a latent, invisible continuous exposure and its external, visible categorical manifestation will bias the MR/IV estimation of effect size (if the latter is a strong stepwise mediator), or even confound the claim of casual association (if the latter is mostly a surrogate)12.

In view of the above issues, we aimed to develop and test an analytical pipeline of Mendelian randomization, tailed to ordered multi-categorical exposures and to further explore potential causal associations between cognitive levels and their influencing factors in the Chinese older adults.

Materials and methods

Data sources

Data for this study were obtained from the Chinese Longitudinal Healthy Longevity Survey (CLHLS) database (1998–2018)13,14. This was the largest and most extensive cohort of older adults in China, encompassing more than 40,000 older adults in total. The project mainly collected information on the socio-demographic characteristics, lifestyles, and health status of participants through a questionnaire (self-report). A total of 908 surviving older adults aged 65 years or older (based on the age of the participants when they first participated in the survey) were selected from individuals who participated in both the questionnaire and whole genome sequencing (WGS). Meanwhile, we collated their baseline information. After quality control (QC)15 of the genetic data, a total of 897 older adults aged 65–110 years were included (no missing data), with a male to female ratio of approximately 1:2.74.

Cognition

Individual cognitive data were collected and calculated by the Mini-mental State Examination (MMSE). This scale provides a comprehensive assessment of the individual's general ability, reactivity, attention and calculation, recall, language, comprehension and self-coordination. The total score is 30, with higher scores indicating a better cognitive status16.

Living habits

Based on the results of our previous studies conducted on CLHLS4,5, we selected those factors that had some association with individual cognitive level for inclusion in this study. The variables in this section included individual drinking status, exercise status, dietary habits, and activity participation at their baseline. The drinking status were categorized by drinking (including abstainers) and never drinking. Exercise status was similarly divided into two categories according to exercise and never exercise. Diet and activity participation were recorded in terms of frequency, in descending order of "frequently", "occasionally" and "rarely/never", with higher values representing lower frequency. Other variables and detailed descriptions were given in Additional file 1.

Gene

Genetic data were obtained from the Longevity Study (CLHLS) conducted by Prof. Yi Zeng.17 The data were generated from whole genome sequencing and genotyped by Illumina humanomnizhhua-8 BeadChips. The chip was created by selecting optimally tagged single nucleotide polymorphisms (SNPs) from all three phases of the the International HapMap Project as well as from the Thousand Genomes Project. It covered 900 015 SNPs, including 600 000 common variants (Minor allele frequency, MAF ≥ 5%), 290 000 rare variants (MAF < 5%), and 10 000 SNPs present only in Chinese and other Asian populations. In addition to directly sequenced SNPs, the project also used IMPUTE software (version 2) to infer genotypes for SNPs with MAF ≥ 0.01 (The 1000 Genomes Project as a reference), representing approximately 85.38%17.

We performed quality control on the raw genetic data and examined the population stratification (multidimensional scaling, MDS)15. A total of 3 240 266 SNPs were included in this study with genotype data covering autosomes 1–22. The genotype of each SNP was recorded in the form of an additive model (i.e., 0, 1, 2). Chromosome location information was obtained from hg19.

Genome-wide association study

Genome-wide association study (GWAS)15 was conducted by plink 1.9 for individual cognitive level, living habits and potential confounders (e.g., educational status and stroke/CVD5). Among them, continuous variables were analyzed using linear regression, while categorical variables were performed through logistic regression. The covariates included were age, sex, ethnicity, and 10 dimensions representing stratified characteristics of the population obtained by multidimensional scaling. Subsequently, the Quantile–Quantile plot (Q-Q plot) and Manhattan plot were plotted based on the results obtained from GWAS.

Instrumental variables

Instrumental variables (IVs) were selected from independent loci that were significant in GWAS. In order to better validate the generalizability and stability of our method, we relaxed the selection criterion of p-value to "close to significance" when selecting instrumental variables. Based on the results of GWAS, independent and significant SNPs (p < 1 × 10–5) satisfying the three assumptions of instrumental variables8 were selected as IVs for each exposure after screening to exclude those loci that were associated with known confounders.

Specifically, the sub-routine groups significant SNPs into genome segments, that is, clumps, that are independently and significantly associated with the outcome, each indexed by one of the highly significant SNPs within. Each new clump is iteratively isolated by (1) greedily search the next most significant SNP with p-value no larger than 1e-5 (via –clump-p1 0.00001) but not yet enclosed in any existing clumps, that is, the so-called index variant of the soon to be new clump, then (2) adding variants within close proximity to the index, both in terms of physical units (+ /− 500 KB, via –clump-kb 500) and linkage disequilibrium (r2 > 0.2, via –clump-r2 0.2). The index SNPs are then taken as independent loci for an outcome of interest18.

Causal inference for ordered multi-categorical exposure

In order to better fit the CLHLS data while addressing the limitations in the traditional Mendelian randomization, a simple and reliable causal inference method was presented in this study. An ordered multi-categorical variable (exposure) can be regarded as an unknown continuous latent variable divided by the corresponding threshold. Thus, the model for causal inference with the instrumental variables can be expressed as (Fig. 1):

Figure 1
figure 1

Directed acyclic graph (DAG) of instrumental variables in the causal inference of ordered multi-categorical variables. "G" denotes instrumental variables (genetic variation), "X" indicates exposure, "X*" denotes latent variables, "Y" and "U" represent outcome and confounders, respectively.

The association between variables can be represented by the following equation:

$$ X^{*} = \theta_{0} + \theta G + \gamma U + \varepsilon_{2} $$
(1)
$$ X = \left[ {\begin{array}{*{20}c} {1,} & {X^{*} < t_{1} } \\ {2,} & {t_{1} < X^{*} < t_{2} } \\ {3,} & {X^{*} > t_{2} } \\ \end{array} } \right. $$
(2)
$$ Y = \beta_{0} + \beta X^{*} + \alpha U + \varepsilon_{1} $$
(3)

\(({\text{G}},X,Y)\) are the observable data. Without loss of generality, we assume that \(U\sim N(0,{\sigma }_{u}^{2})\), \({\epsilon }_{1}\sim N(0,{\sigma }_{1}^{2})\), \({\epsilon }_{2}\sim N(0,{\sigma }_{2}^{2})\), \(G\), \(U\), \({\epsilon }_{1}\) and \({\epsilon }_{2}\) are mutual independent of each other. We want to estimate the causal effect (\(\beta \)) of \({X}^{*}\) on \(Y\). Combining the above equation shows that the relationship between \(Y\) and \(G\) can be expressed by the following equation:

$$ Y = \left( {\beta_{0} + \theta_{0} \beta } \right) + \theta \beta G + \left( {\alpha + \gamma \beta } \right)U + \left( {\varepsilon_{1} + \beta \varepsilon_{2} } \right) $$
(4)

\(G\), \(U\), \({\epsilon }_{1}\) and \({\epsilon }_{2}\) are independent of each other, so the parameter \(\theta \beta \) can be identified. Regression is performed on G utilizing Y. The regression coefficient is the estimate of the parameter \(\theta \beta \). If non-zero \(\theta \) can be identified, the parameter \(\beta \) can be identified.

Note that

$$ P\left( {X = 1{|}G} \right) = P\left( {X^{*} \left\langle {t_{1} } \right|G} \right) = P\left( {\gamma U + \varepsilon_{1} < t_{1} - \theta_{0} - \theta G} \right) $$
(5)
$$ P\left( {X = 2{|}G} \right) = P\left( {t_{1} < X^{*} \left\langle {t_{2} } \right|G} \right) = P\left( {\gamma U + \varepsilon_{2} < t_{2} - \theta_{0} - \theta G} \right) - P\left( {\gamma U + \varepsilon_{2} < t_{1} - \theta_{0} - \theta G} \right) $$
(6)
$$ P\left( {X = 3{|}G} \right) = P\left( {X^{*} > t_{2} |G} \right) = P\left( {\gamma U + \varepsilon_{2} > t_{2} - \theta_{0} - \theta G} \right) $$
(7)

Since \(E\left(\gamma U+{\epsilon }_{2}\right)=0\), \(D\left(\gamma U+{\epsilon }_{2}\right)={\gamma }^{2}{\sigma }_{u}^{2}+{\sigma }_{2}^{2}\), let \({\sigma }^{2}={\gamma }^{2}{\sigma }_{u}^{2}+{\sigma }_{2}^{2}\), then

$$ P\left( {X = 1{|}G} \right) = {\Phi }\left( {\frac{{t_{1} - \theta_{0} - \theta G}}{\sigma }{ }} \right) $$
(8)
$$ P\left( {X = 2{|}G} \right) = {\Phi }\left( {\frac{{t_{2} - \theta_{0} - \theta G}}{\sigma }{ }} \right) - {\Phi }\left( {\frac{{t_{1} - \theta_{0} - \theta G}}{\sigma }{ }} \right) $$
(9)
$$ P\left( {X = 3{|}G} \right) = 1 - {\Phi }\left( {\frac{{t_{2} - \theta_{0} - \theta G}}{\sigma }{ }} \right) $$
(10)

where \(\Phi \) is the cumulative distribution function of standard normal. From the above equation, it is clear that the parameter \(\theta \) is not identifiable. Further we used Eq. (8), which after transformation yields

$$ \Phi^{ - 1} \left( {P\left( {X = 1|G} \right)} \right) = \frac{{t_{1} - \theta_{0} }}{\sigma } - G \times \frac{\theta }{\sigma } $$
(11)

Subsequently, it is easy to know that \(\frac{\theta }{\sigma }\) can be identified by this equation. If one really wants to estimate the causal effect, they could try to estimate \({\sigma }^{2}\) from a separate study. \({\sigma }^{2}\) is connect to heritability of the continuous latent exposure \({X}^{*}\). According to Eq. (1) and the definition of \({\sigma }^{2}\), \({{V}_{{X}^{*}}={V}_{G}+\sigma }^{2}\), where \({V}_{{X}^{*}}\) is the remainder of variance of \({X}^{*}\) after accounting for non-confounding covariants (e.g., age, sex, and first few genotype principal components), and \({V}_{G}\) is variation in \({X}^{*}\) attributed to the genotype. If one knows the heritability \({h}^{2}={V}_{G}/{V}_{{X}^{*}}={V}_{G}/{({V}_{G}+\sigma }^{2})\), one could estimate \({V}_{G}\), then \({\sigma }^{2}\)—the part of variation in latent variable \({X}^{*}\) not attributed to genotype could be estimated, and thus one be able to estimate the true effect \(\theta \) of the MR/IV model, and lastly, the true effect of \({X}^{*}\) on outcome Y.

For combining all observations, we used the maximum likelihood estimation (which produces smaller variance and mean squared error19) for the calculation. The log-likelihood function is given as:

$$ log\mathop \prod \limits_{i = 1}^{n} P\left( {X_{i} = x_{i} |G_{i} = g_{i} } \right) = \mathop \sum \limits_{i = 1}^{n} \log \left( {P\left( {X_{i} = x_{i} |G_{i} = g_{i} } \right)} \right) $$
(12)

where \(P\left({X}_{i}={x}_{i}|{G}_{i}={g}_{i}\right)\) was defined by Eqs. (810), which is easily generalizable to exposures of more than 3 levels.

The above derivation still holds if there are multiple instrumental variables (i.e., \(G\) is a vector). We can first use the observed data \((G,X)\) to get the estimate \(\widehat{\theta }\) of \(\theta \) by the maximum likelihood estimation, and then perform a regression on \(\widehat{\theta }G\) using \(Y\) (same as two-stage regression). The estimated \(\widehat{\beta }\) of the causal effect (β) can be obtained by this process. Because the estimate of \(\theta \) (i.e., the estimate of \(\frac{\theta }{\sigma }\)) in this method differs from the true value by a constant multiple (\(\sigma \)), the result of this two-stage estimate also has a difference of a constant multiple (\(\sigma \)) from the true value. If this estimate is significantly non-zero, then the true causal effect is significantly non-zero. Therefore, this method can be used to infer whether ordered multi-categorical exposures (commonly found in questionnaires) have a significant causal effect on outcome (IVs need to be found with genetic data). In this study, the confidence interval (CI) of the causal effect was estimated by the bootstrap (the "boot" package of R software).

We conducted simulations using ten data sets randomly generated (n = 1000, Group A–G in which \(Y\) was correlated with \({X}^{*}\) and the last three groups were uncorrelated) combined with repeated sampling (500 times per group), and the results validated the validity and reliability of this method for testing potential causal associations. The simulation results were shown in Table 1.

Table 1 Simulation of causal inference for ordered multi-categorical exposure.

Furthermore, the factors (\(X\)) identified by previous studies as significantly associated with cognitive level (\(Y\)) were used as exposures and analyzed in conjunction with our causal inference method. The exposures were drinking status, the intake frequency of fish, fruit, garlic, legume, meat, sugar, and vegetable, exercise status, and the participation in activities such as housework, mahjong, open-air activities, pet ownership, reading, and television/radio, respectively.

For the statistically significant results among them, we used the MR-Egger method to test the potential horizontal pleiotropy.

Role of the funding source

The funder of the study had no role in the study design, data collection, data analysis, data interpretation, or writing of the report. The corresponding authors had full access to all the data in the study and had final responsibility for the decision to submit for publication.

Ethics approval and consent to participate

The Research Ethics Committees of Peking University (number IRB00001052-13074) and Duke University granted approval for the CLHLS, including collections of questionnaire data with written informed consent before participation. The study was performed in accordance with the Declaration of Helsinki. This study only showed the secondary aggregated data, and we did not include any data of their personal information, including name, identity information, address, telephone number, etc. None of the authors in this study had access to identifying patient information during the analysis of the data.

Results

After quality control, a total of 897 older adults were included in the study. Their mean (SD) age was at 97.19 (6.70) years and their mean (SD) cognitive score was 21.42 (7.34). The genetic data covered 3 240 266 loci (autosomes 1–22). Meanwhile, the population distribution of our sample was highly homologous to the Asian population in the 1000 Genomes Project, and there was no population stratification in this study. The results were shown in Fig. 2.

Figure 2
figure 2

Population stratification of the sample (multidimensional scaling method). "EUR" stands for European, "ASN" for Asian, "AMR" for American, and "AFR" represents African origin (this part of the data was from the 1000 Genomes Project). "OWN" represents the data involved in this study.

Results of genome-wide association analysis

Genome-wide association analysis detected 512 significant loci with p < 1 × 10–5 adjusted for covariates. Among them, four independent loci were associated with cognition. Seventy-eight independent loci were detected for categorical exposures, distributed among drinking status (6), fish (7), fruit (3), garlic (5), legume (8), meat (8), sugar (2), vegetable (4), exercise status (7), housework (6), mahjong (3), open-air activities (5), pet ownership (6) read (4), and TV/radio (4). In addition, 11 and 3 independent loci were found in education and stroke/CVD, respectively. The results were shown in Additional file 2.

Among the above loci, one locus reached genome-wide significance (p < 5 × 10–8)—rs78069066 (p = 1.21 × 10–8, from drinking status). The Manhattan and Q-Q plots of GWAS results for each variable were detailed in Additional file 3.

Causal inference results for ordered multi-categorical exposures

We explored potential causal associations using a causal inference approach for ordered multi-categorical exposures. Instrumental variables were selected from independent loci that were significant in GWAS, and loci with known confounding were excluded. Finally, a total of 78 independent SNPs were selected as instrumental variables. The IVs included for each exposure were the same as the independent loci screened above, as detailed in additional file 2.

After multiple replicate calculations (500 replicates for each exposure), the results showed a positive causal association between individuals' exercise status and their own cognitive level (\(\widehat{\beta }\) = 1.883, 95%CI 0.182—3.512). In contrast, there was no causal association between diet, activity, and cognitive level of older adults. The specific results were shown in Table 2.

Table 2 Causal inference results for ordered multi-categorical exposure.

In response to the statistically significant results, a further sensitivity analysis confirmed that there was no horizontal pleiotropy between cognitive level and exercise status (p > 0.05, Table 3). This proved that the causal effect described above was valid and reliable.

Table 3 Results of the horizontal pleiotropy test.

Discussion

Currently, Mendelian randomization is the mostly commonly accepted method of causal inference in medical field. The traditional MR, however, has a relatively limited field of application and usually prefers causal inference of exposure and outcome in the form of continuous variables. If categorical variables were directly analyzed as exposures, the estimated causal effects would be inaccurate10,20. Although the methodology in this field has been expanded somewhat in recent years, causal inference involving categorical variables has focused more on cases where the outcome is a categorical variable21,22,23, and studies of categorical exposures are still lacking.

Causal inference with exposure in the form of categorical variables such as binary categories was highly sensitive to the choice of thresholds. Since categorical exposures typically only contain effects associated with boundary points and category changes and do not adequately capture subtle changes (failure to change category) in its original variables (continuous exposures), true causal effect are often difficult to explore. In addition, the causal estimates calculated by MR can sometimes be difficult to interpret. Therefore, it has been suggested that MR should focus more on testing for potential causal effects rather than trying to compute estimates of causal effects24,25.

The innovative method in this study treats ordered multi-categorical variables as divided from continuous variables by specific boundary points, and on this basis, the causal association between ordered multi-categorical exposures and continuous outcomes was inferred and uncovered through maximum likelihood estimation. This method can easily and effectively explore the causal associations between exposures and outcomes. Meanwhile, this added a new theoretical basis to the field and provided evidence to support the development of subsequent methods. In addition, this method estimated causal effects that differed from the true effects by a constant multiple. This situation was consistent with the idea mentioned in previous studies that using coarsened measures (categorical exposures) as exposures for estimation leads to bias, which would amplify or reduce their effect estimates without inverting their sign10.

In addition, the study confirmed a potential causal association between exercise status and individual cognitive level through our method. The results suggested that exercise could delay the cognitive decline. A recent international study that analyzed the brains of hundreds of deceased older adults found that individuals who participated more in daily exercise had higher levels of presynaptic proteins and better synaptic integrity in their brains during old age26 This is the first time that the positive effects of exercise on synaptic function have been demonstrated in humans. Earlier studies have shown that only 10 min of moderate intensity running per day can improve mood and cognitive performance27. Exercise such as running increases blood flow to five cortical areas: l-DLPFC, l-FPA, r-DLPFC, r-VLPF and r-FPA. The stimulation received by these areas played a very important role in improving mood and cognition. Regardless of the activity, high-frequency exercise invariably increases the cognitive load of older adults. After exercise with high cognitive load, it enhanced functional connectivity in the superior frontal gyrus and prefrontal cortex and reduced functional connectivity in the middle occipital gyrus and postcentral gyrus at rest, which in turn delayed cognitive decline28.

On the other hand, the positive effects from exercise were not really lasting26. This also demonstrated to some extent the importance of maintaining a long-term, healthy exercise habit in the older adults. The adoption of measures such as appropriate physical activity was necessary no matter what age. As Dr. Robert S. Wilson said29, "It's never too late to start participating in such activities. Even in your 80 s, cognitive stimulation activities can delay the onset of Alzheimer's disease.

Conclusion

The causal inference method developed in this study combined with the maximum likelihood estimation addressed the problem of difficult causal inference for ordered multi-categorical exposures and compensated for the limitations of traditional MR methods. This not only provides us a practical tool, but also a different way of developing methodologies in this field. In addition, we uncovered a potential positive causal association between exercise and cognition through a more in-depth analysis based on previous studies. This result further confirmed the positive effect of exercise on cognitive health in Chinese older adults and added stronger evidence for delaying cognitive decline in the older individuals.