Abstract
Coronary artery disease (CAD) is the leading cause of death among adults worldwide. Accurate risk stratification can support optimal lifetime prevention. Current methods lack the ability to incorporate new information throughout the life course or to combine innate genetic risk factors with acquired lifetime risk. We designed a general multistate model (MSGene) to estimate age-specific transitions across 10 cardiometabolic states, dependent on clinical covariates and a CAD polygenic risk score. This model is designed to handle longitudinal data over the lifetime to address this unmet need and support clinical decision-making. We analyzed longitudinal data from 480,638 UK Biobank participants and compared predicted lifetime risk with the 30-year Framingham risk score. MSGene improves discrimination (C-index 0.71 vs 0.66), age of high-risk detection (C-index 0.73 vs 0.52), and overall prediction (RMSE 1.1% vs 10.9%) in held-out data. We also use MSGene to refine estimates of lifetime absolute risk reduction from statin initiation. Our findings underscore our multistate model’s potential public health value for accurate lifetime CAD risk estimation using clinical factors and increasingly available genetics toward earlier, more effective prevention.
Introduction
Coronary artery disease (CAD) remains the leading cause of morbidity and mortality worldwide^{1}. Estimating an individual’s risk of developing CAD over their lifetime is essential for timely and effective prevention and intervention^{2,3,4,5}. Traditional risk prediction models, such as the Pooled Cohort Equations (PCE) 10-year risk score, have guided clinical decisions and preventive strategies; however, these models come with inherent limitations^{6,7,8}. A 30-year or 10-year window provides only a fixed, albeit extended, snapshot of risk. It neither captures the entirety of an individual’s lifetime risk nor provides dynamic, age-specific insights beyond these arbitrary periods. Most importantly, there is a growing need for models capable of both recognizing undertreated younger patients and reducing overestimation of risk in older patients^{7,9,10}.
Current guidelines^{9,11,12} recommend the consideration of primordial risk factors in risk-stratifying patients, and call for better methods of estimating lifetime risk. Recent evidence suggests that lifetime risk assessment provides a more comprehensive picture of an individual’s propensity for developing CAD across time^{13,14}. Over a longer horizon, traditional factors in combination with genomic risk can confer a disproportionately elevated risk for CAD when compared to short-term static risk^{2,15,16,17}. For this reason, integrating genomic and traditional features into a lifetime risk assessment allows for more effective patient counseling, tailored preventive measures, and earlier interventions that may delay or prevent the onset of CAD altogether^{18,19}.
Because of the multifactorial nature of CAD, there is an increasing need for continuously updated, dynamic, and individualized CAD risk predictions that span a patient’s entire life^{2,14,20}. Understanding risk from this perspective allows for more informed and timely interventions, potentially even before the conventional risk windows are applicable.
Here we introduce the MSGene model, a multistate model designed to predict the lifetime risk of CAD, conditional on both time-invariant and time-dependent variables. Multistate models allow for the estimation of the risk of an individual transitioning between health states^{21,22,23,24,25} through flexible estimation of conditional probabilities by modeling the transitions between states over time. By modeling the different health states simultaneously, these models naturally account for competing risks.
MSGene is capable of modeling the dynamic transitions from risk factor states to CAD with age-specific coefficients. Critically, our approach differs from a traditional Markov-based multistate model^{21,22} by extending our model to the time-inhomogeneous case and allowing our transitions to vary with age; in addition, our model differs from traditional Cox models by allowing for non-proportional hazards.
In this study, we develop and validate the MSGene model in an application for estimating lifetime risk of CAD. We evaluate its performance compared to the traditionally employed Framingham 30-year^{26} and PCE 10-year^{5,6} models. Here we show the potential ability of MSGene to reduce CAD events by guiding timely initiation of statin therapy and demonstrate the benefit of a multistate framework to incorporate dynamic changes in treatment decisions for unique patient profiles.
Results
Novel multistate model with time-dependent transitions
We build a novel time-dependent multistate model in which age is the time scale.
For each age a and current state j (Fig. 1), we model the one-year probabilities of transition to state k for individual i at age a, \({\pi }_{{jkia}}\), as logistic regressions conditional on both time-invariant covariates (e.g., sex, CAD-PRS) and time-dependent covariates (e.g., smoking, use of antihypertensives or statins) (“Methods”, Supplementary Table 1). This methodology defines an inhomogeneous Markov transition model, which can be used to compute the probability of reaching any state of interest during one’s lifetime, among other quantities. Our transition model is Markovian, in that only the current state, and not earlier states, is needed to determine the probability of the next move. The probabilities of transitioning between health states are also allowed to vary over time and depend on covariates, which characterizes the model as “inhomogeneous”. Here, we focus on the lifetime risk of CAD.
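As an illustration of how an inhomogeneous Markov transition model yields a lifetime risk, the following minimal Python sketch chains one-year transition probabilities across ages. It is not the MSGene implementation: the function name, the nested-dictionary layout, the reduced state set, and every probability in the toy example are assumptions made for illustration only.

```python
from typing import Dict

# transitions[age][from_state][to_state] = one-year transition probability.
# This layout and the toy numbers below are hypothetical, not taken from MSGene.
Transitions = Dict[int, Dict[str, Dict[str, float]]]


def remaining_lifetime_risk(transitions: Transitions, start_state: str,
                            start_age: int, end_age: int,
                            target: str = "CAD", competing: str = "Death") -> float:
    """Probability of first reaching `target` between start_age and end_age."""
    occupancy = {start_state: 1.0}  # probability mass still event-free, by state
    reached = 0.0
    for age in range(start_age, end_age):
        next_occupancy: Dict[str, float] = {}
        for state, mass in occupancy.items():
            for to_state, prob in transitions[age][state].items():
                if to_state == target:
                    reached += mass * prob  # first arrival in the target state
                elif to_state != competing:
                    # Still event-free: carry the mass forward to the next age.
                    next_occupancy[to_state] = (
                        next_occupancy.get(to_state, 0.0) + mass * prob
                    )
                # Mass moving to the competing state leaves the risk set.
        occupancy = next_occupancy
    return reached


# Toy example: two ages, two event-free states, identical probabilities per age.
toy: Transitions = {
    age: {
        "Healthy": {"Healthy": 0.90, "Htn": 0.06, "CAD": 0.02, "Death": 0.02},
        "Htn": {"Htn": 0.92, "CAD": 0.05, "Death": 0.03},
    }
    for age in (40, 41)
}
risk_40_to_42 = remaining_lifetime_risk(toy, "Healthy", 40, 42)
```

Because the probabilities are indexed by age, the same loop accommodates age-varying coefficients, and mass lost to the competing state is never counted toward the target, which is how competing risks enter the calculation.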
We chose the set of covariates above for comparability with existing approaches. We also report results for smaller subsets of covariates in a sensitivity analysis (“Methods”, Supplementary Table 1). To improve estimation efficiency of state- and age-specific covariate coefficients, we smooth these coefficients across age using tricube distance-weighted least squares regression^{27,28}. This approach considers the reliability of each raw estimate by weighting adjacent ages by their proximity and their inverse variance, so that transitions with small sample sizes receive proportionately less weight. This shares information across a wide range of ages, which is especially useful when only a few individuals make a particular transition at certain ages. We calculate risk under statin-treated and statin-untreated strategies by imputing the relative risk reduction of statins using estimates from 24 clinical trials^{29} on each annual age-specific transition (“Methods”). We provide an interactive application for users to calculate CAD risk based on various covariates (https://surbut.shinyapps.io/risk/).
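The smoothing step can be sketched as a local linear fit in which each raw coefficient is weighted by a tricube kernel of its distance in age, multiplied by its inverse variance. The Python sketch below is an illustration under stated assumptions, not the authors' code: the function names, the local-linear form, and the 10-year bandwidth are all hypothetical choices.

```python
from typing import List, Sequence


def tricube(u: float) -> float:
    """Tricube kernel: (1 - |u|^3)^3 for |u| < 1, and 0 otherwise."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def smooth_coefficients(ages: Sequence[float], estimates: Sequence[float],
                        variances: Sequence[float],
                        bandwidth: float = 10.0) -> List[float]:
    """Smooth raw age-specific coefficients with a weighted local linear fit.

    Each weight is the tricube of the age distance (scaled by `bandwidth`,
    an assumed value) times the inverse variance of the raw estimate.
    """
    smoothed = []
    for a0 in ages:
        w = [tricube((a - a0) / bandwidth) / v for a, v in zip(ages, variances)]
        total = sum(w)
        mean_age = sum(wi * a for wi, a in zip(w, ages)) / total
        mean_est = sum(wi * b for wi, b in zip(w, estimates)) / total
        sxx = sum(wi * (a - mean_age) ** 2 for wi, a in zip(w, ages))
        sxy = sum(wi * (a - mean_age) * (b - mean_est)
                  for wi, a, b in zip(w, ages, estimates))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(mean_est + slope * (a0 - mean_age))  # fitted value at a0
    return smoothed
```

A raw estimate with a very large variance, as arises when few individuals make a given transition at a given age, contributes almost nothing to the fit, so its smoothed value is driven by neighboring ages.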
Baseline characteristics
We considered 480,638 individuals after excluding 20,534 who lacked quality-controlled genotypes or had CAD at baseline (Fig. 2): 260,653 (54.2%) were female, and there were 43,855 (11.1%) incident CAD diagnoses (Table 1), with a median of 29.9 years [22.4–35.1] of follow-up and a median age at first observation in the EHR of 24.3 years [IQR: 18.0, 37.1]. We visualize the proportional representation by risk factor at each age (Fig. 1) at both baseline and throughout: ~39.6% are ultimately diagnosed with hypertension, 23.6% with hyperlipidemia, and 9.9% with diabetes mellitus. Furthermore, 10.5% report currently smoking and 20.3% began antihypertensive use during the course of our study. General practice data linkage was available for 46.1% of the cohort, and sensitivity analysis showed the distribution of risk factors was homogeneous (Supplementary Table 2).
Model interpretation
We split the cohort of participants, with 80% of individuals in the training set (384,510 individuals) and 20% as a testing set (79,117 individuals) (Fig. 2). We report the lifetime risk remaining at any age for a given individual i of progressing to state k from state j from age A_{1} to A_{2}, where a indexes the current age and A_{2} is set conservatively at 80 (Eq. (1)).
Modeling transitions
Using our multistate approach, MSGene, we describe the overall state distribution across the lifespan in our cohort, pictured to exclude censoring at each age (Fig. 1) and also described above. At age 40 years, for example, 94.4% of individuals are in the healthy category and 0.3% in the CAD category before exclusions, with 4.1% in the hypertensive category. By age 76 years, CAD state occupancy peaks at 12.5% of uncensored individuals, and health is reduced to 27.6% of uncensored individuals. By age 80 years, 7.4% of all individuals enrolled have died.
Improved detection of early events when compared to 10-year risk
When compared to the PCE, a 10% lifetime threshold using MSGene uniquely identifies 5315 (59.3%) cases versus 123 (1.3%) cases using the 10-year PCE (5% threshold) alone at age 40. This reduces to <1% of cases at age 68 (vs 81% with PCE) (Supplementary Fig. 1). At age 40, MSGene showed substantially greater sensitivity for lifetime CAD events compared to PCE (event reclassification 58.2%, 95% CI 58.1–58.3%), at the cost of moderate inappropriate upclassification of lifetime non-events (non-event reclassification –37.3%, 95% CI 37.2–37.4%). At age 70, MSGene showed substantially greater specificity compared to PCE (non-event reclassification 32.1%, 95% CI 31.9–32.1%), at the cost of some inappropriate downclassification of events (event reclassification –12.5%, 95% CI –12.4 to –12.6%). Overall, reclassification was consistently favorable (median net reclassification index 0.12) over 40 years of consideration. Notably, MSGene identified more individuals with high genetic risk. Among individuals with predicted lifetime risk >10%, 9.7% (95% CI 9.6–9.8%) of individuals have PRS in the top quintile, while only 3.1% (95% CI 2.9–3.2%) have PRS in the lowest quintile (Supplementary Fig. 1).
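The event and non-event reclassification figures above follow the usual two-category net reclassification logic, which the minimal Python sketch below illustrates. The function name and the boolean-list inputs are assumptions for illustration; the thresholds that define "high risk" under each model (here, 10% lifetime risk and 5% 10-year risk) are applied before calling it.

```python
from typing import Sequence, Tuple


def net_reclassification(new_high: Sequence[bool], old_high: Sequence[bool],
                         event: Sequence[bool]) -> Tuple[float, float, float]:
    """Two-category net reclassification of a new model against an old one.

    event component:     P(up | event)       - P(down | event)
    non-event component: P(down | non-event) - P(up | non-event)
    Returns (event component, non-event component, their sum).
    """
    up_e = down_e = n_e = 0
    up_n = down_n = n_n = 0
    for new, old, had_event in zip(new_high, old_high, event):
        moved_up = new and not old      # newly flagged as high risk
        moved_down = old and not new    # no longer flagged as high risk
        if had_event:
            n_e += 1
            up_e += moved_up
            down_e += moved_down
        else:
            n_n += 1
            up_n += moved_up
            down_n += moved_down
    event_part = (up_e - down_e) / n_e
    nonevent_part = (down_n - up_n) / n_n
    return event_part, nonevent_part, event_part + nonevent_part
```

A positive event component reflects gained sensitivity, and a negative non-event component reflects lost specificity, which matches the pattern reported at age 40.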
Improved calibration when compared to 30-year risk score
MSGene demonstrated globally improved calibration when compared to FRS30RC. We compared the average predicted risk by sex and genomic risk strata with empirical overall incidence rates. In healthy individuals, the RMSE of MSGene is 1.06% (1.04% males, 1.09% females, SEM 0.06) while FRS30RC is 10.9% (12.1% males, 10.1% females, SEM 0.07, Supplementary Fig. 3). In contrast to MSGene, FRS30RC 30year risk increases monotonically across the lifespan. When restricting the analysis to ages 40 and 50 for whom 30 years of followup is available, the RMSE is 0.98% with MSGene when compared to 5.68% for FRS30RC. We further compute the RMSE starting from additional single risk factor phenotype states (hypertension, hyperlipidemia, and diabetes) across a subset of covariate choices (Supplementary Table 1). We found the improvement to be robust across states and covariate choices.
Dynamic effects of 10-year, 30-year, and remaining lifetime risk
MSGene allows for the estimation of survival curves for an individual starting from a given age, and for updated remaining lifetime curves evaluated over a range of ages. We compute the remaining lifetime risk and compare it with FRS30RC as recalibrated for our population^{30}. First, we depict the predicted survival curve for individuals of six different genetic and sex strata starting in the healthy state at age 40 (Fig. 3). Under this traditional analysis, CAD-free survival is projected to decline monotonically as a function of sex and genetic risk to 96.8% (95% CI 96.78–96.82) for a female in the lowest genetic strata and to 81.26% (95% CI 81.24–81.28) for a male in the highest genetic strata. However, a remaining lifetime risk curve reveals the opposite behavior: for example, a high genetic risk male has a 22.9% (95% CI 22.7–23.1%) risk without treatment at age 40, but the same high-risk male has only a 10.21% (95% CI 10.20–10.22%) risk of developing CAD if he remains CAD-free at age 70. This contrasts with the 10-year risk prediction, in which 10-year risk rises from 2.84% at age 40 to 10.21% at age 70 (Fig. 3 and Supplementary Data Tables 1–16). We compare this to FRS30RC projections^{26} and note that while remaining lifetime risk declines with age, the extended fixed window (FRS30RC) increases monotonically across age and genetic risk strata. In our cohort, the FRS30RC risk for a high genetic risk male rises from 13.4% at age 40 to 33.0% (Fig. 3) at age 70 using the recalibrated measure. When applying a trial-estimated relative risk reduction for statins to each annual transition probability^{29} (“Methods”, Eq. (6)) under MSGene lifetime projections, predicted absolute risk under treatment for the same high genetic risk male at age 40 improves from 22.86% (95% CI 22.85–22.87%) to 18.70% (95% CI 18.69–18.71%) over the 40-year span. This is compared to a smaller decline from 10.21% (95% CI 10.19–10.22%) to 8.25% (95% CI 8.24–8.26%) at age 70.
In general, MSGene assigns higher risks to younger individuals at high genetic risk, while FRS30RC, a 30-year fixed-window approach, recognizes older individuals regardless of genetic strata.
MSGene demonstrates improved dynamic projection on time-dependent transitions
An updated lifetime prediction, conditional on a patient’s current state, can be made per year, using age-specific coefficients. We use these updated predictions as covariates in a time-dependent extended model^{31,32} to evaluate the performance of our model on predicting time-to-event (“Methods”). Though the scores arise from a non-Cox model, since the Cox model score statistic is well defined for time-dependent covariates, the concordance is also well defined for a time-dependent risk score: at each event time the current risk score of the subject who failed is compared to the current (time-dependent) scores of all those still at risk. The Cox model is never used for estimating the MSGene transition probabilities themselves: it is only used in the model assessment stage to evaluate concordance in a time-dependent manner^{33}.
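The comparison described above can be made concrete with a small sketch: at each event age, the current score of the individual who failed is compared with the current scores of everyone still at risk. The Python below is a simplified illustration, not the evaluation code used in the study; the record layout (`exit_age`, `event`, `score`) is hypothetical, and individuals who exit at the same age as an event are treated as not comparable, which is a simplifying assumption.

```python
from typing import Dict, List


def time_dependent_concordance(subjects: List[Dict]) -> float:
    """Concordance for a risk score that is updated each year.

    Each subject is a dict with (hypothetical) keys:
      'exit_age': age at CAD event or censoring,
      'event':    True if CAD occurred at exit_age,
      'score':    dict mapping age -> risk score current at that age.
    """
    concordant = 0.0
    comparable = 0
    for case in subjects:
        if not case["event"]:
            continue
        t = case["exit_age"]
        case_score = case["score"][t]
        for other in subjects:
            # Compare only with individuals still at risk strictly after age t.
            if other is case or other["exit_age"] <= t:
                continue
            other_score = other["score"].get(t)
            if other_score is None:
                continue  # not yet under observation at age t
            comparable += 1
            if case_score > other_score:
                concordant += 1.0
            elif case_score == other_score:
                concordant += 0.5  # ties in score count as half
    return concordant / comparable if comparable else float("nan")
```

A value of 0.5 corresponds to scores that order individuals no better than chance at the event times, and 1.0 to scores under which every individual who fails has the highest current score in the risk set.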
We first consider the age distribution at which an individual first exceeded a lifetime risk threshold of 10% using MSGene or FRS30RC, or a PCE-derived 10-year risk threshold of >5%. Using MSGene to assess lifetime risk, 44.8% of individuals exceed this threshold at age 40, while 38.9% never do. With FRS30RC, 44.1% exceed this threshold at age 40, but virtually all (99.8%) exceed this threshold by age 80. Using the first age exceeded under each model as a time-dependent predictor of CAD status, we find that MSGene improves model concordance over FRS30RC by 21% (C-index 0.73 vs 0.52, P < 2.1 \(\times\) 10^{–140}) and over the 10-year score by 17.4% (C-index 0.55, P < 2.1 \(\times\) 10^{–103}) (Fig. 4a–d).
We then use the yearly time- and state-varying predictions as predictors in a time-dependent Cox proportional hazards model in which one’s score is recorded annually in non-overlapping intervals, and estimate the concordance of this model. The concordance of this time-dependent model using dynamic MSGene predictions exceeds that of the updated FRS30RC predictions (0.71 vs 0.66, P < 2.9 \(\times\) 10^{–17}) (Fig. 4e–g). We repeat these analyses using the subset with general practice (GP) records alone for both training (80%) and testing (20%), and the results hold for both the thresholding analysis (C-index 0.71 vs 0.53, P < 2 \(\times\) 10^{–16}) and the continuous time-dependent analysis (C-index 0.73 vs 0.67, P < 2 \(\times\) 10^{–16}, Supplementary Figs. 3 and 4).
Estimated benefit
Our model incorporates the estimated benefit of a treatment strategy that is imputed conditional on starting age and risk status. Using a randomized clinical trial (RCT)-imputed annual risk reduction of 20% for statins on statin-free individuals^{34,35}, we observe an inverse relationship between predicted 10-year risk and expected benefit. An individual with the highest genetic risk at age 40 has a predicted 10-year risk (4.2%, SD 0.01) roughly equivalent to the lowest genetic risk individual at age 70 (3.9%, SD 0.01), but an expected lifetime absolute risk reduction of 5% (SD 0.01) at age 40 versus only 0.8% (SD 5 \(\times\) 10^{–2}) at age 70 (Fig. 5). When we consider the distribution of all starting states, we see that the mean absolute risk reduction is the greatest for younger individuals (4.6–7.2%; SD 0.01) across risk states at age 40, to a mean absolute risk reduction of 0.3–3.5% (SD 0.01) at age 79.
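To see why a fixed annual relative risk reduction translates into a larger absolute benefit over a longer remaining horizon, consider the following minimal Python sketch. It assumes a single at-risk state with annual CAD and competing-death probabilities and applies the relative risk reduction to each annual CAD probability; the function names and all numbers are illustrative assumptions and do not reproduce the study's Eq. (6).

```python
from typing import Sequence


def lifetime_cad_risk(annual_cad: Sequence[float], annual_death: Sequence[float],
                      rrr: float = 0.0) -> float:
    """Cumulative CAD risk when each annual CAD probability is scaled by (1 - rrr)."""
    at_risk = 1.0      # probability of being alive and CAD-free at the start of a year
    cumulative = 0.0
    for p_cad, p_death in zip(annual_cad, annual_death):
        p_treated = p_cad * (1.0 - rrr)
        cumulative += at_risk * p_treated
        at_risk *= 1.0 - p_treated - p_death  # survive the year event-free
    return cumulative


def absolute_risk_reduction(annual_cad: Sequence[float],
                            annual_death: Sequence[float],
                            rrr: float = 0.20) -> float:
    """Untreated minus treated cumulative risk (20% RRR is the value quoted above)."""
    return (lifetime_cad_risk(annual_cad, annual_death)
            - lifetime_cad_risk(annual_cad, annual_death, rrr))


# Same hypothetical annual hazards, different remaining horizons.
arr_40_years = absolute_risk_reduction([0.005] * 40, [0.005] * 40)
arr_10_years = absolute_risk_reduction([0.005] * 10, [0.005] * 10)
```

With identical annual probabilities, the 40-year horizon yields the larger absolute risk reduction, which mirrors the observation that two individuals with similar 10-year risk can differ markedly in expected lifetime benefit.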
Improvement in discrimination over the cumulative horizon
When considering only the presence or absence of disease over observed time without regard to timing, prediction of cumulative occurrence using the updated MSGene lifetime score shows greater AUC-ROC than either FRS30 or FRSRC early in the life course (Supplementary Fig. 5) (0.69 vs. 0.65, P < 2 \(\times\) 10^{–16} at age 40), and also greater precision-recall performance (0.20 vs 0.16 at age 40, P < 0.01). Both metrics exceeded the estimation of lifetime risk using genetics as a predictor alone. In general, when comparing individuals captured by MSGene but not by FRS30RC, MSGene identified more women and individuals at higher genetic risk. With time, these differences were more profound (Supplementary Fig. 6).
External validation
We then performed external validation of MSGene in the Framingham Offspring (FOS) cohort, using first measurements to ensure optimal follow-up duration. FOS is a community-based cohort recruited in 1971 with a median 39 years of follow-up [IQR 38–40] and a median age of enrollment of 35 years [IQR 28–44] (Supplementary Fig. 7). Our analyses were on the subset of 2595 individuals who remained after applying exclusion criteria (Supplementary Fig. 7). MSGene again had favorable discrimination (age 40: 0.75 [95% CI 0.69–0.82] vs. 0.73 [95% CI 0.66–0.80]; age 55: 0.63 [95% CI 0.42–0.84] vs. 0.53 [95% CI 0.29–0.76]) and calibration (RMSE 8.4% vs. 11.3%, P < 2 \(\times\) 10^{–16}) when compared to FRS30 (Supplementary Fig. 8).
Discussion
Our study introduces a novel method called MSGene, which aims to assess the risk of developing CAD and other health states over an individual’s lifespan. We demonstrate that dynamic modeling of lifetime risk using longitudinal data and our novel multistate approach can improve both calibration and discrimination when compared with existing gold-standard approaches, such as the PCE and FRS30. Furthermore, incorporating genetic risk and the flexibility to estimate remaining lifetime risk improves the identification of younger individuals at high risk without overestimating risk in older adults, in contrast to existing fixed-window approaches^{6,30}. Our projected benefit analysis shows that this might result in a large reduction in preventable CAD events if statin therapy is guided by MSGene.
The technique utilizes generalized linear models (GLMs) to compute the transition probabilities between different states (e.g., from a healthy state or risk factor to CAD, death, or intermediate risk) for every age over the observed lifespan. The novelty derives from four features: (1) the provision of unique age-dependent models via GLMs that allow the relationship of each covariate on the outcome to vary freely with time; (2) the calculation of risk conditional on time-dependent states; (3) the assessment of a multistate model via time-dependent Cox modeling; and (4) the unique use of the UKB EHR as a comprehensive longitudinal data resource. The study follows individuals from adulthood through their enrollment in the linked health record. By incorporating age and time dependence, this method provides annual risk estimates over the lifespan, here focused on risk assessment from ages 40 to 80 years.
Over a lifetime horizon, the dynamic change in risk makes accurate lifetime risk estimations challenging^{4,7,11}. However, leveraging genetics and multistate modeling, MSGene enhances lifetime risk predictions. This effectively identifies individuals previously deemed low-risk. The model’s age-dependent features, producing age-sensitive coefficients, negate the need to rely on fixed parametric interactions between each covariate and time, a prevalent limitation in traditional models^{6}. We show that using updated estimates conditional on the dynamic state of an individual improves time-to-event prediction overall.
Through the incorporation of treatment effects, we show that those individuals with the greatest and least expected absolute risk reduction from statin therapy actually have a similar 10-year risk. However, this short-term focus is what current clinical methods rely upon^{7}. Presented effects are conservative as statin effects may magnify with duration and on CAD-PRS background^{19,36,37,38}.
Our approach facilitates accurate event prediction both for undercaptured young individuals and also lower-risk older individuals who might otherwise be included in a fixed-window approach that extends the time horizon: our median global net reclassification when compared with a 10-year approach is 12.2% [IQR 5.5–18.6%] over 40 years. This in part explains the improvement in overall time-dependent performance when incorporated into a time-to-event framework. Using a time-dependent evaluation, the distribution of the first age at which a lifetime threshold is exceeded demonstrates that MSGene optimally identifies at-risk individuals without indiscriminately calling all patients “at-risk”. However, future work is warranted to determine optimal thresholds of lifetime risk to maximize potential benefits among high-risk younger individuals while reducing unnecessary costs and harms to low-risk older individuals.
One of the strengths of our method is access to a significant history of electronic health records that allow us to derive estimates informed by a greater group of patients throughout the life course. Existing scores^{26,39} imply that the levels of covariates will stay fixed over the life course or require recalculation, which ignores the information within transitions through the life course. Here, our longitudinal outlook allows for individuals to be followed over a lifetime and quickly estimates what their updated risk trajectory would look like under an alternative profile.
Estimation of remaining lifetime risk is conducted using age-specific predictions informed only by individuals in the at-risk set at a given age, thus making this a true lifetime estimate. In our work, we choose a conservatively estimated age of 80 as the maximum lifetime age given the density of age estimation within our set. This estimation is possible under the assumption that risk trajectory is similar across shifting windows of age at risk but falls apart with strong calendar time trends. Given that our cohort was required to be between 40 and 69 years old in 2006, we reduced the variation in calendar effects^{5,40}.
When combined with genetic information, an emphasis on dynamically updated lifetime risk projections can uncover latent risks in seemingly healthy individuals. Determining an appropriate lifetime risk threshold is a laudable goal^{2,7}. Indeed, current guidelines^{12,40} note that genetic risk scores can identify individuals at birth with a high propensity to develop disease, but few approaches have coupled this information with realized risk stages dynamically. As age increases, short-term risk increases, and the remaining lifetime risk is reduced, meaning that a metric focusing on short-term risk will preferentially focus on disease in older individuals, thwarting the efforts of true prevention. It is not enough to increase the lifetime threshold to account for younger individuals as proposed in European Society of Cardiology guidelines; additional years add additional uncertainty, and thus, having tools capable of dynamically incorporating new information over the life course in combination with more comprehensive time assessments is critical to moving prevention forward. The current MSGene model is available as a risk assessment tool at https://surbut.shinyapps.io/risk, where users can compute lifetime risk of CAD based on different risk factor states and covariates (Supplementary Fig. 9). Critically, we also provide the code to rapidly refit and compute this model for a new cohort (https://github.com/surbut/MSGene).
In this study, we use a composite of phenotypic codes to define our risk factor states. One of the challenges of developing a lifetime assessment tool surrounds the availability of continuously updated laboratory data. Using EHR data, an unbiased ascertainment of updated biometric variables at uniform intervals is challenging. We added baseline continuous laboratory data from the age of enrollment to our grid search, and this added little to our model (Supplementary Fig. 10).
A second limitation surrounds the heterogeneity of phenotyping. We define hyperlipidemia and hypertension according to validated diagnostic codes^{41}. However, there exists heterogeneity in the severity and duration of these conditions. The potential benefit of adding additional states must be balanced with the uncertainty imposed and the reduction in sample size caused by dispersion across grades of each condition. Our model is capable of incorporating health history in state specification: this resolves the loss in underlying latent risk that is often erroneously captured in EHR data when an individual’s nominal laboratory value falls secondary to medication.
One of the advantages of heterogeneous data collection is a wealth of available phenotyping modalities: the UKBB has access through linkages to routinely available national health systems enhanced by self-report and previous records^{42}. Although not all individuals included had GP data, we demonstrate that the age and prevalence of conditions are homogeneous between individuals in the GP subset and otherwise (Supplementary Fig. 1) and that analysis on this subset alone results in similar model discrimination.
Third, the generalizability of our findings may be impacted by study design and sample specificity. The UK Biobank included healthier and less socioeconomically deprived individuals who were predominantly Caucasian individuals living in the United Kingdom^{43}. We provide detailed analysis in our supplement documenting that these results held by self-reported ethnicity (Supplementary Table 3). Furthermore, given that the minimum age for genotyping was 40 years old, we began our inference for risk modeling at age 40, provided they were captured in the EHR before then. Although individuals who reached age 40 prior to enrollment were appropriately at risk for the primary CAD outcome given their capture in the longitudinal EHR, they were protected from death until the time of enrollment, which may affect estimates related to the competing risk of death. For time-dependent evaluation of our prediction, we conservatively left-censored at age of enrollment to eliminate years protected from death and found that the improvements in discrimination over FRS30RC remained unchanged. We note consistent performance in external validation in the FOS cohort, where all death and CAD events occurred exclusively after enrollment. Finally, our dynamic logistic regression approach can readily be adapted to any population with minimal computational resources (https://github.com/surbut/MSGene) and we provide R code to do so.
Leveraging a unique resource of genetic and longitudinal clinical data spanning over 80 years in nearly 500,000 participants of the UK Biobank prospective cohort study, we develop MSGene, a multistate model for dynamic transitions throughout the life course to estimate lifetime risk of CAD. MSGene is well-calibrated and discriminates early and late events both in the UK Biobank and an external validation sample. We anticipate that by providing interpretable and dynamic estimates of CAD lifetime risk, MSGene may inform future therapeutic decisions to enable more efficient and effective CAD prevention throughout the life course.
Methods
Data source
The UK Biobank (UKB) is a prospective UK population-based study that enrolled approximately half a million adults aged 40–69 between 2006 and 2010, designed to investigate the genetic and lifestyle determinants for a wide range of diseases. Participants underwent genome-wide genotyping, with linkage to longitudinal hospitalization, primary care (GP), and self-report data dating back to 1940 (Fig. 2 and Supplementary Figs. 11 and 12)^{41}. Using the ukbpheno package (version 1.0)^{41}, we assembled detailed longitudinal data from the various sources documenting events from 1940 until December 2021 for 481,927 individuals after excluding 20,534 who lacked quality control genotyping or risk factor information (Fig. 2 and Supplementary Figs. 11 and 13). At the time of analysis, linkage to the United Kingdom General Practice (GP) Registry was available for a subset of 221,351 individuals. This assembly across data sources generated phenotypes for hypertension (Htn), diabetes mellitus (DM) (type 1 or 2), hyperlipidemia (Hld), or coronary artery disease (CAD) based on validated collections of hospitalization (HESIN), diagnostic, operation, general practice (GP) clinical and script as well as death information^{41}. We found high overlap between these phenotypes and our own lab’s previously generated HESIN-restricted phenotypes^{36,44} (Supplementary Fig. 13). These phenotypes subsequently became the risk factor states in our model. Informed consent was obtained from all participants, and secondary data analyses were approved by the Mass General Brigham Institutional Review Board 2021P002228. Secondary data analysis of UKB was performed under application number 7089.
Because of the longitudinal nature of this cohort, every individual is observed at first encounter with the electronic health record (EHR) in early adulthood (median age 24.2 years). We selected UKB participants free of CAD at age 40 and followed them until the occurrence of CAD, death, or loss to follow-up (median follow-up 29.9 years). We categorize individuals by their condition at entry into our cohort at age 40 years provided they have been observed in the EHR (Fig. 2). We then re-evaluate at each age the risk set as those individuals who (1) have been observed and (2) have not been censored for a given phenotype. We demonstrate the diversity of data sources and the corresponding availability of each data source over time for all considered phenotypes (Supplementary Fig. 12). In general, our model allows for the progression from CAD to death, but we report here the risk of progression to CAD in individuals who were CAD-free at baseline.
Polygenic risk
An additional novelty of our model is the incorporation of the dynamic effects of genetics over time. We use CAD polygenic risk score (PRS) as released through the UKB resource^{45} and compute on individuals with adequate genotype information after quality control and after controlling for the principal component axes obtained from the common genotype data in the 1000 Genomes reference data set using standard methods^{45}. Data supporting these scores were entirely from external GWAS data (the Standard PRS set) as conducted by Genomics PLC (Oxford, UK) under UKB project 9659^{45}. We demonstrate that the distribution of PRS is similar across entry age (Supplementary Fig. 14).
States and competing risk
Our multistate model features eight mutually exclusive states and restricts one-year transitions as follows (Fig. 1), with death as the final absorbing state from which one cannot exit. At any age across the life course, cumulative one-step transitions can be assessed (Fig. 1). Possible transitions are as follows:

1. Healthy to a single risk factor (Hypertension, Hyperlipidemia, Diabetes), CAD, or death (healthy to healthy is also allowed but not displayed for clarity);
2. Single risk factor to the corresponding double risk factor, CAD, or death;
3. Double risk factor to triple risk factor, CAD, or death;
4. Triple risk factor to CAD or death;
5. CAD to death.
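As an illustration only, the transition rules listed above can be encoded as a small lookup. This is a Python sketch rather than the R implementation used for the analyses; the state labels and the function names `label` and `allowed_transitions` are hypothetical, and we assume that remaining in the current state is permitted for every state.

```python
from itertools import combinations

# Hypothetical labels; the abbreviations follow the text (Ht, Hld, Dm).
RISK_FACTORS = ("Ht", "Hld", "Dm")


def label(factors):
    """Canonical state name for a set of risk factors ('Health' if empty)."""
    return "&".join(rf for rf in RISK_FACTORS if rf in factors) or "Health"


# Health, 3 single, 3 double, 1 triple risk factor state, then CAD and Death.
ALL_STATES = (
    ["Health"]
    + [label(set(c)) for r in (1, 2, 3) for c in combinations(RISK_FACTORS, r)]
    + ["CAD", "Death"]
)


def allowed_transitions(state):
    """States reachable within one year from `state` (assumption: staying is allowed)."""
    if state == "Death":
        return {"Death"}                  # absorbing: no exit
    if state == "CAD":
        return {"CAD", "Death"}
    current = frozenset() if state == "Health" else frozenset(state.split("&"))
    reachable = {state, "CAD", "Death"}
    for rf in RISK_FACTORS:
        if rf not in current:
            reachable.add(label(current | {rf}))  # gain exactly one risk factor
    return reachable
```

For example, `allowed_transitions("Ht")` contains the two double risk factor states that include hypertension, together with `"Ht"`, `"CAD"`, and `"Death"`.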
Predictions with age as the time scale
Our model inferences are made per year using the individuals who are in a particular risk state at a given age (Fig. 2 and Supplementary Fig. 9). Predictions can, therefore, be made over a requested time interval using the product of age-specific risks, for which coefficients were estimated from individuals who were in the at-risk subset during a given period. While enrollment in the UK Biobank required that an individual be alive at age 40 to enroll for genotyping, it did not require that the individual be risk factor-free, and we therefore use this information to assign individuals to risk categories for inference from age 40 onward. We exclude individuals with CAD at baseline from our predictions. After describing the model construction, we describe the computation and evaluation of state- and age-specific risks below.
Statistical analysis
Let \({\pi }_{{jkia}}\) represent the annual transition probability from state j to state k for individual i during year a. The states j and k represent phenotypes ascertained from the electronic health record. “From” states include Health; the single risk factor states Hypertension (Ht), Hyperlipidemia (Hld), and Diabetes Mellitus Type 1 and Type 2 (Dm); the double risk factor states Ht & Hld, Ht & Dm, and Dm & Hld; the triple risk factor state Dm & Hld & Ht; and Coronary Artery Disease (CAD). “To” states include all of the “From” states and Death. For our purposes, we report the progression to CAD or death from any of the “From” states.
For p covariates for a given individual i transitioning from state j to state k at age a, we model the log odds of the annual transition as:
\[\mathrm{logit}\left({\pi }_{{jkia}}\right)=\sum_{r=1}^{p}{\hat{\beta }}_{{jkar}}\,{x}_{{ir}}\]
where \({\hat{\beta }}_{{jkar}}\) represents the coefficient of variable r in the prediction of the transition probability from state j to state k at age a, and \({x}_{{ir}}\) is the value of covariate r for individual i at that age. Taking the inverse logit of the estimate returns the absolute risk for any individual i as a function of the age-specific coefficients and their p covariates, such that the annual risk estimate from state j to state k is given by:
\[{\pi }_{{jkia}}=\frac{\exp \left({\boldsymbol{X}}^{\prime }{\boldsymbol{\beta }}\right)}{1+\exp \left({\boldsymbol{X}}^{\prime }{\boldsymbol{\beta }}\right)}\]
Here we let \({{{{{\boldsymbol{X}}}}}}^{{{\prime} }}\) represent the 1 × P vector of an individual’s covariate profile at a given age and \({{{{{\boldsymbol{\beta }}}}}}\) represent the P × 1 vector of age- and state-specific smoothed coefficients. Smoothing is described in subsequent sections.
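As a minimal sketch of the inverse-logit step described above, the annual transition probability for one transition at one age can be computed from a coefficient vector and a covariate profile. This is illustrative Python, not the R code used for the analyses; the function name is hypothetical, and we assume here that the first coefficient is an intercept.

```python
import math


def annual_transition_probability(coefs, covariates):
    """Inverse logit of the linear predictor for one transition at one age.

    coefs      -- smoothed coefficients for this (from, to, age) triple;
                  assumption: coefs[0] is an intercept
    covariates -- the individual's covariate values, without the intercept
    """
    if len(coefs) != len(covariates) + 1:
        raise ValueError("expected one coefficient per covariate plus an intercept")
    eta = coefs[0] + sum(b * x for b, x in zip(coefs[1:], covariates))
    return 1.0 / (1.0 + math.exp(-eta))
```

A linear predictor of zero gives a probability of 0.5, and larger values of a covariate with a positive coefficient give a larger annual risk.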
When estimating Eq. (3), state j represents the “from” state and state k represents the “to” state. To account for censoring, an individual exits the “at risk” group for transition inference when they are lost to follow-up. We discretize age into 1-year intervals and independently estimate the age-dependent state-to-state transitions \({\pi }_{{jkia}}\). We have both fixed and time-varying covariates, and the effect of all covariates on transitions varies with age. Time-invariant covariates include sex and polygenic risk score (PRS). UKB assesses current smoker status at enrollment; subsequent change in smoking status is not sufficiently reliable for our purposes, so we use smoking as a time-invariant covariate for model estimation. Time-dependent covariates include both antihypertensive and statin prescriptions, which are re-evaluated each year using prescription data from the UKB^{46}. Our final prediction model allows for continuous updates of smoking and medication usage in evaluating age-specific transition probabilities. We use 80% of our data for training and 20% for testing (Fig. 2), which gives a training set of 384,510 samples for model fitting and a testing set of 79,117 unique individuals. Before carrying out the analysis, we selected these covariates for compatibility with the existing Pooled Cohort and Framingham 30-year equations^{6,26}. As a sensitivity analysis, we also report the results after removing certain covariates (Supplementary Table 1).
Predicted interval risk
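The predicted risk of progressing to state k from state j for individual i over any period ranging from age A_{1} to A_{2} is (Eq. (4)):
\[{\mathrm{Risk}}_{jki}\left({A}_{1},{A}_{2}\right)=1-\prod_{a={A}_{1}}^{{A}_{2}}\left(1-{\pi }_{{jkia}}\right)\]
where a indexes age within the interval. Accordingly, we compute the remaining lifetime risk for an individual i of progressing to state k from state j, where L is the maximum age of life and a is the currently observed age (Eq. (5)):
\[{\mathrm{Risk}}_{jki}\left(a,L\right)=1-\prod_{t=a}^{L}\left(1-{\pi }_{{jkit}}\right)\]
Here t runs over the ages from a to L. For our purposes, we choose L = 80 in line with the available data by age in the UK Biobank.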
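One reading of the product of age-specific risks described above is that interval risk is one minus the probability of avoiding the transition in every year of the interval. The Python sketch below illustrates that reading under stated assumptions: the function names are hypothetical, the annual probabilities are supplied by the caller, and both end ages are treated as inclusive.

```python
def interval_risk(annual_probs):
    """1 - prod(1 - pi_a) over the annual probabilities in the interval."""
    event_free = 1.0
    for p in annual_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("annual probabilities must lie in [0, 1]")
        event_free *= 1.0 - p
    return 1.0 - event_free


def remaining_lifetime_risk(probs_by_age, current_age, max_age=80):
    """Interval risk from current_age through max_age (assumption: inclusive).

    probs_by_age -- dict mapping integer age to the annual transition probability
    """
    ages = range(current_age, max_age + 1)
    return interval_risk([probs_by_age[a] for a in ages])
```

With a constant annual probability of 1%, for instance, `remaining_lifetime_risk` from age 40 equals one minus 0.99 raised to the number of years covered.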
The remaining lifetime risk can be modified to account for treatments by applying a constant relative risk reduction to the age-specific transition probabilities in Eq. (4). The interval risk under treatment can then be calculated using the per-year relative risk reduction RR of progressing to state k from state j over an interval from age A_{1} to A_{2} as (Eq. (6)):
\[{\mathrm{Risk}}_{jki}^{\mathrm{treated}}\left({A}_{1},{A}_{2}\right)=1-\prod_{a={A}_{1}}^{{A}_{2}}\left(1-\left(1-\mathrm{RR}\right){\pi }_{{jkia}}\right)\]
For the purposes of this manuscript, we are interested in state k = CAD. We impute the relative risk reduction of 0.20 from 27 trials of statin therapy^{29}. Within our model, we constrain each individual’s predicted probabilities across states per year to sum to one, such that for each age a, the probability of staying within the given state is the complement of the sum of transitions to the K alternative states:
\[{\pi }_{{jjia}}=1-\sum_{k\ne j}{\pi }_{{jkia}}\]
We take remaining in state j as the complement because its probability is mostly above 50%, and this constraint guarantees that, for a given age, the probabilities for an individual of a particular covariate profile sum to 1. The alternative of fitting a polytomous regression is computationally much more demanding and gives approximately the same answer. Here we report the product of these conditional one-step transitions from the healthy state as the state of most interest for primary prevention.
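The treatment adjustment and the sum-to-one constraint described above can be sketched as follows. This is illustrative Python with hypothetical function names; it assumes that a relative risk reduction of 0.20 scales each annual probability by 0.80, and that interval risk is one minus the product of annual event-free probabilities.

```python
def interval_risk(annual_probs, rrr=0.0):
    """Interval risk with each annual probability scaled by (1 - rrr).

    rrr -- constant relative risk reduction (0.0 means untreated)
    """
    event_free = 1.0
    for p in annual_probs:
        event_free *= 1.0 - (1.0 - rrr) * p
    return 1.0 - event_free


def absolute_risk_reduction(annual_probs, rrr=0.20):
    """Untreated minus treated interval risk."""
    return interval_risk(annual_probs) - interval_risk(annual_probs, rrr)


def stay_probability(exit_probs):
    """Probability of remaining in the current state for one year.

    exit_probs -- dict mapping each alternative state to its annual probability
    """
    total = sum(exit_probs.values())
    if total > 1.0:
        raise ValueError("exit probabilities sum to more than 1")
    return 1.0 - total
```

Because the reduction is applied to every year, the absolute risk reduction grows with the length of the interval and with the untreated annual risk, while the relative reduction stays fixed.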
Flexible smoothing of regression coefficients across ages
The unadjusted observed coefficients may be inherently noisy, and certain transitions may have low sample sizes. Therefore, we extract the unsmoothed coefficients \({\hat{\beta }}_{{jka}}\) for each age and state transition from the logistic regressions in Eq. (2). To borrow information across ages, we fit a locally estimated polynomial regression (LOESS)^{27,28} for each state-to-state transition and each covariate (Supplementary Fig. 15). The LOESS weights are proportional to the product of the inverse variance of each estimated coefficient and the tricube distance function of nearby ages. Under the tricube function, the weight of a neighboring age decays with the cube of its distance d from the age in question, scaled by the window length, where:
\[w\left(d\right)={\left(1-{\left|d\right|}^{3}\right)}^{3}\ \text{for}\ \left|d\right| < 1,\qquad w\left(d\right)=0\ \text{otherwise}\]
We consider the neighboring unsmoothed coefficients to be those within an adjusted window length; if the age in question is within 5 years of the minimum or maximum age, we extend the adjusted window by 5 years.
We then use weighted least squares regression to obtain the weighted sum of neighboring coefficients, where the design matrix X is the N × (degree + 1) matrix of polynomial terms in age for the N neighbors and y is the N × 1 vector of unsmoothed coefficients.
A vignette showing this process on a sample calculation is shown here https://surbut.github.io/MSGene/vignette.html. Furthermore, flexible window choices and polynomial degrees can be found here: https://surbut.shinyapps.io/testapp/. All analyses were performed with R (version 4.3.1). The underlying MSGene framework which could be adapted for other datasets is available at https://github.com/surbut/MSGene with accompanying vignettes. All plots are available at https://surbut.github.io/multistate2/index.html.
Distance weighting refers to the fact that, for each age point, neighboring data within a dynamic window are incorporated, with the window expanded at the age extremes to mitigate boundary bias. We then apply a tricube weight function that assigns higher weights to nearer neighbors, tapering to zero beyond the window, capturing the assumption of locality. The regression is stabilized by inverse variance weighting, emphasizing more reliable (less variable) observations, in line with the assumption that points with lower variance provide more accurate information. The design matrix, accounting for polynomial terms up to a specified degree, facilitates flexible modeling of the age-coefficient relationship without imposing a global functional form. This approach assumes that polynomials can locally approximate the age-related trends in coefficients and that these local fits can be aggregated to represent the global trend. It also presumes the initial coefficient estimates are sensible and variances are correctly specified for accurate weighting.
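The smoothing steps described above (window selection, tricube and inverse variance weights, local polynomial fit) can be sketched in a few lines. This is an illustrative Python version, not the R code used for the analyses; the function names and the default window of 10 years and degree 2 are hypothetical choices, and ages are centered at the target age so that the fitted intercept is the smoothed value.

```python
def tricube(u):
    """Tricube kernel on a scaled distance u; zero for |u| >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def solve(a, b):
    """Solve the small system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def smooth_coefficient(ages, betas, variances, target_age,
                       window=10, degree=2, edge=5):
    """Locally weighted polynomial estimate of a coefficient at target_age."""
    # Widen the window when the target is within `edge` years of either extreme.
    if target_age - min(ages) <= edge or max(ages) - target_age <= edge:
        window += edge
    rows, ys, ws = [], [], []
    for age, beta, var in zip(ages, betas, variances):
        w = tricube((age - target_age) / window) / var  # tricube x inverse variance
        if w > 0.0:
            rows.append([(age - target_age) ** k for k in range(degree + 1)])
            ys.append(beta)
            ws.append(w)
    size = degree + 1
    # Weighted normal equations: (X' W X) b = X' W y
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, ws)) for j in range(size)]
            for i in range(size)]
    xtwy = [sum(w * r[i] * y for r, y, w in zip(rows, ys, ws)) for i in range(size)]
    return solve(xtwx, xtwy)[0]  # intercept = fitted value at the target age
```

Calling `smooth_coefficient` once per age for a given transition and covariate yields the smoothed coefficient curve; a coefficient series that is exactly linear in age is returned unchanged, which is a convenient sanity check.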
Standard error of projection
We sample our training data with replacement (“bootstrap”) 1000 times and extract the corresponding means and standard errors of each projection across bootstrap iterations. We compute the remaining lifetime risk by setting the maximum age to 80, according to the density of observations in our training data. We impute a relative risk reduction (RR) in CAD from statins of 0.20^{34,47,48}; notably, the RR may be larger for some groups, such as those with elevated CAD-PRS^{36,49}, and for longer periods of time, and thus this reflects a conservative estimate^{50}. We apply this benefit only to individuals not previously on statins.
For the RMSE, we report the standard error of the mean across strata. For proportions, we report the standard error of the sample proportion as \(\sqrt{\hat{p}(1-\hat{p})/n}\), where \(\hat{p}\) represents the sample proportion and n the sample size.
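The two standard errors described above can be illustrated with a short Python sketch. The function names are hypothetical, and the bootstrap is shown for a generic statistic of a list of values; in the analysis itself each bootstrap iteration resamples the training data and recomputes the projection.

```python
import math
import random
import statistics


def bootstrap_mean_se(values, statistic, n_boot=1000, seed=0):
    """Mean and standard error of `statistic` across bootstrap resamples."""
    rng = random.Random(seed)
    n = len(values)
    estimates = []
    for _ in range(n_boot):
        resample = [values[rng.randrange(n)] for _ in range(n)]  # with replacement
        estimates.append(statistic(resample))
    return statistics.mean(estimates), statistics.stdev(estimates)


def proportion_se(p_hat, n):
    """Standard error of a sample proportion, sqrt(p(1 - p) / n)."""
    return math.sqrt(p_hat * (1.0 - p_hat) / n)
```

For example, `bootstrap_mean_se(risks, statistics.mean)` returns the bootstrap mean and standard error of an average predicted risk, where `risks` is any list of per-individual values.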
Performance metrics
For each age, we compare the average predicted score by genomic (<20%, 20–80%, and >80%) and sex strata, and report the root mean squared error (RMSE) computed from the differences between the average empirical and predicted cumulative incidence rates for each PRS and sex group.
For the area under the receiver operating characteristic curve (AUC-ROC) and precision-recall analysis, we compute the area under each curve using each score as a predictor of cumulative case or control status, computed using values for individuals at each year plotted.
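As a small worked example of the calibration metric described above, the RMSE can be computed from stratum-level averages. This Python sketch uses a hypothetical function name and assumes that the observed and predicted cumulative incidence have already been averaged within each PRS and sex stratum.

```python
import math


def rmse_across_strata(observed, predicted):
    """Root mean squared difference between observed and predicted incidence.

    observed, predicted -- dicts keyed by (prs_stratum, sex) holding the average
                           empirical and predicted cumulative incidence
    """
    if observed.keys() != predicted.keys():
        raise ValueError("observed and predicted must cover the same strata")
    squared = [(observed[k] - predicted[k]) ** 2 for k in observed]
    return math.sqrt(sum(squared) / len(squared))
```

Each stratum contributes equally in this sketch; weighting strata by size would be an alternative design choice.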
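For illustration, the area under the ROC curve for a score can be computed directly as the proportion of case-control pairs in which the case has the higher score, with ties counted as one half. This is a minimal Python sketch with a hypothetical function name; it is quadratic in the number of individuals and is not the routine used for the reported analyses.

```python
def auc_roc(scores, labels):
    """Rank-based AUC: P(score of a case > score of a control), ties count 0.5."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))
```

Here `labels` holds 1 for individuals who are cumulative cases by the age in question and 0 for controls.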
Comparison to 10year PCE and 30year Framingham CAD risks
For comparison of 10-year risk, we use the 2018 PCE with baseline covariates (total cholesterol, HDL-cholesterol, systolic blood pressure, and current smoking) obtained from UKB enrollment data and update each prediction^{26} with age, diabetes, and medication use according to available records. This technique was used in the Framingham 30-year risk development to validate new longer-window estimates, in which age was iteratively updated with all other risk factors held at their baseline values^{26}.
For comparison of calibration to 30-year risk, we used the 2009 complete (lipids, non-BMI) Framingham 30-year equation (FRS30) and update each prediction^{26} with age, diabetes, and antihypertensive use according to available records, consistent with detailed formulae within the FRS30. Given the differing populations, we recalibrated^{51} according to the mean levels of each covariate and baseline hazard in the UKB sample (FRS30RC). For fair comparison, we report our results against FRS30RC (Supplementary Fig. 16). Precision and discrimination analyses were performed as described above. We compute and display the predicted 30-year risk for individuals from ages 40–70 according to this model.
Agedependent model assessment
To evaluate model concordance, we first use the age- and state-specific predicted risk scores for each individual, which arise from our MSGene system of smoothed logistic regressions, as covariates in a time-dependent Cox model, in which an individual is featured in non-overlapping intervals with their respective score and event status. In the evaluation stage, we conservatively left censor individuals until enrollment. The way in which the test set is defined reflects the clinical application of the model. The Cox model is never used for estimating the MSGene transition probabilities themselves: it is only used in the model assessment stage to evaluate concordance in a time-dependent manner^{33}.
We also calculate the minimum age at which an individual would exceed prespecified risk thresholds for MSGene, PCE, and FRS30. We divide every individual’s observed trajectory into non-overlapping intervals, indicating when one or all thresholds are achieved and when an event occurs. For example, if an individual is observed from ages 40–70, exceeds one risk score at age 45 and the other at age 52, and has an event at age 68, his period of study will be divided into 4 intervals: the period from age 40 to 44, in which he exceeds the threshold with neither score; the period from 45 to 51, in which he exceeds the threshold only with score 1; the period from 52 to 67, in which he exceeds with both scores; and the period from 68 to the end of observation, in which he has had an event and exceeded with both scores. We fit independent time-dependent Cox models^{31} to this expanded data set, and again conservatively left censor until enrollment. For both analyses, we report the concordance index (Harrell’s C) with confidence intervals derived from 1000 bootstrapping iterations^{52}. Concordance asks, at each time interval, whether the level of one time-varying score exceeds the level of an alternative score for individuals with an event versus those without^{33}.
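The interval construction in the example above can be sketched as follows. This is illustrative Python with a hypothetical function name and output layout; intervals are half-open (start inclusive, stop exclusive), and the event flag marks intervals that begin at or after the event age, mirroring the wording of the example rather than any particular survival package.

```python
def split_intervals(start_age, end_age, crossing_ages, event_age=None):
    """Split [start_age, end_age) at threshold crossings and at the event.

    crossing_ages -- dict mapping score name to the first age at which that
                     score exceeds its threshold (None if it never does)
    """
    cuts = {start_age, end_age}
    for age in crossing_ages.values():
        if age is not None and start_age < age < end_age:
            cuts.add(age)
    if event_age is not None and start_age < event_age < end_age:
        cuts.add(event_age)
    bounds = sorted(cuts)
    rows = []
    for lo, hi in zip(bounds, bounds[1:]):
        rows.append({
            "start": lo,
            "stop": hi,
            "exceeds": {name: age is not None and age <= lo
                        for name, age in crossing_ages.items()},
            "event_occurred": event_age is not None and event_age <= lo,
        })
    return rows
```

For the individual in the example, `split_intervals(40, 70, {"score1": 45, "score2": 52}, event_age=68)` returns four rows with boundaries at 40, 45, 52, 68, and 70.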
Internal and external model assessment
We internally assess the RMSE (Supplementary Table 1) of models using a finite number of covariates (here sex, polygenic risk, and time-dependent smoking and antihypertensive use) for eight state-specific transitions built on a training set and independently assessed on our testing set. In addition, external validation was performed on individuals in the Framingham Heart Study Offspring cohort (FOS)^{53} for whom genetic data are available, by evaluating the model fits estimated in the UKB, with 10-year and lifetime risk estimates, on these individuals (Supplementary Fig. 7). This is a community-based Northeastern United States cohort that was recruited in 1971, with median age [IQR] 33.0 years [27.0, 41.0], and followed through 2013. Clinical data and incident disease for 3836 participants, and genetic data for a subset (2611), were available through the database of Genotypes and Phenotypes (dbGaP; accession phs000007.v33.p14). We compare these with the PCE and FRS30 (original score, calibrated for this population) estimates calculated at Exam 1 and compute the RMSE and AUC over the 30-year follow-up period. Informed consent was obtained from all participants, and secondary data analyses of dbGaP-based FOS and UKB were approved by the Mass General Brigham Institutional Review Board applications 2016P002395 and 2021P002228.
Calculating net reclassification
For net reclassification indices, at each age of consideration, we defined NRI_{event} as the net proportion of cases correctly reclassified by MSGene lifetime risk (MSGene_{LT} >10%) as compared to a 10-year PCE:
We defined NRI_{nonevent} as the net proportion of controls correctly reclassified by MSGene lifetime risk <10%:
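As an illustration of the net reclassification idea described above, the following Python sketch uses the conventional definition: among cases, the proportion moved up to high risk minus the proportion moved down; among controls, the proportion moved down minus the proportion moved up. The function name is hypothetical, and the high-risk flags are assumed to be precomputed (for example, MSGene lifetime risk above 10% for the new score, and the 10-year PCE above a threshold chosen by the user for the comparator).

```python
def net_reclassification(new_high, old_high, is_case):
    """Return (NRI_event, NRI_nonevent) from paired high-risk flags.

    new_high -- True where the new score classifies the individual as high risk
    old_high -- True where the comparator score does
    is_case  -- True for individuals with an event
    """
    up_case = down_case = n_case = 0
    up_ctrl = down_ctrl = n_ctrl = 0
    for new, old, case in zip(new_high, old_high, is_case):
        moved_up = new and not old
        moved_down = old and not new
        if case:
            n_case += 1
            up_case += moved_up
            down_case += moved_down
        else:
            n_ctrl += 1
            up_ctrl += moved_up
            down_ctrl += moved_down
    if n_case == 0 or n_ctrl == 0:
        raise ValueError("need at least one case and one control")
    nri_event = (up_case - down_case) / n_case
    nri_nonevent = (down_ctrl - up_ctrl) / n_ctrl
    return nri_event, nri_nonevent
```

Positive values of either component indicate that the new score reclassifies more individuals in the correct direction than in the wrong one.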
Marginal calculation
We also allow, for the absorbing states of CAD and death, the possibility of computing the probability of progressing through any path to CAD (“marginal”). The probability of progressing to state K from state J through any path over N years, starting at age a_{0}, is the (J, K) element of the product of N transition matrices T, in which the (j, k) element of matrix T_{ia} is the probability of progressing from state j to k at age a for an individual of covariate profile i:
\[{P}_{i}=\prod_{a={a}_{0}}^{{a}_{0}+N-1}{T}_{{ia}}\]
For absorbing states, the k,k probability is 1. This vignette is available at https://surbut.github.io/MSGene/usingMarginal.html.
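The marginal calculation described above amounts to multiplying the annual transition matrices in age order and reading off one element. The Python sketch below illustrates this with hypothetical function names and a toy three-state example; it assumes each matrix is row-stochastic and that the target state is absorbing in every matrix.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][m] * b[m][j] for m in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def marginal_probability(matrices, from_idx, to_idx):
    """Probability of reaching state to_idx from from_idx by any path.

    matrices -- one transition matrix per year of age, in age order; the target
                state must be absorbing (diagonal entry 1) in each matrix
    """
    n = len(matrices[0])
    product = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    for t in matrices:
        product = matmul(product, t)
    return product[from_idx][to_idx]


# Toy example with states (Health, one risk factor, CAD); values are made up.
toy = [[0.80, 0.15, 0.05],
       [0.00, 0.90, 0.10],
       [0.00, 0.00, 1.00]]
two_year_risk = marginal_probability([toy, toy], from_idx=0, to_idx=2)
```

In the toy example the two-year probability combines the direct path from Health to CAD in either year with the indirect path through the risk factor state.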
Data availability
All data from the UK Biobank (https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access) are made available to researchers from universities and other institutions with genuine research inquiries following institutional review board and UK Biobank approval. This research was conducted using the UK Biobank resource under application number 7089 and approved by the Mass General Brigham institutional review board. All data from the Framingham Offspring Study were made available from dbGaP (https://www.ncbi.nlm.nih.gov/gap/) to researchers from universities and other institutions with genuine research inquiries following institutional review board approval. Data described here were accessed using accession number phs000007.v32.p13. All data generated during this study are included in this published article and its supplementary information files.
Code availability
The code for running the MSGene model is available at https://github.com/surbut/MSGene. Vignettes for running the analyses are available at https://surbut.github.io/MSGene/vignette.html and https://surbut.github.io/MSGene/usingMarginal.html. A Shiny app for calculating interval risk is available at https://surbut.shinyapps.io/risk/. Code to reproduce all plots is available at https://surbut.github.io/multistate2/index.html.
References
Tsao, C. W. et al. Heart disease and stroke statistics—2023 update: a report from the American Heart Association. Circulation https://doi.org/10.1161/CIR.0000000000001123 (2023).
Lloyd-Jones, D. M. et al. Prediction of lifetime risk for cardiovascular disease by risk factor burden at 50 years of age. Circulation 113, 791–798 (2006).
Wilkins, J. T. et al. Data resource profile: the cardiovascular disease lifetime risk pooling project. Int. J. Epidemiol. 44, 1557–1564 (2015).
Bundy, J. D. et al. Cardiovascular health score and lifetime risk of cardiovascular disease. Circulation: Cardiovascular Quality and Outcomes https://doi.org/10.1161/CIRCOUTCOMES.119.006450 (2020).
Grundy, S. M. et al. 2018 AHA/ACC/AACVPR/AAPA/ABC/ACPM/ADA/AGS/APhA/ASPC/NLA/PCNA guideline on the management of blood cholesterol: executive summary. Circulation 139, e1082–e1143 (2019).
Yadlowsky, S. et al. Clinical implications of revised pooled cohort equations for estimating atherosclerotic cardiovascular disease risk. Ann. Intern. Med. 169, 20–29 (2018).
Navar, A. M. et al. Earlier treatment in adults with high lifetime risk of cardiovascular diseases: what prevention trials are feasible and could change clinical practice? Report of a National Heart, Lung, and Blood Institute (NHLBI) Workshop. Am. J. Preventive Cardiol. 12, 100430 (2022).
Jaspers, N. E. M. et al. Prediction of individualized lifetime benefit from cholesterol lowering, blood pressure lowering, antithrombotic therapy, and smoking cessation in apparently healthy people. Eur. Heart J. 41, 1190–1199 (2020).
Navar, A. M., Fonarow, G. C. & Pencina, M. J. Time to revisit using 10-year risk to guide statin therapy. JAMA Cardiol. 7, 785 (2022).
Zeitouni, M. et al. Performance of guideline recommendations for prevention of myocardial infarction in young adults. J. Am. Coll. Cardiol. 76, 653–664 (2020).
Lloyd-Jones, D. M., Albert, M. A. & Elkind, M. The American Heart Association’s focus on primordial prevention. Circulation 144, e233–e235 (2021).
2021 ESC Guidelines on cardiovascular disease prevention in clinical practice. Eur. Heart J. https://academic.oup.com/eurheartj/article/42/34/3227/6358713 (2021).
Berry, J. D. et al. Lifetime risks of cardiovascular disease. New Engl. J. Med. 366, 321–329 (2012).
Conner, S. C. et al. A comparison of statistical methods to predict the residual lifetime risk. Eur. J. Epidemiol. 37, 173–194 (2022).
Michos, E. D. & Choi, A. D. Coronary artery disease in young adults. J. Am. Coll. Cardiol. 74, 1879–1882 (2019).
O’Sullivan, J. W. et al. Polygenic risk scores for cardiovascular disease: a scientific statement from the American Heart Association. Circulation 146, e93–e118 (2022).
Inouye, M. et al. Genomic risk prediction of coronary artery disease in 480,000 adults. J. Am. Coll. Cardiol. 72, 1883–1893 (2018).
Sniderman, A. D. & Furberg, C. D. Age as a modifiable risk factor for cardiovascular disease. Lancet 371, 1547–1549 (2008).
Wang, N., Woodward, M., Huffman, M. D. & Rodgers, A. Compounding benefits of cholesterol-lowering therapy for the reduction of major cardiovascular events: systematic review and meta-analysis. Circulation: Cardiovasc. Qual. Outcomes 15, e008552 (2022).
Marma, A. K., Berry, J. D., Ning, H., Persell, S. D. & Lloyd-Jones, D. M. Distribution of 10-year and lifetime predicted risks for cardiovascular disease in US adults: findings from the National Health and Nutrition Examination Survey 2003 to 2006. Circ. Cardiovasc. Qual. Outcomes 3, 8–14 (2010).
Le-Rademacher, J. G., Therneau, T. M. & Ou, F.-S. The utility of multistate models: a flexible framework for time-to-event data. Curr. Epidemiol. Rep. 9, 183–189 (2022).
de Wreede, L. C., Fiocco, M. & Putter, H. mstate: an R package for the analysis of competing risks and multi-state models. J. Stat. Soft. 38, 1–30 (2011).
Brookmeyer, R. & Abdalla, N. Multistate models and lifetime risk estimation: application to Alzheimer’s disease. Stat. Med. 38, 1558–1565 (2019).
Neumann, J. T. et al. A multistate model of health transitions in older people: a secondary analysis of ASPREE clinical trial data. Lancet Healthy Longev. 3, e89–e97 (2022).
Jack, C. R. et al. Rates of transition between amyloid and neurodegeneration biomarker states and to dementia among non-demented individuals: a population-based cohort study. Lancet Neurol. 15, 56–64 (2016).
Pencina, M. J. et al. Predicting the 30-year risk of cardiovascular disease. Circulation https://doi.org/10.1161/CIRCULATIONAHA.108.816694 (2009).
Cleveland, W. S. Robust locally weighted regression and smoothing scatterplots. J. Am. Stat. Assoc. 74, 829–836 (1979).
Cleveland, W. S. & Devlin, S. J. Locally weighted regression: an approach to regression analysis by local fitting. J. Am. Stat. Assoc. 83, 596–610 (1988).
Cholesterol Treatment Trialists’ (CTT) Collaborators. et al. The effects of lowering LDL cholesterol with statin therapy in people at low risk of vascular disease: meta-analysis of individual data from 27 randomised trials. Lancet 380, 581–590 (2012).
Rospleszcz, S. et al. Validation of the 30-Year Framingham risk score in a German population-based cohort. Diagnostics 12, 965 (2022).
Therneau, T. M. (n.d.). Using Time dependent covariates and time dependent coefficients in the cox model [PDF file]. Retrieved from https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf.
Therneau, T. M. & Grambsch, P. M. Modeling Survival Data: Extending the Cox Model. (Springer New York, 2000).
Therneau, T. M. & Watson, D. A. The concordance statistic and the cox model. Technical Report No. 85, Department of Health Sciences Research, Mayo Clinic. (2017).
Cholesterol Treatment Trialists’ (CTT) Collaborators. et al. The effects of lowering LDL cholesterol with statin therapy in people at low risk of vascular disease: meta-analysis of individual data from 27 randomised trials. Lancet 380, 581–590 (2012).
Chou, R. et al. Statin use for the primary prevention of cardiovascular disease in adults: updated evidence report and systematic review for the US Preventive Services Task Force. JAMA 328, 754 (2022).
Natarajan, P. et al. Polygenic risk score identifies subgroup with higher burden of atherosclerosis and greater relative benefit from statin therapy in the primary prevention setting. Circulation 135, 2091–2101 (2017).
Thanassoulis, G., Sniderman, A. D. & Pencina, M. J. A long-term benefit approach vs standard risk-based approaches for statin eligibility in primary prevention. JAMA Cardiol. 3, 1090–1095 (2018).
Pencina, M. J. et al. The expected 30-year benefits of early versus delayed primary prevention of cardiovascular disease by lipid lowering. Circulation 142, 827–837 (2020).
Hippisley-Cox, J. et al. Predicting cardiovascular risk in England and Wales: prospective derivation and validation of QRISK2. BMJ 336, 1475–1482 (2008).
Arnett, D. K. et al. 2019 ACC/AHA guideline on the primary prevention of cardiovascular disease. J. Am. Coll. Cardiol. 74, e177–e232 (2019).
Yeung, M. W., Van Der Harst, P. & Verweij, N. ukbpheno v1.0: an R package for phenotyping health-related outcomes in the UK Biobank. STAR Protoc. 3, 101471 (2022).
Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
Fry, A. et al. Comparison of sociodemographic and health-related characteristics of UK Biobank participants with those of the general population. Am. J. Epidemiol. 186, 1026–1034 (2017).
Klarin, D. et al. Genetic analysis in UK Biobank links insulin resistance and transendothelial migration pathways to coronary artery disease. Nat. Genet. 49, 1392–1397 (2017).
Thompson, D. J. et al. UK Biobank release and systematic evaluation of optimised polygenic risk scores for 53 diseases and quantitative traits. Preprint at https://doi.org/10.1101/2022.06.16.22276246 (2022).
Darke, P. et al. Curating a longitudinal research resource using linked primary care EHR data—a UK Biobank case study. J. Am. Med. Inform. Assoc. 29, 546–552 (2022).
Ference, B. A. et al. Effect of long-term exposure to lower low-density lipoprotein cholesterol beginning early in life on the risk of coronary heart disease: a Mendelian randomization analysis. J. Am. Coll. Cardiol. 60, 2631–2639 (2012).
Ference, B. A. How to use Mendelian randomization to anticipate the results of randomized trials. Eur. Heart J. 39, 360–362 (2018).
Mega, J. et al. Genetic risk, coronary heart disease events, and the clinical benefit of statin therapy. Lancet 385, 2264–2271 (2015).
Marston, N. A. et al. Predicting benefit from evolocumab therapy in patients with atherosclerotic disease using a genetic risk score: results from the FOURIER trial. Circulation 141, 616–623 (2020).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Harrell, F. E. et al. Evaluating the yield of medical tests. JAMA 247, 2543–2546 (1982).
Feinleib, M., Kannel, W. B., Garrison, R. J., McNamara, P. M. & Castelli, W. P. The Framingham Offspring Study. Design and preliminary data. Prev. Med. 4, 518–525 (1975).
Acknowledgements
S.M.U. is supported by T32HG010464 from the National Human Genome Research Institute. S.K. is supported by the NIH (K23HL169839) and the American Heart Association (23CDA1050571). S.J.C. is supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health and Welfare, Republic of Korea (grant no.: HI19C1330). A.C.F. is supported by grants 1K08HL161448 and R01HL164629 from the National Heart, Lung, and Blood Institute. P.T.E. reported receiving grants from the NIH (1RO1HL092577, 1R01HL157635, and 5R01HL139731), the American Heart Association Strategically Focused Research Networks (18SFRN34110082), the European Union (MAESTRIA 965286), Bayer AG (to the Broad Institute), IBM Health (to the Broad Institute), Bristol Myers Squibb (to Massachusetts General Hospital), and Pfizer (to Massachusetts General Hospital). A.G. is supported by National Institutes of Health (NIH) grant nos R01CA227237, R01CA244569, and R21HG010748, and awards from the Claudia Adams Barr Foundation, Louis B. Mayer Foundation, Doris Duke Charitable Foundation, Emerson Collective, and Phi Beta Psi Sorority. P.N. is supported by grants R01HL1427, R01HL148565, and R01HL148050 from the National Heart, Lung, and Blood Institute, and grant 1U01HG011719 from the National Human Genome Research Institute. The authors would like to acknowledge Leslie Gaffney of the MITBroad Communications Lab for her invaluable graphics and copyediting advice.
Author information
Contributions
S.M.U., G.P., A.G., and P.N. conceived and designed the analysis. S.M.U. performed the analysis and wrote the software. S.M.U., G.P., and A.G. developed the statistical method. M.W.Y. provided critical data organization tools and critical confirmatory data analysis. L.T., S.K., and K.T. provided critical statistical insight. S.K., S.M.C., A.S., J.G., K.P., A.C.F., P.T.E., and P.N. provided critical insights into the clinical relevance of the findings. S.M.U. drafted the manuscript and M.W.Y., G.P., A.G., and P.N. provided critical revision. All authors contributed to the manuscript and approved the submitted version.
Ethics declarations
Competing interests
During the course of the project, M.W.Y. became an employee and stock owner of GSK. A.C.F. is co-founder of Goodpath. P.T.E. reports personal fees from Bayer AG, Novartis, and MyoKardia. G.P. holds equity in Phaeno Biotechnologies, is on the SAB of RealmIDX, and currently consults for Delphi Diagnostics. P.N. reports research grants from Allelica, Apple, Amgen, Boston Scientific, Genentech/Roche, and Novartis, personal fees from Allelica, Apple, AstraZeneca, Blackstone Life Sciences, Foresite Labs, Genentech/Roche, GV, HeartFlow, Magnet Biomedicine, and Novartis, scientific advisory board membership of Esperion Therapeutics, Preciseli, and TenSixteen Bio, scientific co-founder of TenSixteen Bio, equity in MyOme, Preciseli, and TenSixteen Bio, and spousal employment at Vertex Pharmaceuticals, all unrelated to the present work. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Johannes Neumann and the other, anonymous, reviewers for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Urbut, S.M., Yeung, M.W., Khurshid, S. et al. MSGene: a multistate model using genetic risk and the electronic health record applied to lifetime risk of coronary artery disease. Nat Commun 15, 4884 (2024). https://doi.org/10.1038/s41467-024-49296-9
DOI: https://doi.org/10.1038/s41467-024-49296-9