Development and validation of a hypertension risk prediction model and construction of a risk score in a Canadian population

Chowdhury, Mohammad Ziaul Islam; Leung, Alexander A.; Sikdar, Khokan C.; O’Beirne, Maeve; Quan, Hude; Turin, Tanvir C.

doi:10.1038/s41598-022-16904-x

Download PDF

Article
Open access
Published: 27 July 2022

Development and validation of a hypertension risk prediction model and construction of a risk score in a Canadian population

Mohammad Ziaul Islam Chowdhury^1,2,3,
Alexander A. Leung^1,4,
Khokan C. Sikdar⁵,
Maeve O’Beirne²,
Hude Quan¹ &
…
Tanvir C. Turin^1,2

Scientific Reports volume 12, Article number: 12780 (2022) Cite this article

2245 Accesses
5 Citations
1 Altmetric
Metrics details

Subjects

Abstract

Identifying high-risk individuals for targeted intervention may prevent or delay hypertension onset. We developed a hypertension risk prediction model and subsequent risk sore among the Canadian population using measures readily available in a primary care setting. A Canadian cohort of 18,322 participants aged 35–69 years without hypertension at baseline was followed for hypertension incidence, and 625 new hypertension cases were reported. At a 2:1 ratio, the sample was randomly divided into derivation and validation sets. In the derivation sample, a Cox proportional hazard model was used to develop the model, and the model's performance was evaluated in the validation sample. Finally, a risk score table was created incorporating regression coefficients from the model. The multivariable Cox model identified age, body mass index, systolic blood pressure, diabetes, total physical activity time, and cardiovascular disease as significant risk factors (p < 0.05) of hypertension incidence. The variable sex was forced to enter the final model. Some interaction terms were identified as significant but were excluded due to their lack of incremental predictive capacity. Our model showed good discrimination (Harrel’s C-statistic 0.77) and calibration (Grønnesby and Borgan test, $\chi^{2}$ statistic = 8.75, p = 0.07; calibration slope 1.006). A point-based score for the risks of developing hypertension was presented after 2-, 3-, 5-, and 6 years of observation. This simple, practical prediction score can reliably identify Canadian adults at high risk of developing incident hypertension in the primary care setting and facilitate discussions on modifying this risk most effectively.

Prediction model and assessment of probability of incident hypertension: the Rural Chinese Cohort Study

Article 27 February 2020

Development of a risk prediction score for hypertension incidence using Japanese health checkup data

Article 27 December 2021

Development of a risk prediction model for incident hypertension in Japanese individuals: the Hisayama Study

Article 31 May 2021

Introduction

Hypertension, which affects more than 1 in 5 Canadians¹, is a common medical condition and is the leading modifiable risk factor for preventable cardiovascular morbidity and mortality². Hypertension prevention and blood pressure management in hypertensive patients is considered a major public health concern³. For decades, the focus of interventions has been on improving hypertension detection, treatment, and control, but relatively little work has been done to promote primary prevention. Evidence suggests that the risk of progression to hypertension depends on several factors. Older age, female sex, increased body mass index (BMI), family history of hypertension, premature cardiovascular disease, sedentary lifestyles, unhealthy diet, and high sodium consumption are among the factors reported as significant predictors of hypertension⁴.

Screening people at greater risk of hypertension opens the possibility of promoting individualized preventive initiatives because we will know who to target, what to target, where to target, and how to target^5,6. A prediction model helps screen high-risk individuals by estimating their probability of developing hypertension within a particular time⁷. Over the past decades, many prediction models have been developed in different populations to predict incident hypertension^{8,9,10,11,12,13,14,15}, but their performance in accurately forecasting it varies. To the best of our knowledge, prediction models for the risk of incident hypertension that directly address the Canadian population have not yet been established. One method for predicting the risk of developing hypertension in Canadian populations was to use an existing model and evaluate its performance through external validation of the model. However, in light of the following considerations, we opted to construct a new model rather than externally validate an existing model. First, prediction models are determined by an equation that includes risk factors, risk coefficients (multiplying factors that assign an etiological weight to single factors), and the general population's survival probability or baseline risk without the disease¹⁶. These elements vary depending on the type of population, especially when very different cultures are compared (i.e., European countries and Asian countries). Second, each population has a different risk of contracting the disease, and each population may have a different distribution of risk factors that weigh differently in determining the disease¹⁶. Furthermore, the disease may occur with varying probability, resulting in a different survival rate without it. Third, heterogeneity in predictor effects (the same predictor may have different prognostic values in different populations), differences in outcome incidence, and differences in case-mix between the development and validation cohorts can all have a significant impact on a model's predictive performance and frequently result in poor performance when applied to a different population^17,18,19. Furthermore, many existing models were restricted to people of a specific ethnicity or those who were already at high risk, or only included a limited number of clinical variables⁴. Because of these facts, the performance of a prediction model can vary significantly by population. As a result, the prediction model's accuracy is frequently acceptable for that index population but is not necessarily generalizable to populations other than the one for which the model was developed¹⁶. We assessed this by applying a few published hypertension prediction models to our population and comparing their predictive performance. Prediction models cannot be transferred directly from one population type to another^20,21,22. The lack of a hypertension risk index specific to the Canadian population prompted us to create a new hypertension prediction model using one of Canada's largest cohort studies, which will aid local clinicians and healthcare providers in clinical decision-making, planning, and proper management of hypertension-related healthcare services.

Faced with the lowest national rates of blood pressure control in over a decade, effective strategies to identify Canadians at the highest risk of developing high blood pressure to prevent the onset of hypertension have become more relevant than ever¹. To this end, we created and internally validated a simple and practical risk prediction model for incident hypertension in the Canadian adult population. We also derived the point-based risk score from the developed model to facilitate clinical practice use for decision-making.

Methods

Study population

The study subjects were from Alberta’s Tomorrow Project (ATP) cohort data. ATP is a province-wide prospective cohort study and consists of Alberta’s residents, aged 35–69 years, without any history of cancer, other than non-melanoma skin cancer²³. ATP contains baseline and longitudinal information on socio-demographic characteristics, personal and family history of the disease, medication use, lifestyle and health behavior, environmental exposures, and physical measures. ATP was designed to be representative of healthy middle-aged adults in Alberta. A more detailed description of ATP and its recruitment process is provided in the supplementary material (Appendix 1).

Our study cohort consists of 25,359 participants who completed ATP’s CORE questionnaire and consented to have their data linked with Alberta’s administrative health data. Linking with administrative health data was done to establish the necessary longitudinal follow-up to determine hypertension incidence. We excluded 6996 participants from the analysis who had hypertension at baseline and did not meet eligibility criteria (i.e., free of hypertension at baseline). We also excluded 41 participants who responded to hypertension status questions at baseline as “don’t know” or “missing”. Eighteen thousand three hundred twenty-two participants were included in the final analysis.

Selection of candidate variables

Before commencing the analysis, we compiled a list of available potential candidate variables. We determine the possible candidate variables for inclusion in model development based on a literature search^4,24, variables that have been used in the past²⁵, and discussion with content experts. For this study, we considered 29 candidate variables for inclusion in the model. We deliberately did not consider genetic risk factors/biomarkers as potential candidate variables given our model’s intended clinical application. Inclusion of the genetic risk factors in the model can reduce the model’s usability due to a lack of readily available information.

Definition of outcome and variables

The outcome incident hypertension was determined from linked administrative health data using a coding algorithm. We used the relevant ICD-9 and ICD-10 codes (ICD-9-CM codes: 401.x, 402.x, 403.x, 404.x, and 405.x; ICD-10-CA/CCI codes: I10.x, I11.x, I12.x, I13.x, and I15.x) and a validated hypertension case definition (two physician claims within two years or one hospital discharge for hypertension) to define hypertension incidence (sensitivity 75%, positive predictive value 81%)²⁶.

Out of 29 candidate variables, 11 were continuous, and 18 were categorical. Continuous variables remained continuous in the model developed and categorized only for deriving risk scores. A detailed description of the variables and their categorization is provided in the supplementary material (Appendix 2).

Missing values

Our dataset has missing values on several candidate variables ranging from 0 to 26%. Information on missing values for different candidate variables is presented in the supplementary table (Table S1). We used multiple imputation for missing data²⁷. This technique predicts the missing values by utilizing the existing information from other available variables²⁸ and then substitute the missing values with the predicted values to create a complete dataset. Multiple imputation by chained equations (MICE) was used to impute the missing values using Stata’s “ice” command²⁹.

Statistical analysis

Before imputing missing values, the required assumption “missing at random” for performing multiple imputations was checked. We compared the study characteristics of those with missing with those without missing information using appropriate tests (unpaired t-test or the χ²-test). Continuous variables were expressed as the mean (SE), and categorical variables were expressed as numbers (percentage of the total). We randomly split subjects into two sets: the derivation set, which included 67% (two-thirds) of the sample (n = 12,233), and the validation set, which included the remaining 33% (one-third) (n = 6089). The two groups’ baseline characteristics were compared using the unpaired t-test or the χ²-test, as appropriate. We developed a risk prediction model from the derivation data using the multivariable Cox proportional hazards model and assessed the goodness of fit using the validation data.

Collinearity among the variables was tested using the variance inflation factor (VIF) with a threshold of 2.5³⁰. From the list of candidate variables, highly correlated variables were excluded based on VIF before applying the model.

The univariate Cox proportional hazards model was applied first to screen the variables for a significant association (p < 0.20)³¹ with hypertension incidence in the derivation set. Variables identified as significant in the univariate association were later put into a multivariable Cox proportional hazards model to determine ultimate significant risk factors (p < 0.05) of incident hypertension. The interaction terms were also tested, with significant variables identified in the multivariable Cox model. During the model development process, the proportional hazard assumption associated with the Cox model was also tested. There are several methods for verifying proportionality assumption, and we tested the proportionality assumption by using the Schoenfeld and scaled Schoenfeld residuals. We tested the proportionality of the model as a whole and proportionality for each predictor.

The following general equation was used to calculate the risk of incident hypertension within time $t$:

$$ Probability = 1 - S_{0} \left( t \right)^{{{\text{exp}}\left( {\mathop \sum \limits_{i = 1}^{p} \beta_{i} X_{i} - \mathop \sum \limits_{i = 1}^{p} \beta_{i} \overline{X}_{i} } \right)}} $$

where $S_{0} \left( t \right)$ is the baseline survival function, assuming all variables are represented by average values at follow-up time $t$; $\beta_{i}$ is the estimated regression coefficient of the $i$th variable; $X_{i}$ is the value of the $i$th variable; $\overline{X}_{i}$ is the corresponding mean, and $p$ denotes the number of variables.

In the validation data, the model’s predictive performance was assessed. Model discrimination was evaluated using Harrell’s C-statistic³². Harrel’s C-statistic indicates the proportion of all pairs of subjects that can be ordered such that the subject who survived longer will have the higher predicted survival time than the subjects who survived shorter, assuming that these subject pairs are selected at random. Calibration was assessed using the Grønnesby and Borgan (GB) test³³. The GB test is an overall goodness-of-fit test for the Cox proportional hazards model and is based on martingale residuals. In the GB test, the observations are divided into K groups according to their estimated risk score, an approach similar to Hosmer and Lemeshow goodness-of-fit for logistic regression³⁴. Brier score was calculated at different time points, and a calibration plot was also used for assessing calibration. In a calibration plot, expected probabilities (predicted probabilities from the model) are plotted against observed outcome probabilities (calculated by Kaplan–Meier estimates). Arjas like plots were used for assessing the goodness of fit graphically³⁵. We also produced histograms of the prognostic index (a linear predictor of the Cox model) to show the prognostic index distribution in the derivation and validation data set. We also assessed calibration using the approach proposed by Royston P³⁶, where observed (Kaplan–Meier) and predicted survival probabilities compared in some prognostic groups derived by placing cut points on the prognostic index. We defined three risk groups (good, intermediate, and poor) from the 25th and 75th centiles of the prognostic index in the derivation dataset based on events.

We then created a point-based scoring system from the model so that it can be easily used in clinical practice. Integer points were assigned according to the presence/absence of each risk factor so that the overall risk can be estimated by summing the points together. We constructed the risk score utilizing the regression coefficients of our Cox model according to the method proposed by Sullivan et al.³⁷. To facilitate calculating risk score, continuous variables considered in the model development were divided into categories as discussed before.

All statistical tests were two-sided. All statistical analyses were performed using Stata (Version 15.1; Stata Corporation, College Station, Texas 77845, USA).

Comparing existing model performances to the developed model

We used a few existing hypertension risk models in our dataset to explain how our developed model performed in comparison to those models when those were applied to our population. Model selection was primarily made based on the availability of the final variables considered in those selected models in our dataset. This eliminated the majority of existing models from consideration for validation in our data set. In addition, whether the model provided enough information to perform the validation was also considered a major factor in selecting a model. For example, if a model did not provide regression coefficients or hazard or odds ratios from which coefficients can be derived were excluded from consideration. Also, if a model did not provide the predictive performance of their models, such as discrimination or calibration, they were excluded from considerations. In this case, we won’t be able to compare the validated model’s predictive performance in our dataset. Considering the aforementioned factors and information from our recent systematic review³⁸, we selected the models by Parikh et al.¹⁵, Kivimӓki et al.³⁹, Lim et al.¹⁰, Chien et al.¹⁴, and Wang et al.¹² for validation in our dataset. The model by Parikh et al.¹⁵, also known as the Framingham Risk Score (FRS), was developed in the United States in a predominantly White population, with age, sex, systolic blood pressure (SBP), diastolic blood pressure (DBP), BMI, parental hypertension, cigarette smoking, and age by DBP as final variables. Kivimӓki et al.³⁹ developed the Whitehall II Risk Score in England in a predominantly White population. In model construction, the same FRS variables were used. Lim et al.¹⁰ developed their model in Korea in an Asian population using the same variables as FRS. Chien et al.¹⁴ developed their model in Taiwan among the ethnic Chinese population. Two models were created, and we validated their clinical model using age, gender, BMI, SBP, and DBP as the final variables. Wang et al.¹² developed their model in China with a rural Chinese population. The final variables in the model were age, parental hypertension, SBP, DBP, BMI, and age by BMI. The final variables considered in these models were available in our dataset.

This study’s ethics was approved by the Conjoint Health Research Ethics Board (CHREB) at the University of Calgary, and all methods were performed in accordance with the relevant guidelines and regulations. Informed consent was waived by the CHREB (REB18-0162_REN2) because the dataset used in this study consisted of de-identified secondary data released for research purposes.

Patient consent

Not required. The manuscript is based on the analysis of secondary de-identified data. Patients and the public were not involved in the development, design, conduct or reporting of the study.

Results

Baseline characteristics of the study participants are presented in Table 1 and supplementary table (Table S2). In Table 1, the study participants’ characteristics are given for the entire cohort as well as compared according to the status of developing hypertension. In contrast, in Table S2, characteristics are compared between the derivation sample and the validation sample. Overall, the study participants’ mean age was 50.99 years, and the participation of females (68.55%) in the study was higher than the males (31.45%). During the median 5.8-year follow-up, 625 (3.41%) participants developed hypertension. In Table 1, most of the study characteristics were significantly different (p < 0.05) between those who developed hypertension and those who did not. Those who developed hypertension were relatively older, had higher (average) BMI, DBP, SBP, and more with diabetes and cardiovascular disease. The proportions of males and females were also significantly different between these two groups. However, some study characteristics were similar with no statistically significant difference (p > 0.05), including ethnicity, family history of hypertension, alcohol consumption, and total physical activity time. When we randomly divided the data into derivation and validation sets (Table S2), the study characteristics were similar with no significant difference (p < 0.05) between the derivation and validation sample except BMI waist ratio.

Table 1 Baseline characteristics of study participants according to the status of developing hypertension or not.

Full size table

From the list of candidate variables, six (ever smoked, hip circumference, body fat percentage, BMI waist ratio, waist circumference, diastolic blood pressure.) were excluded from the model building due to their high collinearity (threshold VIF > 2.5) with other variables. Comparing the study characteristics between the missing and imputed is presented in the supplementary table (Table S3).

In the derivation sample, most of the candidate variables used in our study were identified as significant univariate predictors (Table 2). Variables not significantly associated with incident hypertension in univariate models were excluded from the multivariable model. In the multivariable model, age, sex, BMI, SBP, diabetes, CVD, total physical activity time, depression, waist-hip ratio, residence, highest education level completed, working status, total household income, family history of hypertension, smoking status, total sleep time, vegetable and fruit consumption, and job schedule was included. The multivariable Cox model indicated that age, BMI, SBP, diabetes, total physical activity time, and cardiovascular disease were independent risk factors of incident hypertension (Table 2). We forced sex into the model, considering its clinical importance. The following interaction terms were added to the model with other significant variables in the multivariable Cox model: age by BMI, age by SBP, age by diabetes, age by CVD, age by total physical activity time, age by sex, BMI by sex, SBP by sex, diabetes by sex, CVD by sex, and total physical activity time by sex. When the interaction terms were included in the model, age by sex, age by BMI, age by SBP, age by total physical activity time, sex by SBP, and sex by CVD showed significant association with incident hypertension (Table 3). However, the inclusion of these interaction terms did not improve the models’ discriminative performance. The models with and without interaction terms were virtually identical regarding their Harrel’s C-statistics value (0.77 and 0.77, respectively) and statistical significance (p = 0.64). Consequently, the interaction terms were excluded from the finally selected model. The model with only main effects was used in subsequent analyses to construct a simpler and more user-friendly risk estimation equation and risk score. A global test for Cox proportional hazards assumption indicated no violation of assumptions (p = 0.72) (Supplementary Table S4). The baseline survival function at median follow-up time 5.80-years ≈ 6-years, $S_{0} \left( 6 \right)$ was (0.977). In the derivation sample, the model’s discriminative performance (Harrel’s C-statistic) was 0.77.

Table 2 Unadjusted and adjusted hazard ratios for the risk factors of hypertension incidence.

Full size table

Table 3 Regression coefficients and hazard ratios for incident hypertension.

Full size table

When we applied our derived model in the validation sample, the model’s discriminative performance was good (Harrel’s C-statistic 0.77). The results of the GB test indicated an acceptable calibration of the risk prediction model ($\chi^{2}$ statistic 8.75, p = 0.07, Fig. 1). To compare the observed and expected events in each group based on risk score, Arjas like plots are also presented (Fig. 2). A calibration plot of our prediction model at a time of 6-years was also presented in Fig. 3. A calibration slope of 1.006 indicates that predicted probabilities do not vary enough⁴⁰. Figure 4 represents the calibration of our model in the derivation and validation datasets. The calibration of the model looks good in each dataset. The predictions in the validation dataset are good for both “Good” and “Intermediate” risk groups where survival and predicted probabilities are quite similar, except slightly higher predictions between 6- and 14-years time intervals for the “Intermediate” group. The predictions in the “Poor” group are consistent with the survival up to year six and somewhat high later; that is, survival tends to be worse than predicted. Due to fewer validation data events, the confidence intervals tend to be wider in validation data than in the derivation data. Figure 5 presents the prognostic index histogram in derivation and validation data, and no obvious irregularities and outliers were detected. Brier score calculated at 4-year, 5-year, 6-year, and 7-year time points are 0.018, 0.021, 0.026, and 0.029, respectively indicating accurate predictions.

Finally, from the developed model, a simple and practical risk score was created to calculate the risk of incident hypertension at different times (2-year, 3-year, 5-year, and 6-year) (Table 4). The constant for the points system or the number of regression units that will correspond to one point was set as the risk associated with a 5-year increase in age. To score a continuous variable, the range of possible values of the variable was divided into appropriate categories to enable the allocation of points to the selected categories. To determine the reference values for the open-ended categories (e.g., <or>), we used the 1st percentile and the 99th percentile of that variable to minimize the influence of extreme values. The points were initially computed as a decimal value, but later rounded to the nearest integer for facile calculation. The approximate risk of incident hypertension was then estimated via summation of the points awarded to each of the items. We attach the risks associated with each point total using the Cox regression equation (Table 5). Finally, we created risk categories according to the total points. In our model, the maximum total point is 40, and the minimum is − 2. For simple interpretation in a clinical setting, we categorize estimated risk into three categories and presented in Table 6.

Table 4 Calculation of point values for risk score

Full size table

Table 5 Risk estimates for point totals at 2, 3, 5, and 6-year time.

Full size table

Table 6 Risk categories based on total points.

Full size table

Case study

A 50-year-old male with BMI 28.5, SBP 135, diabetic, no CVD, and moderate physical activity (850 MET minutes/week).

Risk factor	Value	Points
Age	50	2
Sex	Male	0
BMI	28.5	3
SBP	135	10
Diabetes status	Yes	4
CVD status	No	0
Physical activity	Moderate (850 MET minutes/week)	− 1
Point total		18
The estimate of risk (6-year)		7.31

The risk estimate based on our newly developed Cox model is computed as follows:

$$ \begin{aligned} \mathop \sum \limits_{i = 1}^{7} \beta_{i} X_{i} & = 0.02768\left( {50} \right) + 0.08722\left( 0 \right) + 0.05147\left( {28.5} \right) + 0.04629\left( {135} \right) \\ & \quad + 0.57066\left( 1 \right) + 1.08710\left( 0 \right) - 0.00003\left( {850} \right) = 9.645205 \\ \mathop \sum \limits_{i = 1}^{7} \beta_{i} \overline{X}_{i} & = 0.02768\left( {50.94} \right) + 0.08722\left( {0.3142} \right) + 0.05147\left( {26.48} \right) + 0.04629\left( {119.75} \right) \\ & \quad + 0.57066\left( {0.041} \right) + 1.08710\left( {0.021} \right) - 0.00003\left( {3157.97} \right) = 8.2950638 \\ \hat{p} & = 1 - S_{0} \left( t \right)^{{\exp \left( {\mathop \sum \limits_{i = 1}^{7} \beta_{i} X_{i} - \mathop \sum \limits_{i = 1}^{7} \beta_{i} \overline{X}_{i} } \right)}} = 1 - 0.977^{{\exp \left( {9.645205 - 8.2950638} \right)}} = 0.085 \\ \end{aligned} $$

The points system gives a 6-year risk of incident hypertension of 7.3%, while employing the Cox model directly gives an estimate of 8.5%.

Existing models' performances in our dataset

We compared the models’ predictive performance using the most commonly reported predictive performance metric, the C-statistic. Table 7 shows the C-statistics from the original model and the C-statistics when the models were applied to our dataset. All models’ performances were lower in our population than their original predictive performance. Figure 6 compares our model’s predictive performance (C-statistic) to the five validated models. Our model’s better predictive performance was observed, which supports the creation of our new prediction model for the Canadian population.

Table 7 The predictive performance of some of the past published hypertension prediction models in our dataset.

Full size table

Discussion

In this large prospective cohort study, we developed a simple model to predict the risk of developing hypertension incidence in Canadian adults. The variables included in our model (age, sex, SBP, BMI, diabetes, cardiovascular disease, and self-reported total physical activity time) are routinely and easily assessed in the primary-care clinical setting. Our prediction model for hypertension risk had very good discrimination and calibration for both the derivation and validation samples, suggesting that this model has good performance and may perform well when applied to a different Canadian population. Also, a risk score table was derived for clinical implementation and workability of the developed model. Derived point-based score where points assigned to each variable is easy to administer by health care professionals and the general population and can guide clinical counseling and decision making.

The predictive performance of our model was similar to other studies. Although prediction models’ performance varies considerably across studies, our recent meta-analysis on the predictive performance of hypertension risk prediction models indicates an overall pooled C-statistic of 0.75 [95% CI: 0.73–0.77]⁴¹, which justifies our model’s good predictive performance. Framingham hypertension risk score¹⁵, the most validated hypertension risk prediction model, had a C-statistic of 0.78, similar to our model. Our model’s calibration was also right on several performance measures.

Most of the variables included in our final model are consistent with other previous studies (Supplementary Fig. S1). The variable sex was not identified as a significant factor in our model, but we forced it into the model considering its clinical implication⁴². Diabetes and CVD were the two significant risk factors in our model, often excluded by many studies. Individuals who have diabetes or CVD have a higher risk of developing hypertension than those free of these conditions. Our risk prediction model aimed to identify the risk factors for hypertension in adults but excluding people with diabetes and CVD would limit our results’ generalizability. To develop a risk prediction model applicable to as many individuals as possible, we considered diabetes and CVD subjects in model building. Smoking, alcohol consumption, and family history of hypertension are common risk factors used in the past hypertension risk prediction models (Supplementary Fig. S1). In our study, these risk factors were not identified as significant. Their inclusion in the model also did not change the model’s discriminative performance (Harrel’s C-statistic remains the same as 0.77). We identified total physical activity time significantly contributes to our model. This finding is significant because exercise is considered a preventive factor for hypertension incidence supported by scientific evidence⁴³. Moreover, it is a highly modifiable lifestyle factor, and physical activity changes can modify the status of hypertension incidence.

We assessed interaction effects in our model, and several of the interaction terms were identified as significant. However, inclusion of interaction terms in the model did not improve the model’s predictive performance. Our focus was on generating a simple and user-friendly risk scoring algorithm avoiding complexity. As a result, the interaction terms were excluded from the model in final considerations.

To our knowledge, this is the first hypertension risk prediction model developed explicitly in a Canadian population. The model was created using a large sample size, and the estimates from our prediction models were found to be stable, as demonstrated in the internal validation. Further, consideration of many candidate variables in model building is also a strength of this study. In contrast to most studies, where models were developed in complete cases, excluding those with missing values, we imputed missing values in our study. This approach prevented information loss, maximized information utilization, and made the results robust.

We could have used an existing model and evaluated its performance through external validation in our dataset before creating a new risk score. However, we refrain from doing this for the following reasons: First, when applied to new individuals, a prediction model typically performs worse than it did with its original study population^18,44. When a low predictive accuracy is discovered after an external validation study, researchers must decide whether to reject the model or update it to improve its predictive accuracy. By combining information captured in the original model with information from new individuals from the validation study, the model can be updated or recalibrated for local circumstances¹⁸. Model updating entails adding more predictors or altering a portion of the formula to better suit the external population¹⁸. The appropriateness of model updating during external validation is a point of contention among researchers. Some claim that the researchers are developing a new prediction model even with minor changes^18,45,46. Second, developing a new prediction model along with externally validating a well-known existing prediction model in the development cohort and concluding that the new model performs better is an inappropriate comparison in our view. Because this is then comparing the performance of one model in development to the performance of another model in external validation¹⁸. The newly developed model will almost always appear superior because it is optimally designed to fit the development data¹⁸. The performance of two existing prediction models should be directly compared in an external validation dataset that is independent of both model development cohorts. Given this, we did not evaluate an existing model's performance and then develop a new model on the same dataset. Nevertheless, for the purpose of comparison, we assessed a few of the published hypertension risk prediction models in our population and found that their performance was inferior to ours.

Our study has several limitations. Study participants were middle-aged Canadians. Prevention strategies are likely to be more effective if the young population can be targeted. Nevertheless, our study participants’ age range will likely have minimal impact on our study’s generalizability, as essential hypertension develops in the middle aged adults⁴⁷, as represented here. At baseline, we excluded participants with self-reported hypertension, which can potentially lead to misclassification of hypertension status. The incidence rate of hypertension in our study was relatively low compared to what is reported for the general Alberta population⁴⁸. There can be several potential reasons for that. The characteristics of the study participants in ATP may be different from the general Alberta population. For example, female participation in ATP data was more than double the male participation (69% vs. 31%), and the hypertension incidence rate in Alberta was much lower in females than the males in study age groups⁴⁸. A potential selection bias also may lead to a lower incidence rate of hypertension in our study The participants in ATP were mainly selected using the volunteer sampling method⁴⁹. Those who decided to join the study (i.e., who self-select into the survey) may have a different characteristic (e.g., healthier) than the non-participants. Due to the longitudinal nature of the study, there can also be a loss of study participants during follow-up. Participants who were lost to follow-up (e.g., due to emigration out of the province) may be more likely to develop hypertension. Our study ascertained outcome hypertension from a linked administrative health data (the hospital discharge abstract or physician claims data source) due to a lack of longitudinal data in ATP. There is a possibility that the outcome ascertainment was incomplete as we did not have measured blood pressure to verify. Also, people who did not have a healthcare encounter after cohort enrollment (e.g., did not visit a family physician/general practitioner or were not admitted to the hospital during the study period) were missed and can potentially lead to a lower hypertension incidence. We did not account for competing risks in our study because the expected event (death) rate is low as the cohort was healthy and relatively young at inception with a short follow-up time. We did not include genetic risk factors or biomarkers in our model. The inclusion of genetic risk factors in the model has the potential of improving risk prediction. However, our recent meta-analysis on hypertension risk prediction models⁴¹ and previous studies¹¹ did not show any differences in discriminative performance (pooled C-statistic was 0.76 for models developed using genetic risk factors/biomarkers). In addition, the inclusion of genetic risk factors in the model may decrease the prediction model’s application in routine clinical practice. Sodium intake is an important dietary factor for the risk of incident hypertension; however, in our study, sodium intake data were not available. We could not perform an external validation of our model, essential for any prediction model’s generalizability. Therefore, further validation of our model in other populations, particularly in another Canadian jurisdiction, is warranted. A direct comparison of our models' performance with other models on the same dataset will allow us to properly understand the quality of our new model, allowing for a head-to-head comparison of predictive performance between models. The direct comparison will show which model performs best, which will help guide future research and clinical practice. In the future, we plan to compare our model to other relevant models on a separate dataset.

In conclusion, we have developed a simple yet practical prediction model to estimate the risk of incident hypertension for the Canadian population. Risk assessment tools are believed to be convenient in motivating high-risk individuals for future health problems to modify their lifestyles to decrease their risks. Once the model is validated via external validation studies, it can help identify individuals at higher risk of hypertension, increase health consciousness, motivate individuals to improve their lifestyles and prevent or delay the onset of hypertension.

Data availability

The data that support the findings of this study are available from Alberta’s Tomorrow Project (ATP) but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Alberta’s Tomorrow Project (ATP).

References

Leung, A. A., Williams, J. V. A., McAlister, F. A., Campbell, N. R. C. & Padwal, R. S. Worsening hypertension awareness, treatment, and control rates in Canadian women between 2007 and 2017. Can. J. Cardiol. 36(5), 732–739. https://doi.org/10.1016/j.cjca.2020.02.092 (2020).
Article PubMed Google Scholar
Bromfield, S. & Muntner, P. High blood pressure: The leading global burden of disease risk factor and the need for worldwide prevention programs. Curr. Hypertens. Rep. 15(3), 134–136. https://doi.org/10.1007/s11906-013-0340-9 (2013).
Article PubMed PubMed Central Google Scholar
Nerenberg, K. A. et al. Hypertension Canada’s 2018 guidelines for diagnosis, risk assessment, prevention, and treatment of hypertension in adults and children. Can. J. Cardiol. https://doi.org/10.1016/j.cjca.2018.02.022 (2018).
Article PubMed Google Scholar
Leung, A. A., Bushnik, T., Hennessy, D., McAlister, F. A. & Manuel, D. G. Risk factors for hypertension in Canada. Heal Rep. 30(2), 1–13 (2019).
Google Scholar
Chowdhury, M. Z. I. & Turin, T. C. Precision health through prediction modelling: Factors to consider before implementing a prediction model in clinical practice. J. Prim. Health Care 12(1), 3–9. https://doi.org/10.1071/HC19087 (2020).
Article PubMed Google Scholar
Chowdhury, M. Z. I. & Turin, T. C. Validating prediction models for use in clinical practice: Concept, steps, and procedures focusing on hypertension risk prediction. Hypertens. J. 7(1), 54–62. https://doi.org/10.15713/ins.johtn.0221 (2021).
Article Google Scholar
Chowdhury, M. Z. I., Yeasmin, F., Rabi, D. M., Ronksley, P. E. & Turin, T. C. Predicting the risk of stroke among patients with type 2 diabetes: A systematic review and meta-analysis of C-statistics. BMJ Open https://doi.org/10.1136/bmjopen-2018-025579 (2019).
Article PubMed PubMed Central Google Scholar
Kanegae, H., Oikawa, T., Suzuki, K., Okawara, Y. & Kario, K. Developing and validating a new precise risk-prediction model for new-onset hypertension: The Jichi Genki hypertension prediction model (JG model). J. Clin. Hypertens. 20(5), 880–890. https://doi.org/10.1111/jch.13270 (2018).
Article CAS Google Scholar
Otsuka, T. et al. Development of a risk prediction model for incident hypertension in a working-age Japanese male population. Hypertens. Res. 38(6), 419–425. https://doi.org/10.1038/hr.2014.159 (2015).
Article PubMed Google Scholar
Lim, N. K., Son, K. H., Lee, K. S., Park, H. Y. & Cho, M. C. Predicting the risk of incident hypertension in a Korean middle-aged population: Korean genome and epidemiology study. J. Clin. Hypertens. 15(5), 344–349. https://doi.org/10.1111/jch.12080 (2013).
Article Google Scholar
Paynter, N. P. et al. Prediction of incident hypertension risk in women with currently normal blood pressure. Am. J. Med. 122(5), 464–471. https://doi.org/10.1016/j.amjmed.2008.10.034 (2009).
Article PubMed PubMed Central Google Scholar
Wang, B. et al. Prediction model and assessment of probability of incident hypertension: The Rural Chinese cohort study. J. Hum. Hypertens. https://doi.org/10.1038/s41371-020-0314-8 (2020).
Article PubMed Google Scholar
Kadomatsu, Y. et al. A risk score predicting new incidence of hypertension in Japan. J. Hum. Hypertens. 33(10), 748–755. https://doi.org/10.1038/s41371-019-0226-7 (2019).
Article PubMed Google Scholar
Chien, K. L. et al. Prediction models for the risk of new-onset hypertension in ethnic Chinese in Taiwan. J. Hum. Hypertens. 25(5), 294–303. https://doi.org/10.1038/jhh.2010.63 (2011).
Article PubMed Google Scholar
Parikh, N. I. et al. A risk score for predicting near-term incidence of hypertension: The Framingham heart study. Ann. Intern Med. https://doi.org/10.7326/0003-4819-148-2-200801150-00005 (2008).
Article PubMed Google Scholar
Giampaoli, S., Palmieri, L., Mattiello, A. & Panico, S. Definition of high risk individuals to optimise strategies for primary prevention of cardiovascular diseases. Nutr. Metab. Cardiovasc. Dis. https://doi.org/10.1016/j.numecd.2004.12.001 (2005).
Article PubMed Google Scholar
Chowdhury, M. Z. I., Yeasmin, F., Rabi, D. M., Ronksley, P. E. & Turin, T. C. Prognostic tools for cardiovascular disease in patients with type 2 diabetes: A systematic review and meta-analysis of C-statistics. J. Diabetes Complicat. 33(1), 98–111. https://doi.org/10.1016/j.jdiacomp.2018.10.010 (2019).
Article Google Scholar
Ramspek, C. L., Jager, K. J., Dekker, F. W., Zoccali, C. & van Diepen, M. External validation of prognostic models: What, why, how, when and where?. Clin. Kidney J. 14(1), 49–58. https://doi.org/10.1093/ckj/sfaa188 (2021).
Article PubMed Google Scholar
Riley, R. D. et al. External validation of clinical prediction models using big datasets from e-health records or IPD meta-analysis: Opportunities and challenges. BMJ 353, 27–30. https://doi.org/10.1136/bmj.i3140 (2016).
Article Google Scholar
Moons, K. G. M. et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): Explanation and elaboration. Ann. Intern. Med. 162(1), W1–W73. https://doi.org/10.7326/M14-0698 (2015).
Article PubMed Google Scholar
Altman, D. G., Vergouwe, Y., Royston, P. & Moons, K. G. M. Prognosis and prognostic research: Validating a prognostic model. BMJ https://doi.org/10.1136/bmj.b605 (2009).
Article PubMed PubMed Central Google Scholar
Altman, D. G. & Royston, P. What do we mean by validating a prognostic model?. Stat. Med. 19, 453–473 (2000).
Article CAS Google Scholar
Robson, P. J. et al. Design, methods and demographics from phase I of Alberta’s Tomorrow Project cohort: A prospective cohort profile. C Open 4(3), E515–E527. https://doi.org/10.9778/cmajo.20160005 (2016).
Article Google Scholar
Sun, D. et al. Recent development of risk-prediction models for incident hypertension: An updated systematic review. PLoS ONE 12(10), 1–19. https://doi.org/10.1371/journal.pone.0187240 (2017).
Article CAS Google Scholar
Echouffo-Tcheugui, J. B., Batty, G. D., Kivimäki, M. & Kengne, A. P. Risk models to predict hypertension: A systematic review. PLoS ONE https://doi.org/10.1371/journal.pone.0067370 (2013).
Article PubMed PubMed Central Google Scholar
Quan, H. et al. Validation of a case definition to define hypertension using administrative data. Hypertension https://doi.org/10.1161/HYPERTENSIONAHA.109.139279 (2009).
Article PubMed Google Scholar
Kang, H. The prevention and handling of the missing data. Korean J. Anesthesiol. 64(5), 402–406. https://doi.org/10.4097/kjae.2013.64.5.402 (2013).
Article PubMed PubMed Central Google Scholar
Sinharay, S., Stern, H. S. & Russell, D. The use of multiple imputation for the analysis of missing data. Psychol. Methods 6(3), 317–329. https://doi.org/10.1037/1082-989x.6.4.317 (2001).
Article CAS PubMed Google Scholar
Royston, P. & White, I. R. Multiple imputation by chained equations (MICE): Implementation in stata. J. Stat. Softw. 45(4), 1–20 (2011).
Article Google Scholar
Midi, H., Sarkar, S. K. & Rana, S. Collinearity diagnostics of binary logistic regression model. J. Interdiscip. Math. 13(3), 253–267. https://doi.org/10.1080/09720502.2010.10700699 (2010).
Article MATH Google Scholar
Chowdhury, M. Z. I. & Turin, T. C. Variable selection strategies and its importance in clinical prediction modelling. Fam. Med. Community Heal https://doi.org/10.1136/fmch-2019-000262 (2020).
Article Google Scholar
Harrell, F. E., Califf, R. M., Pryor, D. B., Lee, K. L. & Rosati, R. A. Evaluating the yield of medical tests. JAMA J. Am. Med. Assoc. 247(18), 2543–2546. https://doi.org/10.1001/jama.1982.03320430047030 (1982).
Article Google Scholar
Grønnesby, J. K. & Borgan, Ø. A method for checking regression models in survival analysis based on the risk score. Lifetime Data Anal. https://doi.org/10.1007/bf00127305 (1996).
Article PubMed MATH Google Scholar
Hosmer, D. W. & Lemeshow, S. Goodness of fit tests for the multiple logistic regression model. Commun. Stat. Theory Methods 9(10), 1043–1069. https://doi.org/10.1080/03610928008827941 (1980).
Article MATH Google Scholar
Arjas, E. A graphical method for assessing goodness of fit in Cox’s proportional hazards model. J. Am. Stat. Assoc. 83(401), 204–212. https://doi.org/10.1080/01621459.1988.10478588 (1988).
Article Google Scholar
Royston, P. Tools for checking calibration of a Cox model in external validation: Prediction of population-averaged survival curves based on risk groups. Stata J. 15(1), 275–291. https://doi.org/10.1177/1536867x1501500116 (2015).
Article Google Scholar
Sullivan, L. M., Massaro, J. M. & D’Agostino, R. B. Presentation of multivariate data for clinical use: The Framingham Study risk score functions. Stat. Med. 23(10), 1631–1660. https://doi.org/10.1002/sim.1742 (2004).
Article PubMed Google Scholar
Chowdhury, M. Z. I. et al. Prediction of hypertension using traditional regression and machine learning models: A systematic review and meta-analysis. PLoS ONE 17(4), e0266334. https://doi.org/10.1371/journal.pone.0266334 (2022).
Article CAS PubMed PubMed Central Google Scholar
Kivimäki, M. et al. Validating the Framingham hypertension risk score: Results from the Whitehall II study. Hypertension 54(3), 496–501. https://doi.org/10.1161/HYPERTENSIONAHA.109.132373 (2009).
Article CAS PubMed Google Scholar
Stevens, R. J. & Poppe, K. K. Validation of clinical prediction models: What does the “calibration slope” really measure?. J. Clin. Epidemiol. 118, 93–99. https://doi.org/10.1016/j.jclinepi.2019.09.016 (2020).
Article PubMed Google Scholar
Chowdhury, M. Z. I. Develop a comprehensive hypertension prediction model and risk score in population-based data applying conventional statistical and machine learning approaches. Published online 2021. https://doi.org/10.11575/PRISM/38706.
Ramirez, L. A. & Sullivan, J. C. Sex differences in hypertension: Where we have been and where we are going. Am. J. Hypertens. 31(12), 1247–1254. https://doi.org/10.1093/ajh/hpy148 (2018).
Article CAS PubMed PubMed Central Google Scholar
Kshirsagar, A. V. et al. A hypertension risk score for middle-aged and older adults. J. Clin. Hypertens. 12(10), 800–808. https://doi.org/10.1111/j.1751-7176.2010.00343.x (2010).
Article Google Scholar
Chowdhury, M. Z. I. et al. Summarising and synthesising regression coefficients through systematic review and meta-analysis for improving hypertension prediction using metamodelling: Protocol. BMJ Open https://doi.org/10.1136/bmjopen-2019-036388 (2020).
Article PubMed PubMed Central Google Scholar
Riley, R. D. et al. Minimum sample size for developing a multivariable prediction model: PART II—Binary and time-to-event outcomes. Stat. Med. 38(7), 1276–1296. https://doi.org/10.1002/sim.7992 (2019).
Article MathSciNet PubMed Google Scholar
Moons, K. G. M. et al. Risk prediction models: II. External validation, model updating, and impact assessment. Heart 98(9), 691–698. https://doi.org/10.1136/heartjnl-2011-301247 (2012).
Article PubMed Google Scholar
Hajjar, I. & Kotchen, T. A. Trends in prevalence, awareness, treatment, and control of hypertension in the United States, 1988–2000. J. Am. Med. Assoc. https://doi.org/10.1001/jama.290.2.199 (2003).
Article Google Scholar
Interactive Health Data Application—Display Results. Accessed March 29, 2021. http://www.ahw.gov.ab.ca/IHDA_Retrieval/selectSubCategoryParameters.do.
Ye, M. et al. Cohort profile: Alberta’s Tomorrow Project. Int. J. Epidemiol. 46(4), 1097–1098l. https://doi.org/10.1093/ije/dyw256 (2017).
Article PubMed Google Scholar

Download references

Acknowledgements

Alberta’s Tomorrow Project is only possible because of the commitment of its research participants, its staff, and its funders: Alberta Health, Alberta Cancer Foundation, Canadian Partnership Against Cancer and Health Canada, and substantial in-kind funding from Alberta Health Services. The views expressed herein represent the views of the author(s) and not of Alberta’s Tomorrow Project or any of its funders.

Funding

This research received no Grant from any funding agency in the public, commercial or not-for-profit sectors.

Author information

Authors and Affiliations

Department of Community Health Sciences, University of Calgary, 3280 Hospital Drive NW, Calgary, AB, T2N 4Z6, Canada
Mohammad Ziaul Islam Chowdhury, Alexander A. Leung, Hude Quan & Tanvir C. Turin
Department of Family Medicine, G012F, Health Sciences Centre, Cumming School of Medicine, University of Calgary, 3330 Hospital Drive NW, Calgary, AB, T2N 4N1, Canada
Mohammad Ziaul Islam Chowdhury, Maeve O’Beirne & Tanvir C. Turin
Department of Psychiatry, University of Calgary, 3280 Hospital Drive NW, Calgary, AB, T2N 4Z6, Canada
Mohammad Ziaul Islam Chowdhury
Department of Medicine, University of Calgary, 3280 Hospital Drive NW, Calgary, AB, T2N 4Z6, Canada
Alexander A. Leung
Health Status Assessment, Surveillance and Reporting, Public Health Surveillance and Infrastructure, Provincial Population and Public Health, Alberta Health Services, 10101 Southport Rd. SW, Calgary, AB, T2W 3N2, Canada
Khokan C. Sikdar

Authors

Mohammad Ziaul Islam Chowdhury
View author publications
You can also search for this author in PubMed Google Scholar
Alexander A. Leung
View author publications
You can also search for this author in PubMed Google Scholar
Khokan C. Sikdar
View author publications
You can also search for this author in PubMed Google Scholar
Maeve O’Beirne
View author publications
You can also search for this author in PubMed Google Scholar
Hude Quan
View author publications
You can also search for this author in PubMed Google Scholar
Tanvir C. Turin
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

M.Z.I.C., H.Q., and T.C.T. contributed to the conception and design of the study. M.Z.I.C. performed the analysis. M.Z.I.C. drafted the manuscript and A.A.L., H.Q., K.C.S., M.O. and T.C.T. critically reviewed it and suggested amendments prior to submission. All authors approved the final version of the manuscript and took responsibility for the integrity of the reported findings.

Corresponding author

Correspondence to Tanvir C. Turin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Chowdhury, M.Z.I., Leung, A.A., Sikdar, K.C. et al. Development and validation of a hypertension risk prediction model and construction of a risk score in a Canadian population. Sci Rep 12, 12780 (2022). https://doi.org/10.1038/s41598-022-16904-x

Download citation

Received: 31 January 2022
Accepted: 18 July 2022
Published: 27 July 2022
DOI: https://doi.org/10.1038/s41598-022-16904-x

This article is cited by

Development of risk models of incident hypertension using machine learning on the HUNT study data
- Filip Emil Schjerven
- Emma Maria Lovisa Ingeström
- Frank Lindseth
Scientific Reports (2024)
A comparison of machine learning algorithms and traditional regression-based statistical modeling for predicting hypertension incidence in a Canadian population
- Mohammad Ziaul Islam Chowdhury
- Alexander A. Leung
- Tanvir C. Turin
Scientific Reports (2023)
Building a predictive model for hypertension related to environmental chemicals using machine learning
- Shanshan Liu
- Lin Lu
- Qiang Zeng
Environmental Science and Pollution Research (2023)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Subjects

Abstract

Similar content being viewed by others

Prediction model and assessment of probability of incident hypertension: the Rural Chinese Cohort Study

Development of a risk prediction score for hypertension incidence using Japanese health checkup data

Development of a risk prediction model for incident hypertension in Japanese individuals: the Hisayama Study

Introduction

Methods

Study population

Selection of candidate variables

Definition of outcome and variables

Missing values

Statistical analysis

Comparing existing model performances to the developed model

Patient consent

Results

Case study

Existing models' performances in our dataset

Discussion

Data availability

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Publisher's note

Supplementary Information

Supplementary Information.

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

Development of risk models of incident hypertension using machine learning on the HUNT study data

A comparison of machine learning algorithms and traditional regression-based statistical modeling for predicting hypertension incidence in a Canadian population

Building a predictive model for hypertension related to environmental chemicals using machine learning

Comments

Search

Quick links