Re-calibration of coronary risk prediction: an example of the Seven Countries Study

We aimed at performing a calibration and re-calibration process using six standard risk factors from Northern (NE, N = 2360) or Southern European (SE, N = 2789) middle-aged men of the Seven Countries Study, whose parameters and data were fully known, to establish whether re-calibration gave the right answer. Greenwood-Nam-D’Agostino technique as modified by Demler (GNDD) in 2015 produced chi-squared statistics using 10 deciles of observed/expected CHD mortality risk, corresponding to Hosmer-Lemeshaw chi-squared employed for multiple logistic equations whereby binary data are used. Instead of the number of events, the GNDD test uses survival probabilities of observed and predicted events. The exercise applied, in five different ways, the parameters of the NE-predictive model to SE (and vice-versa) and compared the outcome of the simulated re-calibration with the real data. Good re-calibration could be obtained only when risk factor coefficients were substituted, being similar in magnitude and not significantly different between NE-SE. In all other ways, a good re-calibration could not be obtained. This is enough to praise for an overall need of re-evaluation of most investigations that, without GNDD or another proper technique for statistically assessing the potential differences, concluded that re-calibration is a fair method and might therefore be used, with no specific caution.

The purpose of this analysis was to perform a calibration and re-calibration process using data from two population groups whose parameters and data were all fully known, so that we could establish whether re-calibration gave the right answer. Moreover, we took advantage by the use of a recently proposed technique for assessing statistically calibration outcome that was specifically designed and tested for material consisting of survival data 24 .

Material and Methods
Population samples used for this analysis derived from the Seven Countries Study of Cardiovascular Diseases. We selected two rural sample in Finland (East and West Finland) and one in the city of Zutphen, the Netherlands that were pooled together to represent Northern Europe (NE). Then, we selected two rural samples in Italy (Crevalcore and Montegiorgio) and two rural samples in Greece (Crete and Corfu) that were pooled together to represent Southern Europe (SE). The entry examination was held between 1959 and 1961 with the enrolment of 2555 middle-aged men in NE and 2927 men in SE, representing respectively 93% and 98% of defined samples. For some purposes of the analysis, data of the US sample and the pool of the three Serbian samples of the Seven Countries Study were also used, as described later. Details on the Seven Countries Study can be found elsewhere 25 .
Six risk factors used for this analysis were: a) age in years, approximated to the nearest birthday; b) body mass index in kg/m 2 , derived from the measurement of height and weight following the technique described in the WHO Manual Cardiovascular Survey Methods 26 (WHO Manual); c) smoking habits, derived from a questionnaire, classified in three variables as never smokers, ex-smokers and smokers treated as dummy variables (0-1), with never smokers used as reference in the multivariable models; d) systolic blood pressure in mm Hg, measured following the technique of the WHO Manual 26 , in supine position, with mercury sphygmomanometer: two consecutive measurements were averaged and used for analysis; e) heart rate in beats/min, derived from a resting ECG tracing; f) serum cholesterol in mmol/L, measured on casual blood samples following the technique of Anderson and Keys 27 .
All methods were carried out in accordance with relevant guidelines and regulations at the time of the field examinations, performed in the late 50′ to early 60′. In particular, since baseline data were collected before the era of the Helsinki Declaration (June 1964), with consent implied by participation in the examinations, there were no expressed approvals from institutional or licensing committees. However, there were subsequently verbal or written consents given by all informed participants to collect follow-up data.
Mortality data were collected systematically including the availability of deaths certificates plus other information derived from clinical records, interviews with family and hospital physicians, and relatives of the deceased, and any other witness of the fatal event. Coding was made by a single reviewer, following defined rules and criteria and employing the 8th Revision of the WHO ICD 28 . In case of multiple causes of death (present in about half of cases) and of uncertainties about the principal cause of death, a rank order was adopted with violence, cancer, CHD, stroke and other in sequence. CHD was defined by ICD codes 410-413 and cases of sudden death when other specific causes could be reasonably excluded. Cases described as chronic CHD or hypertensive heart disease and cases manifested only as heart failure, arrhythmia or blocks were not coded as CHD for reasons given elsewhere 29 .
Despite the availability of 50-year follow-up data, this analysis was confined to the first 25 years since usually predictive tools of CHD events do not consider too long follow-up periods. Subjects carrying a CHD at baseline were excluded from analysis, as well as those with missing risk factor measurements, thus reducing the denominators to 2360 in NE and 2789 in SE.
Cox proportional hazards models were computed separately for NE and SE using the six risk factors as predictors and 25-year CHD deaths as end-point. Calibration of the models were made using the procedure of Greenwood-Nam-D' Agostino as modified by Demler 24 (GNDD) that produces chi-squared statistics using 10 deciles of observed and expected (estimated) risk. This approach represents a sophisticated procedure to be used when the material includes survival data and corresponds to the Hosmer-Lemeshaw chi-squared employed for the multiple logistic equation that deals only with binary data. Instead of the number of events, the GNDD test uses survival probabilities of observed and predicted events.
Calibration was performed on the original models (called SELF) separately for NE and SE, by comparing observed and expected risk in decile classes of risk, tested with the GNDD procedure (SELF model). A simulation of re-calibration was carried out for the NE group (validation population) applying the SE (derivation population) risk function components in five different ways, producing five different re-calibrated models: the same was done applying the NE risk function to the SE data. The five re-calibrated models were as follows: (1) model SCM applying the whole SE risk function, including (S) (baseline cumulative survival, S (0) ), C (coefficients) and M (mean of risk factor levels); (2) model CM substituting only C (coefficients) and M (mean of risk factor levels) of the opposite group; The risk probabilities predicted by the models in decile classes of risk were computed and used for the final estimate of the GNDD chi-squared. The above procedures represent an extreme not previously tested and special case (since practically unique) of re-calibration where the parameters of both derivation and validation populations are fully known. Table 1 reports mean levels of risk factors, and numbers and rates of CHD fatal events. Risk factor levels of systolic blood pressure, serum cholesterol, and smoking habits were significantly higher in NE than in SE, while BMI was so in SE versus NE, and no significant difference was found for age and heart rate (although the latter close to significance). Death rate from CHD in 25 years was roughly three-fold higher in NE than in SE.

Results
Cox models for the two groups (Table 2) showed that BMI had not significant coefficient in NE, but significant in SE, while heart rate and ex-smokers had not significant coefficients in both groups. Age, systolic blood pressure, smokers and serum cholesterol had significant coefficients in both groups. The comparison between the two areas showed a large difference in the baseline cumulative survival that contrasted with the absence of significant differences of risk factors coefficients.
An intermediate finding of the GNDD procedure is represented by the estimates of risk probabilities for observed and predicted (estimated) events in decile classes of risk. A selection of these data (Table 3) indicated that the NE re-calibrated risk probabilities were similar to the original ones (SELF), only in the model that substituted C alone, and the same was the case for SE. Again, in NE models SCM and CM produced gross under-and over-estimates of risk, respectively, while the reverse was the true for SE. Also the WC and the ALTS models were miss-calibrated.
In Table 3 the final outcome of calibration and re-calibration tests are expressed as p of chi-squared according to the procedure of GNDD. In both areas, calibration based on the original (SELF) model was good since associated with high p value of the chi squared. The same happened when only the coefficient of the opposite population were applied, while in all other cases under-or over-estimation or risk were the outcome.
Findings of the above analyses are summarized graphically in Fig. 1 where the SCM model determined an under-estimation of risk in NE and an over-estimation of risk in SE. In the case of model CM an over-estimation was seen in NE and an under-estimation in SE. These effects were mainly due to the fact that the magnitude of coefficients in SE was slightly larger than in NE, despite the absence of significant differences between the pairs of coefficients. On the other hand, a good estimation of risk was seen for the SELF (original) model and model C. Finally, the two models with wide coefficients (WC), and altered S (0) (ALTS) were also miss-calibrated.

Discussion
Strengths and limitations. There was no need to produce a re-calibration in these two population groups since for both of them all the data and parameters of the predictive models were fully known, including risk factor levels, identification of events, cumulative survival and multivariable coefficients of the six risk factors. The exercise was directed to apply, in different ways, the parameters of the predictive model of one population to the other (and vice-versa) with the purpose to compare the outcome of the simulated re-calibration with the real data. The main conclusion was that a good re-calibration could be obtained only when we substituted the risk factor coefficients since they were similar in magnitude and not significantly different between the two populations. In all other cases, a good re-calibration could not be obtained. This exercise has several strength points since the two population groups were studied by the same team within the same cooperative study, examining population samples of the same sex and age range and in the same years, using the same risk factor measurement techniques, the same collection of data procedure, the same follow-up duration, the same diagnostic criteria for end-points and the same coding rules. By contrast, the two population groups  had different levels of some major risk factor such as blood pressure, serum cholesterol and smoking habits risk, different rates of CHD mortality, but eventually similar (and not significantly different) multivariable coefficients of risk factors. Moreover, the analysis took advantage from the use of the most recent and sophisticated procedure for estimation of calibration and re-calibration 24 that takes into proper account the presence of survival data, a technique that was not at hand of previous Investigators during the years 2001-2015 when they claimed in different datasets that re-calibration is possible without specific reserves [11][12][13][14][15][16][17][18][19][20][21][22][23] . The only common outcome to all models was an increasing estimated risk from decile 1 to decile 10 which says that the estimate of relative risk could be relatively good, although usually smaller than that provided by the "SELF" (original) models as seen here. It should be made clear that the SCM model was not the outcome of a real re-calibration but simply a test to explore what happened with a re-calibration procedure. Moreover, it was not granted that by applying the coefficients of the opposite population produced a good re-calibration. This probably happened only because the coefficients of the two populations were very similar. In fact, when we applied the co-called Widest coefficients, based on an arbitrary choice (derived however from data of the same study) the re-calibration was not successful, although the coefficients were not significantly different from the original ones. Similarly, the use of slightly different baseline cumulative survival was an arbitrary selection showing however that by little variations of S (0) outcome might be largely influenced.
There are several limitations in this analysis. In fact, it deals only with male sex and with a restricted age range at entry examination. The analysis was limited to few risk factors and to a limited group of population samples, with findings that cannot be automatically extrapolated to other situations. Moreover, it can be argued that our conclusions cannot be transferred to the present days since population risk factor levels are likely different from those of 50 years ago, also due to the more common drug treatment of high blood pressure and high cholesterol levels. However, this is not demonstrated and if the levels attained by drug treatment have a predictive value, we should not necessarily negate a possible general rule of the relationship between risk factors and CHD events. In fact, it is possible that the slope of the relationship should not depend from risk factors distribution.
On the basis of this experience, we feel that the re-calibration process should be carried with much caution owing to a number of uncertainties that include the relationship of population sample risk factors with the events collected from other sources and the magnitude of the risk factor coefficients coming from other population experiences. This is enough to praise for an overall need of re-evaluation of most investigations 11-23 that, without a proper technique for statistically assessing the potential differences 24 , concluded that re-calibration is a fair method and might therefore be used, with no specific caution. In general it appears that in order to have a good re-calibration one needs: A) a very good (and/or very lucky) estimate of baseline cumulative survival S (0) that is probably hard to obtain since usually derived from external data; indeed, the estimate of the baseline cumulative survival, usually derived from sparse and diverse sources, cannot be fully trusted to reflect the reality of the study population; B) a set of "imported" coefficients not significantly different from the theoretical (and unknown) ones, but also very similar in magnitude.
Contributions from the literature. Most of the re-calibration procedures found in the literature [11][12][13][14][15][16][17][18][19][20][21][22][23] started from the assumption that the "imported" coefficients were valid and similar or not significantly different from those (unknown) of the "importing" (validation) populations. These studies were carried in 13 different countries (9 European and two Asian countries, plus Australia and the USA). Several contributions have put together three components derived from largely different sources: risk factor distribution from population samples, mortality data from regional or national origin (not related to a real follow-up) and coefficients from the Euro-Score project or the Framingham Heart Study 14,15,[17][18][19][20][21] . Only three studies 11,12,22 have reported the risk function of the "validation population", while others did not do so although it was clear that it could have been computed and presented 13,16,19,20,23 . This was particularly unfortunate since a good occasion was lost to investigate the problem of the possible homogeneity of risk factor coefficients. Some contributions have openly claimed that the possible heterogeneity of multivariable coefficients was not a problem, quoting references that did not fully support this concept 13,18 . However, when both the high and low risk functions of the Euro-Score project were tested, the high-risk model did not produce good calibration 13,16,20,21 . This may depend upon a casual combination of high levels of risk factor coefficients and high levels of estimated survival that may tend to balance one each other. In fact, in a test made on our NE population group we found that applying the widest coefficients (that by themselves induce a poor calibration) together with higher levels of survival (that by themselves induce a poor calibration) ended up in an acceptable calibration.
In at least three contributions the Hosmer-Lemeshaw chi-squared test was applied to data derived from models that included survival data 19,21,23 , that is in an improper way.

Conclusions.
On the basis of this experience, good re-calibration can be achieved only in a few cases and when lucky circumstances do coincide. Therefore, much caution must be used in reaching valid conclusions.
The re-calibration procedure will probably lose its role when more and more countries will be able to produce their own risk functions. However, this process may stimulate the need to study deeper the problem of the heterogeneity or non-heterogeneity of multivariable coefficients of risk factors. The absence of systematic differences across multivariable coefficients of CHD risk function has been repeatedly shown comparing the cohorts and/or areas of the Seven Countries Study 3,29-34 but this seems to be an isolated effort that calls for a systematic approach from many different sources. There is a theoretical, and not only practical issue, since demonstrating the overall similarity in the magnitude of multivariable coefficients could represent the possible identification of a general biological rule linking risk factor levels and events, when everything else being equal.
This analysis was mainly a methodological exercise, but we acknowledge the fact that a number of "new risk factors" may improve the prediction of coronary events. Among others, one may wish to explore the effects of statins on cardiovascular outcomes by smoking status 35 , vitamin-D deficiency and non-lipid biomarkers of cardiovascular risk 36 , evidence-based assessment of lipoprotein(a) as a risk biomarker for cardiovascular diseases 37 , association of serum lipids and CHD in just observational studies 38 and whether low high-density lipoprotein-cholesterol should be treated 39 . Unfortunately, all these terms either were not available in our old study or the questions were far from our main objective.
Declarations. Ethics approval and consent to participate. Entry examination was held well before the Helsinki declaration with consent implied by participation in the examinations, while subsequently verbal or written consent was obtained to collect follow-up data.
Availability of data and materials. The data are not available to the public or shared as they are still confidentially kept by the Principal Investigators of the Seven Countries Study.