Comparative performance of lung cancer risk models to define lung screening eligibility in the United Kingdom

Background: The National Health Service England (NHS) classifies individuals as eligible for lung cancer screening using two risk prediction models, PLCOm2012 and Liverpool Lung Project-v2 (LLPv2). However, no study has compared the performance of lung cancer risk models in the UK.
Methods: We analysed current and former smokers aged 40–80 years in the UK Biobank (N = 217,199), EPIC-UK (N = 30,813), and Generations Study (N = 25,777). We quantified model calibration (ratio of expected to observed cases, E/O) and discrimination (AUC).
Results: Risk discrimination in UK Biobank was best for the Lung Cancer Death Risk Assessment Tool (LCDRAT, AUC = 0.82, 95% CI = 0.81–0.84), followed by the LCRAT (AUC = 0.81, 95% CI = 0.79–0.82) and the Bach model (AUC = 0.80, 95% CI = 0.79–0.81). Results were similar in EPIC-UK and the Generations Study. All models overestimated risk in all cohorts, with E/O in UK Biobank ranging from 1.20 for LLPv3 (95% CI = 1.14–1.27) to 2.16 for LLPv2 (95% CI = 2.05–2.28). Overestimation increased with area-level socioeconomic status. In the combined cohorts, USPSTF 2013 criteria classified 50.7% of future cases as screening eligible. The LCDRAT and LCRAT identified 60.9%, followed by PLCOm2012 (58.3%), Bach (58.0%), LLPv3 (56.6%), and LLPv2 (53.7%).
Conclusion: In UK cohorts, the ability of risk prediction models to classify future lung cancer cases as eligible for screening was best for LCDRAT/LCRAT, very good for PLCOm2012, and lowest for LLPv2. Our results highlight the importance of validating prediction tools in specific countries.


BACKGROUND
Lung cancer is the leading cause of cancer death worldwide. 1,2 Two large, randomised trials have now demonstrated that screening by low-dose computed tomography (LDCT) can reduce mortality from lung cancer among people with a heavy smoking history. Lung cancer mortality was reduced by 20% over 5 years in the USA National Lung Screening Trial (NLST) with 3 annual LDCT screens 3 and by 24% (men) and 33% (women) over 10 years in the Dutch-Belgian NELSON trial with 4 LDCT screens over 5.5 years. 4 The USA issued a national recommendation for lung screening in 2014. 5 In the United Kingdom, there have been several successful pilot studies, including the Manchester Lung Health Checks 6,7 and the Liverpool Healthy Lung Programme. 8 Compared with the implementation of lung screening in the USA, the UK has often been more successful in terms of overall uptake and engagement of populations with low socioeconomic status (SES), 6,9,10 and lung cancer detection rates have often exceeded those in the NLST. 3,6,11,12 Building on this success, the National Health Service (NHS) England is implementing a £70 million programme of "Targeted Lung Health Checks" in 10 areas with high lung cancer mortality. 13,14 In the USA, the US Preventive Services Task Force (USPSTF) guidelines use categorical criteria to determine who is eligible for screening. Eligibility by the 2013 guideline required age 55-80 years, at least 30 pack-years smoked, and for former smokers, no more than 15 years since quitting. 5 The 2020 draft guideline expands eligibility by lowering the age-to-start from 55 to 50 years, and lowering the pack-year threshold from 30 to 20 pack-years. 15 However, secondary analyses of the NLST demonstrated that lung screening may be more efficient and cost-effective when eligibility is based on individual lung cancer risk, estimated using a continuous risk prediction model. 16-19
Lung screening in the UK was implemented using individual risk-based eligibility from the beginning, and the NHS England protocol specifies that individuals aged 55-74 years can be screened if their lung cancer risk exceeds 1.51% by the PLCOm2012 model (6-year risk) or 2.5% by the Liverpool Lung Project version 2 (LLPv2) model (5-year risk). 14 The choice of which risk model to use for screening eligibility is important. Poor model discrimination or calibration can reduce the efficiency and cost-effectiveness of screening and even lead to net harm if models select individuals who are unlikely to benefit from screening. Risk models differ in the variables that they include; for example, the LLP, LLPv2, and LLP version 3 (LLPv3) models include only one measure of smoking (duration), whereas the Lung Cancer Death Risk Assessment Tool (LCDRAT) includes smoking duration, pack-years, quit-years, and intensity. 18,20 Most models, including PLCOm2012 and LCDRAT, were developed using USA data, whereas the LLP/LLPv2/LLPv3 models were developed in the UK. 17,18,20 Although both PLCOm2012 and LLPv2 have been implemented successfully in screening studies, the absence of outcome data on individuals who were not eligible (and thus not screened) has precluded evaluation of whether either of these is the optimal model. 6,11 Several models have been evaluated in population cohort studies in the USA, 21,22 but non-USA evaluations are scarce, 23,24 and none include data from UK cohorts.
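The eligibility rules described above can be written out as a short Python sketch. This is only an illustration of the criteria as quoted in the text, not code from the study: the `Smoker` record and the function names are hypothetical, the risk inputs are assumed to be precomputed by the respective models and expressed as probabilities, and the strict inequality follows the wording "exceeds".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Smoker:
    """Hypothetical record for a current or former smoker."""
    age: float
    pack_years: float
    quit_years: Optional[float]  # None for current smokers


def uspstf_eligible(p: Smoker, min_age: float = 55, max_age: float = 80,
                    min_pack_years: float = 30,
                    max_quit_years: float = 15) -> bool:
    """Categorical criteria; defaults are the USPSTF 2013 values in the text."""
    if not (min_age <= p.age <= max_age):
        return False
    if p.pack_years < min_pack_years:
        return False
    # Current smokers have no quit-years restriction.
    return p.quit_years is None or p.quit_years <= max_quit_years


def uspstf_2020_draft_eligible(p: Smoker) -> bool:
    """Draft 2020 criteria: start age 50 and at least 20 pack-years."""
    return uspstf_eligible(p, min_age=50, min_pack_years=20)


def nhs_eligible(age: float, plcom2012_6y_risk: float,
                 llpv2_5y_risk: float) -> bool:
    """NHS England rule as described: age 55-74 and either risk above threshold.

    Risks are probabilities (0.0151 means 1.51%) and are assumed to be
    computed elsewhere; the models themselves are not implemented here.
    """
    if not (55 <= age <= 74):
        return False
    return plcom2012_6y_risk > 0.0151 or llpv2_5y_risk > 0.025
```

For example, a 52-year-old former smoker with 25 pack-years who quit 10 years ago is ineligible under the 2013 criteria but eligible under the draft 2020 criteria.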
Here we performed a comparative evaluation of lung cancer risk models to define lung screening eligibility in the UK. We analysed 3 cohort studies to quantify the calibration and discrimination of risk models and then compared their ability to classify future lung cancer cases into a group defined as eligible for screening.

METHODS
We analysed longitudinal data from the UK Biobank, European Prospective Investigation into Cancer and Nutrition (EPIC)-UK, and Generations Study cohorts. The UK Biobank is a prospective cohort study of 500,000 people aged 40-72 years at recruitment (2006-2010). 25 EPIC-UK recruited participants aged 45-74 years in Cambridge and aged ≥20 years in Oxford during 1993-2000. 26 EPIC-Cambridge used population-based recruitment of patients of general practitioners, while EPIC-Oxford was comprised of both population-based recruitment and a subset targeted at "health conscious" individuals. Finally, the Generations Study recruited 112,000 women aged ≥16 years during 2003-2011, of whom about one-third had a mother, daughter, or sister also participating in the study. 27 In all cohorts, cancer and death ascertainment relied on registry linkages at minimum, sometimes with additional active follow-up. 25-27 From all participants in these cohorts, we restricted to those known to be current or former smokers who were aged 40-80 years at enrolment, including 217,199 in UK Biobank, 30,813 in EPIC-UK, and 25,777 in the Generations Study (total N = 273,789). Never smokers and participants with unknown smoking status were excluded. After these restrictions, substantial amounts of missing data were present for some variables in some cohorts, such as 31% missing smoking intensity (cigarettes per day) in UK Biobank. Missing data were handled using various approaches within the framework of multiple imputation (see Supplement). Among participants who were alive and free of lung cancer at the end of follow-up (i.e. in whom future lung cancer status would be unknown), follow-up time was at least 6 years for all participants in UK Biobank and EPIC-UK and for 88% in the Generations Study.
We evaluated 8 lung cancer risk models. These included the PLCOm2012 and LLPv2 models, which are proposed for use in selecting screening participants in the NHS protocol. 11,14,17,20 We also evaluated the Bach model, 28 the LCDRAT, 18 the Lung Cancer Risk Assessment Tool (LCRAT), 18 the original LLP model, 20 the LLPv3, 29 and the Hoggart model. 30 Each of these models is either a USA-based model that performs well in USA data (Bach, LCDRAT, LCRAT, PLCOm2012) 21 or a European model whose performance in European data is unknown (LLP, LLPv2, LLPv3, Hoggart). Risk thresholds above which screening can be offered have been proposed for LCRAT and LCDRAT, 19,31 and LLPv3, 29 in addition to PLCOm2012 and LLPv2.
Risk estimates for the LCRAT (5-year time horizon), LCDRAT (5-year), and Hoggart (1-year) models were generated using the lcmodels package in R. 32 Estimates for the Bach model used code adapted from lcmodels to reduce the time horizon to 5 years. Estimates for PLCOm2012 (6-year time horizon), LLPv3 (5-year), LLPv2 (5-year), and LLP (5-year) were calculated directly. 17,20 We present results for two models in Supplementary Table 1 and do not include them in discussions below, due to redundancy with LLPv2/LLPv3 (LLP) and very high overestimation of risk (Hoggart). We present results for LLPv2 in the main manuscript, even though it may be eventually replaced by LLPv3, because LLPv2 is listed in the NHS England protocol.
We calculated calibration as the ratio of expected to observed (E/O) lung cancer cases or deaths, overall and in subgroups. We quantified discrimination using the area under the receiver operating characteristic curve (AUC) statistic. The 95% confidence intervals (CIs) for calibration and discrimination statistics account for within- and between-imputation variance. 21
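As a minimal sketch of these statistics, the Python below computes E/O and AUC from predicted risks and binary outcomes, and combines estimates across imputed datasets. The function names are illustrative, and the pooling step assumes Rubin's rules, which is a common way to combine within- and between-imputation variance but is not confirmed here as the exact procedure used in the analysis. The pairwise AUC is quadratic in sample size and is meant for exposition only.

```python
def expected_over_observed(predicted_risks, outcomes):
    """E/O: sum of predicted risks divided by the number of observed events."""
    observed = sum(outcomes)
    if observed == 0:
        raise ValueError("E/O is undefined when no events are observed")
    return sum(predicted_risks) / observed


def auc(predicted_risks, outcomes):
    """Probability that a random case has higher predicted risk than a
    random non-case, counting ties as one half."""
    cases = [r for r, y in zip(predicted_risks, outcomes) if y == 1]
    noncases = [r for r, y in zip(predicted_risks, outcomes) if y == 0]
    if not cases or not noncases:
        raise ValueError("AUC needs at least one case and one non-case")
    wins = 0.0
    for c in cases:
        for n in noncases:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cases) * len(noncases))


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates with Rubin's rules (an assumption).

    Returns the pooled estimate and its total variance,
    total = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching inputs from at least two imputations")
    pooled = sum(estimates) / m
    within = sum(variances) / m
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    return pooled, within + (1 + 1 / m) * between
```

An E/O above 1 indicates that a model predicts more events than were observed, while an AUC of 0.5 corresponds to no discrimination.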

RESULTS
Among the 217,199 current or former smokers aged 40-80 years in UK Biobank, 1265 lung cancer cases were diagnosed within 5 years of enrolment, and 700 lung cancer deaths occurred in this period (Table 1). In EPIC-UK, 156 lung cancers and 100 lung cancer deaths occurred over 5 years among 30,813 participants, and in the Generations Study, 53 lung cancers and 26 lung cancer deaths occurred over 5 years among 25,777 participants. Distributions of demographic and smoking variables differed across cohorts.
Calibration estimates for all risk models were >1 in all cohorts, indicating that the models predicted more lung cancer cases (or for LCDRAT, lung cancer deaths) than were observed over the time period specified by the model (Fig. 1). For each model, overestimation of risk (and therefore miscalibration) was greatest in the Generations Study and smallest in UK Biobank. Across the risk models, in UK Biobank, LLPv3 was best calibrated (E/O = 1.20, 95% CI = 1.14-1.27) and LLPv2 was worst calibrated (E/O = 2.16, 95% CI = 2.05-2.28). The order in which models were ranked in EPIC-UK and the Generations Study was similar to UK Biobank, but E/O statistics were higher.
To further investigate model overestimation, we calculated E/O estimates stratified by demographic and smoking characteristics in UK Biobank (Table 2). Analogous estimates for discrimination (stratified AUCs) are presented in Supplementary Table 2. For all models, there was a strong positive relationship between model overestimation and SES, which was measured by the area-level Townsend deprivation index. For example, for LCDRAT, E/O statistics across SES quartiles were 2.03 (highest SES), 1.95, 1.61, and 1.26 (lowest SES). Patterns for other characteristics differed across models, though frequent patterns included more overestimation in men than in women, in former smokers than in current smokers, and at the extremes of age (40s and 70s). When stratifying by quintiles of predicted risk (Supplementary Fig. 1), PLCOm2012 substantially underestimated risk in the lowest-risk quintile while modestly overestimating risk in the upper categories. Overestimation tended to be higher at higher risks for LLPv2, LLPv3, and Bach, while it was higher at lower risks for LCDRAT and LCRAT.
Table 3 considers the hypothetical impact of using each risk model to determine who is screening eligible in the combined cohorts. USPSTF 2013 criteria classified 50.7% of future lung cancer cases as screening eligible (Table 3). Similarly, 50.2% of future lung cancer deaths (N = 415) over 5 years would be screening eligible, among which some fraction could be prevented by earlier detection. Applying risk models and the thresholds described above to define the screened population identified higher proportions of future cases: the LCDRAT and LCRAT identified the highest proportion of future cases as screening eligible (each 60.9%, N = 897), followed by PLCOm2012 (58.3%, N = 859), Bach (58.0%), LLPv3 (56.6%), and LLPv2 (53.7%).
Fig. 1 Calibration of lung cancer risk models in the UK Biobank, EPIC-UK, and Generations Study cohorts, as measured by the ratio of expected to observed cases. UKB UK Biobank, GS Generations Study. Estimates for UK Biobank also appear in Table 2 and Supplementary Table 3.
Table 1 shows data prior to imputation of missing data. UK educational categories were mapped to USA categories as described in the Supplement. Body mass index categories were defined as follows: <18.5 underweight, 18.5-24.9 normal weight, 25-29.9 overweight, and ≥30 obese. "Asbestos exposure" reflects self-reported occupational asbestos exposure. COPD chronic obstructive pulmonary disease, NA not applicable. a Eligibility by the US Preventive Services Task Force (USPSTF) 2013 criteria requires age 55-80 years, at least 30 pack-years, and no more than 15 quit-years. Eligibility by the draft USPSTF 2020 criteria requires age 50-80 years, at least 20 pack-years, and no more than 15 quit-years.
Comparative performance of lung cancer risk models to define lung. . . HA Robbins et al.
We analysed individuals aged 40-80 years, but the NHS England protocol restricts eligibility to ages 55-74 years. When we repeated our analysis after restricting to individuals aged 55-74 years in UK Biobank (n.b. there were no participants in UK Biobank aged >74 years), AUCs decreased as expected due to the loss in prediction derived from age variation. However, the rank order of AUCs and the calibration results were not affected (Table 4).
Supplementary Table 3 describes the characteristics of lung cancer cases that are not identified as screening eligible (are "missed") by USPSTF 2013, USPSTF 2020, and each risk model at the risk thresholds identified in Table 3. Compared with USPSTF 2013, the cases missed by risk models, while fewer in number, were more commonly former smokers (62% of cases missed by USPSTF vs. 69-80% for risk models). They also tended to be slightly younger (median age at baseline 63 years for cases missed by USPSTF vs. 60-61 years for risk models) and slightly more frequently female (53% of cases missed by USPSTF vs. 54-57% for risk models). Patterns using thresholds based on USPSTF 2020 were similar or more pronounced.

DISCUSSION
Lung cancer screening has the potential to substantially reduce lung cancer mortality among people with a heavy smoking history. In the United Kingdom, the success of the Targeted Lung Health Checks will depend partially on whether the programme can be implemented efficiently and cost-effectively. The protocol currently recommends the use of the PLCOm2012 and LLPv2 risk models to identify screening-eligible individuals. 14 In this study, we compared the performance of these two models along with others that have performed well in other high-income settings. We found that the LLPv2 model had the worst calibration and classified the lowest proportion of future lung cancer cases as eligible for screening. The PLCOm2012 model had better calibration, though all models predicted more cases than were observed. The LCDRAT classified the highest proportion of future lung cancer cases as eligible for screening, with very good performance also observed for the LCRAT, PLCOm2012, and Bach models.
The models evaluated in our study were previously validated in multiple USA cohorts, including the NIH-AARP and CPS-II, 21 as well as the NLST and PLCO trials. 22 Taken together, these studies showed good calibration for the Bach model, LCRAT, LCDRAT, and PLCOm2012 but overestimation of risks for the LLP model. In our study, all models overestimated risks; the extent was greatest for LLPv2. For discrimination, prior results in NIH-AARP and CPS-II showed best performance for LCDRAT, followed sequentially by LCRAT, PLCOm2012, Bach, and LLP. 21 The study analysing PLCO and NLST found higher discrimination for PLCOm2012 and Bach compared with LLP. 22 The order in which models ranked in our study was similar, with best performance for the LCDRAT, LCRAT, Bach, and PLCOm2012 models. The likely explanation for inferior discrimination of the LLP/LLPv2/LLPv3 models is that they incorporate only smoking duration (omitting intensity and quit-years) and use categorical instead of continuous parameterisations of age and smoking. 20 Overall, the magnitude of AUCs in our study (often exceeding 0.80) was higher than in prior reports (typically ranging from 0.75 to 0.80). 21,22 This is likely caused by a wider age distribution in our analysis, which included more younger individuals.
In UK screening studies, lung cancer detection rates have commonly been higher than in the NLST. 3,6,7,11 Lung cancer detection over 2 screens was 4.5% in the Manchester Lung Health Checks, compared with 1.7% in NLST. 3,6,7 Detection rates in single-screen UK studies are commonly approximately 2%. 11,12 These observations might have been taken as evidence that USA-based lung cancer risk models would predict too few cases in UK populations, but we found the opposite result. Unlike the screening studies, which commonly comprised individuals living in low-SES communities, 6,11,12 the research cohorts we analysed have overrepresentation of high-SES individuals and are likely influenced by "healthy volunteer" effects. In UK Biobank, all-cause mortality among 70-74-year-olds is half that in the general population (although the difference in cancer incidence is smaller). 33 EPIC-Oxford is partially comprised of "health-conscious" individuals. The Generations Study is a volunteer cohort with recruitment based on engagement in a health issue (finding the causes of breast cancer), and cancer incidence is estimated to be 16% lower than in the general UK population (unpublished data). These influences, taken together with the overrepresentation of high-SES individuals in whom overestimation is highest, suggest that model calibration would be better in the overall UK population of ever-smokers than in the research cohorts we analysed.
There is a troubling consequence to the correlation between risk model overestimation and SES. Considering four individuals with the same true risk of lung cancer, one in each SES quartile, the individual with the highest SES will have her risk overestimated the most and will be most likely to be classified as eligible for screening. This could exacerbate disparities in lung screening. 34 Our findings suggest that there are factors related to SES that increase lung cancer risk but are not captured by the variables in risk models or are related to differential measurement error for the variables in risk models. Further, the effects of USA educational and ethnicity categories on lung cancer risk are unlikely to align with these effects in the UK. Any future efforts to develop risk models for use in lung screening in the UK should focus carefully on the role of SES and the accurate estimation of risk within subgroups.
Estimates are provided for UK Biobank only due to the small size of the other cohorts. SES is measured using the Townsend deprivation index, an area-level measure that is applied to individuals based on their place of residence. Quartiles of the Townsend deprivation index were defined such that UK Biobank participants were divided equally, using the following cutpoints: −6.26 (minimum), −3.42, −1.75, 1.21, 11.0 (maximum). Body mass index categories were defined as follows: <18.5 underweight, 18.5-24.9 normal weight, 25-29.9 overweight, and ≥30 obese. CPD cigarettes per day, SES socioeconomic status.
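To make the subgroup calibration concrete, the following Python sketch bins individuals into Townsend quartiles using the cutpoints quoted in this paper and computes E/O within each quartile. It is an illustration under stated assumptions rather than the study's code: the function names are hypothetical, and the handling of scores that fall exactly on a cutpoint (assigned here to the less deprived quartile) is an arbitrary choice. Higher Townsend scores indicate greater deprivation, so quartile 0 corresponds to the highest SES.

```python
from bisect import bisect_left

# Internal quartile boundaries of the Townsend deprivation index in
# UK Biobank, as quoted in the text (minimum -6.26, maximum 11.0).
TOWNSEND_CUTPOINTS = [-3.42, -1.75, 1.21]


def townsend_quartile(score):
    """Return 0 (least deprived, highest SES) to 3 (most deprived).

    Scores equal to a cutpoint go to the lower-numbered quartile; this
    boundary rule is an assumption made for the sketch.
    """
    return bisect_left(TOWNSEND_CUTPOINTS, score)


def stratified_eo(townsend_scores, predicted_risks, outcomes):
    """E/O within each Townsend quartile; None where no events occurred."""
    expected = [0.0] * 4
    observed = [0] * 4
    for score, risk, event in zip(townsend_scores, predicted_risks, outcomes):
        q = townsend_quartile(score)
        expected[q] += risk
        observed[q] += event
    return [e / o if o > 0 else None for e, o in zip(expected, observed)]
```

A gradient of E/O values that falls from quartile 0 to quartile 3 would correspond to the pattern reported for LCDRAT (2.03 in the highest-SES quartile to 1.26 in the lowest).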
The choice of what risk threshold to use for screening eligibility depends on multiple factors, including the accepted trade-off of benefits and harms and the capacity of the health system. We did not address these issues here, but we did identify thresholds that would classify the same number of individuals as screening eligible as USPSTF criteria. The thresholds selected by this approach, when considering USPSTF 2013 criteria, aligned with those already proposed for the PLCOm2012 (1.5% in our study vs. 1.51% in the NHS protocol) and the LLPv2 (2.3% in our study vs. 2.5% in the NHS protocol). 14 There was a larger difference between the threshold we identified for LCDRAT (0.8% 5-year lung cancer death risk) and previously proposed thresholds (1.2, 1.33, 1.7%). 19,31 Thresholds identified based on USPSTF 2020 criteria, which broadened eligibility substantially, were much lower, and it is not clear whether all individuals meeting these thresholds would have a favourable trade-off of screening benefits and harms.
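The threshold-matching idea can be sketched in Python as follows: given each person's predicted risk and the number of people eligible under categorical criteria, take the risk value that selects the same number of people, then compute the share of future cases at or above it. The function names are hypothetical, and tied risks at the boundary (which could select slightly more people than intended) are not handled specially in this sketch.

```python
def matched_threshold(risks, n_eligible):
    """Risk value such that selecting everyone at or above it picks
    n_eligible people, assuming no ties at the boundary."""
    if not 1 <= n_eligible <= len(risks):
        raise ValueError("n_eligible must be between 1 and the cohort size")
    # The n-th largest risk is the lowest risk among the selected group.
    return sorted(risks, reverse=True)[n_eligible - 1]


def fraction_of_cases_selected(risks, outcomes, threshold):
    """Share of future cases whose predicted risk is at or above threshold."""
    case_risks = [r for r, y in zip(risks, outcomes) if y == 1]
    if not case_risks:
        raise ValueError("no cases in the data")
    return sum(r >= threshold for r in case_risks) / len(case_risks)
```

Because the number screened is held fixed, differences between models in the fraction of cases selected reflect how well each model ranks individuals, not how many people it selects.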
Important limitations of risk-based eligibility for lung screening are receiving increased recognition. Risk models preferentially select older individuals, reducing life-years gained and cost-effectiveness, as well as individuals with comorbidities such as COPD who may have lower screening benefits. 19,35-37 To address these issues, a model to define eligibility based on predicted life-years gained from screening has been proposed, 37 which incorporates LCDRAT and an additional prediction model for overall mortality. Important evidence for the comparative performance of PLCOm2012 and LLPv2 will be provided by the Yorkshire Lung Screening Trial, which is enrolling participants based on eligibility by either model to compare their performance directly. 38 The trial is also collecting sufficient information to validate the LCRAT, LCDRAT, and Bach models retrospectively.
Our study has important limitations that result from its approach of analysing cohort data. Our findings cannot be assumed to be nationally representative for the UK, though the rank-order performance of lung cancer risk prediction models is likely generalisable. There was also a substantial amount of missing data for key smoking variables, which we handled by multiple imputation. For comparison, we calculated E/O statistics and AUCs using the subset of individuals in UK Biobank who had complete data on required variables (63% of the cohort). The degree of overestimation was reduced for all models (Supplementary Fig. 2), while AUCs were not affected (Supplementary Fig. 3). By contrast, our approach of analysing multiple cohorts is a strength, because it allows for evaluating whether results are consistent across studies. Inclusion of the Generations Study, despite its small size, is important due to the underrepresentation of women in the European screening trial literature. Women in the Generations Study had lower smoking intensity and longer periods of cessation compared with participants in the other studies, highlighting potential equity issues around screening eligibility.
The question of which risk model optimally defines screening eligibility is somewhat distinct from the question of which model can most practically be implemented. Web-based tools are available for some risk models, including the Risk-based NLST Outcomes Tool for LCDRAT and LCRAT 39 and MyLungRisk for LLPv2, 40 and a spreadsheet tool is available for PLCOm2012. 41 It is possible to integrate these tools into electronic medical records to facilitate calculations, but input information would first need to be verified with the patient. A practical solution may be to use a simplified model or algorithm applied to electronic medical records to initially identify people who are potentially eligible, followed by a more precise assessment of risk using data collected in person or by phone. The goal to automate further the calculation of lung cancer risk and the classification of individual screening eligibility represents an ongoing challenge.
In conclusion, we analysed the performance of lung cancer risk models in three UK cohorts, including the PLCOm2012 and LLPv2 models that are recommended for use in lung screening by the NHS protocol for Targeted Lung Health Checks. We found that the LLPv2 model had the worst calibration and classified the lowest proportion of future lung cancer cases as eligible for screening. The LLPv3 model had the best calibration (but poor discrimination), while the LCDRAT was best able to identify individuals at high risk of lung cancer. All models strongly over-predicted risk in groups with high SES, raising concerns about exacerbation of disparities in lung cancer screening. Taken together, our results suggest that the list of models endorsed by the NHS lung screening protocol could be revised and that further work may be needed to ensure that eligibility for lung cancer screening can be defined equitably in the UK population. More generally, they highlight the importance of carefully validating risk prediction models in specific contexts before they are applied in practice.