Developing a risk prediction tool for lung cancer in Kent and Medway, England: cohort study using linked data

Howell, David; Buttery, Ross; Badrinath, Padmanabhan; George, Abraham; Hariprasad, Rithvik; Vousden, Ian; George, Tina; Finnis, Cathy

doi:10.1038/s44276-023-00019-5

Article
Open access
Published: 17 October 2023

BJC Reports volume 1, Article number: 16 (2023)

Abstract

Background

Lung cancer has the poorest survival owing to late diagnosis, and there is no universal screening programme. Early detection is therefore crucial. Our objective was to develop a population-level lung cancer risk prediction tool.

Methods

We used a large place-based linked data set from a local health system in southeast England which contained extensive information covering demographic, socioeconomic, lifestyle, health, and care service utilisation. We exploited the power of Machine Learning to derive risk scores using linear regression modelling. Tens of thousands of model runs were undertaken to identify attributes which predicted the risk of lung cancer.

Results

Initially, 16 attributes were identified. A final combination of seven attributes, chosen on the basis of the number of cancers detected, formed the Kent & Medway lung cancer risk prediction tool. This was then compared with the eligibility criteria used in the wider Targeted Lung Health Checks programme. The prediction tool outperformed these criteria, detecting 822 cases compared with 581 under the lung check programme currently in operation.

Conclusion

We have demonstrated the useful application of Machine Learning in developing a risk score for lung cancer and discuss its clinical applicability.

Introduction

Lung cancer is one of the major causes of death worldwide [1, 2]. In the UK, 48,500 new lung cancer cases are detected every year and 34,800 people die from the disease, accounting for 21% of all cancer deaths during 2017–2019 [3]. An estimated 86% of lung cancer deaths in the UK are caused by tobacco smoking [4]. Furthermore, there is an association with prolonged environmental exposure to air pollutants such as sulphur dioxide, nitrogen oxides, nitrogen dioxides, or arsenic. Hence, nations with greater pollution levels are likely to have higher incidences of lung cancer [5]. Until the advent of the Targeted Lung Health Check (TLHC) pilots, a diagnosis of the disease could be made only when a person started to exhibit the symptoms of lung cancer. These symptoms can include coughing, shortness of breath, unexplained weight loss, wheezing, haemoptysis, chest discomfort, exhaustion and decreased appetite [6].

Lung cancer outcomes have improved only marginally over the last 40 years and remain poor in comparison to most other cancers—just 17.7% of women and 12.9% of men in the UK survive after diagnosis for 5 years or longer [7]. The lack of overt or specific symptoms in the early stages of lung cancer often leads to late presentations, resulting in delayed diagnosis and treatment [8]. However, early detection and diagnosis, followed by effective treatment, improves survival for nearly all cancer types. According to Cancer Research UK [9] “around 6 in 10 people with lung cancer survive their disease for 5 years or more, if diagnosed at the earliest stage. This falls to <1 in 10 people when lung cancer is diagnosed at the most advanced stage.”

When diagnosed early, more treatment options are available for lung cancer, including surgical resections. If operable, primary treatment costs are largely attributable to surgical removal procedures. However, as the disease advances to Stages 3 and 4, the expenses associated with surgical interventions tend to decrease, whilst the costs related to systemic therapies escalate significantly. This shift in treatment modalities is primarily due to the diminished feasibility of surgical removal as the cancer becomes more widespread. Instead, systemic therapies become more pivotal at advanced stages, aiming to control tumour growth, alleviate symptoms and potentially prolong survival. Consequently, timely identification and detection of lung cancer can significantly alleviate the financial burden on the state, the insurer, patients, and their families. This includes mitigating the expenses associated with advanced-stage treatments, extended hospital stays, intensive therapies, and palliative care services [10].

In the quest for earlier diagnosis of lung cancer, in June 2023, the UK government announced plans for a new national targeted lung cancer screening programme, based on learning from existing Targeted Lung Health Check (TLHC) pilot sites. The programme, which is supported by a recommendation from the UK National Screening Committee, will invite patients aged between 55 and 74 who are current or former smokers for a lung health check, which may include a low-dose CT scan. In areas where the TLHC programme has been operating, early data suggests that approximately 76% of lung cancers are diagnosed at stages 1 and 2, which is a substantial improvement compared with usual pathways of care [11].

Artificial Intelligence (AI) is a new and rapidly evolving field where computers are taught to think like humans. Due to its enhanced accuracy, precision, and decision-support capabilities, AI has begun to be implemented in modern medicine. It is being used in two ways, namely physical and virtual. Physical applications of AI include robots that are automated to perform tasks such as caring for the elderly and others that assist in surgeries. Machine learning (ML) is a subfield of AI that deals with the virtual aspect. ML models can be trained to detect or predict occurrences of a health condition [12]. AI is suitable in the medical field as, unlike doctors, it does not fatigue and can therefore process large numbers of images and data at any given time [13]. This requires a good prediction model to be designed, which involves acquiring a large dataset for training the model. The bigger and more diverse the dataset, the better the results that can be expected from it [14]. However, researchers need to be aware that quality, curation, and expert annotation are vitally important when considering what data to include.

With the help of AI, we can make accurate assessments of an individual’s risk of lung cancer. The detection or prediction of lung cancer serves as a prime illustration where the utilisation of AI is indispensable. This is because lung cancer is a highly time-sensitive condition and early diagnosis can be the difference between life and death. Risk factors associated with lifestyle choices can be used to provide profiles of potential risks. The objective of any risk prediction tool, such as the one described in this paper, is to identify a small fraction of the population in which a large proportion of the disease cases will occur [15].

The National Screening Committee has recommended population screening for lung cancer, as targeted lung cancer screening with low-dose Computerised Tomography is cost-effective at a threshold of £20,000 per QALY [16, 17]. Current attempts to improve early lung cancer diagnosis involve diagnostically evaluating large volumes of individuals, with successful case identification in less than 1% [18, 19]. The population of England is estimated to increase by 6% over the next decade [20]. Furthermore, there has been a 19% increase in the prevalence of cancer in England over the last decade, and published figures on the number of people waiting for a diagnosis or treatment for cancer show the huge challenge facing NHS cancer services, with tens of thousands of people waiting too long for diagnosis or vital treatment, especially since the start of the COVID-19 pandemic [21]. Hence, the NHS cannot afford to provide existing healthcare in the same way in the future and will not have a sufficient workforce to deliver it. This challenge is not isolated to the UK but is a common issue worldwide.

Our study aims to address the challenge of delayed diagnosis of lung cancer by exploiting the processing power of AI. We developed a model for providing risk-based predictions of lung cancer based on an individual’s lifestyle choices, family history and other clinical data. We had access to a large dataset consisting of 1.25 million adult residents across the Kent and Medway region called the Kent Integrated Dataset (KID) [22]. We harnessed the capabilities of ML to train the model in making risk predictions by extracting patterns from data records of residents who had been diagnosed and treated for lung cancer. Our objective was to find the best performing model among a group of ML models that gave accurate predictions of the risk of lung cancer.

Methods

The County of Kent

Kent County Council covers the largest population footprint of any council in England, with a population of 1.6 million [23]. The county has an exceptional spread of affluence and extreme poverty. Before COVID-19, a life expectancy gap of almost 20 years already existed between the least and most deprived wards [24, 25].

Dataset description

Data for this study was taken from the KID [22], which contains a vast array of pseudonymised integrated health and care data. The data for KID are derived from various sources. Nearly 40% of the data is from secondary care, over ¼ from primary care and the rest are from a range of sources including community and mental health trust providers and other publicly available data at a spatial level. The KID is overseen by a steering group known as the Kent & Medway Shared Health and Care Analytics Board (SHcAB) that includes representatives of Kent County Council, local health commissioners and information governance leads. The SHcAB considers issues such as information governance, development of the dataset and applications for use of the data. The Kent and Medway data warehouse team provides day-to-day administration and project management. Access was granted to the first author by the SHcAB for the study duration through established due process. Patients can opt-out of contributing data to the KID by informing their GP surgery that they do not want their data to be shared with external organisations. It has to be appreciated that the data is not in the public domain and it is a pseudonymised person level data set for most of the variables. We established a project oversight group, supported by the Kent & Medway cancer alliance which included cancer clinicians, service managers, Public Health physicians, epidemiologists, and AI experts. Regular stakeholder engagement took place throughout the study involving patients and public representatives.

Data contained within the KID represented a 6-year longitudinal record of health and care data for 2014–2019, covering 1,865,382 residents. An initial exclusion of those aged under 18 years was made (n = 599,866), which reduced the cohort to 1,265,516. We then removed a further 10,532 patients (0.8% of the remaining cohort) due to incomplete or missing records data (for example smoking status), which took the cohort size down to 1,254,984. We used a set of pre-determined criteria to exclude records with missing data: records were excluded where data on one or more key variables relevant to our analysis were missing. These variables were: Pseudonymised Unique Patient ID, Smoking Status of the individual, GP Practice of Registration, Age, Gender, and valid Postcode. Given that recording of ethnicity is poor across the NHS, we did not use it as an exclusion criterion. The final dataset contained a total of 1,254,984 patients, of whom 6053 were diagnosed with primary lung cancer during this period, and these were included within the scope of this investigation. The final dataset used in the analysis had no missing data on smoking status. The lung cancer cohort was made up only of patients with primary malignant lung cancers, excluding benign tumours and metastases from other types of cancer. To ensure comprehensive capture of all patients meeting the criteria, we assessed both primary and secondary healthcare records using relevant SNOMED or ICD-10 codes, respectively. Patients with lung cancer included all confirmed diagnoses regardless of the care setting of diagnosis, staging at the time of diagnosis, disease progression or onward treatment options and outcomes. Core dimensions of data used within this study are shown below:

Patient Demographics
Primary Care (Events, Consultations, Long term condition registers, Medications, Deaths)
Secondary Care (A&E, Inpatient Spells and Outpatients, Critical Care Bed Days)
Mental Health (Inpatient and Outpatient History)
Community Care (Contacts, Appointments, Minor Injuries Units and Walk In Centers)
Wider Health Determinants including Housing, Education, Employment, and Income.
Environmental Datasets—Pollution, Radon ground levels

We did not have information on all the above variables at an individual patient level. We had individual patient-level data on patient demographics and on primary care, secondary care, mental health and community clinical care activities. For the wider determinants of health, including environmental factors, we applied spatial-level data at the Lower layer Super Output Area (LSOA), a small geographical area in the UK with an average of 650 households, to the patient-level datasets.
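The cohort construction and spatial linkage described in this section can be sketched as follows. This is a minimal illustration only: the field names (`patient_id`, `lsoa`, `deprivation_decile` and so on) and the sample records are hypothetical, since the KID schema is not published in the paper, and the real pipeline was presumably run inside the data warehouse rather than on Python dictionaries. The last lines simply check that the reported cohort counts are internally consistent.

```python
# Hypothetical field names standing in for the six key variables listed above.
REQUIRED_FIELDS = ("patient_id", "smoking_status", "gp_practice",
                   "age", "gender", "postcode")


def build_cohort(records, lsoa_attributes, min_age=18):
    """Apply the exclusions, then attach area-level values by LSOA code."""
    cohort = []
    for record in records:
        # Drop records missing any key variable (ethnicity is deliberately
        # not in the list). Checked first so that age is safe to compare;
        # the final cohort is the same whichever exclusion runs first.
        if any(record.get(field) in (None, "") for field in REQUIRED_FIELDS):
            continue
        if record["age"] < min_age:
            continue
        # Spatial-level data are shared by everyone living in the same LSOA.
        area_values = lsoa_attributes.get(record.get("lsoa"), {})
        cohort.append({**record, **area_values})
    return cohort


# Made-up records: one adult, one child, one adult with no smoking status.
records = [
    {"patient_id": "p1", "smoking_status": "current", "gp_practice": "G1",
     "age": 64, "gender": "F", "postcode": "XX1 1XX", "lsoa": "LSOA_A"},
    {"patient_id": "p2", "smoking_status": "never", "gp_practice": "G1",
     "age": 12, "gender": "M", "postcode": "XX1 1XX", "lsoa": "LSOA_A"},
    {"patient_id": "p3", "smoking_status": None, "gp_practice": "G2",
     "age": 70, "gender": "M", "postcode": "XX2 2XX", "lsoa": "LSOA_B"},
]
lsoa_attributes = {"LSOA_A": {"deprivation_decile": 3}}
cohort = build_cohort(records, lsoa_attributes)

# Cohort flow using the counts reported in the text.
residents = 1_865_382
adults = residents - 599_866          # after excluding under-18s
final_cohort = adults - 10_532        # after excluding incomplete records
```

Because area-level values are copied onto every resident of an LSOA, two patients in the same area always share them, which is the source of the ecological-fallacy caveat raised under Limitations.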

Data access

All NHS organisations, including general practices, across Kent & Medway had entered into Joint Data Controller arrangements, which include a common process for safe, secure and lawful access to their data in the KID for population health analytics, including work such as ours. This process is administered by a system-wide oversight group representing the organisations, called the Kent & Medway Shared Health & Care Analytics Board. Patient-level consent does not apply in this context as the dataset is historical, fully pseudonymised and de-identified. Because these arrangements cover access to the data in the KID, its analysis and the sharing of findings, no ethical approval was required.

Data pre-processing

The dataset contained missing values mainly in the attribute named ‘ethnicity’, as shown in Table 1, despite considerable work to capture ethnicity coding from various sources. We therefore excluded this attribute from the model, as we felt it was not appropriate to substitute an average value or a synthetic data derivative, although this is common practice. Other dataset attributes had no missing or outlier values, so no further transformations were made on the remainder of the dataset.

Table 1 Baseline characteristics of cohort groups.

The data attributes are grouped into life history, symptoms, diagnostics, treatment, and end-of-life care based on the stage at which the data are collected, as depicted in Fig. 1. To prepare the model for predicting patients’ risk ratios, we extracted only the essential attributes from the dataset. These columns were selected based on their potential to provide valuable predictive information. We specifically focused on data concerning the pathways leading to the diagnosis of lung cancer as it held valuable insights regarding the associated causes and symptoms. Attributes related to cancer diagnosis or data related to 2-week wait urgent referrals, appointments to see an oncologist, Chest X-Rays and Low Dose Computerised Tomography scans for confirming diagnosis, treatment options such as chemotherapy and radiotherapy and mortality were omitted. These attributes were excluded from the dataset because they were deemed as non-predictive elements that did not offer significant insights into the associated risks of a positive diagnosis of lung cancer. We excluded the above diagnostics and treatment elements up to 12 months before the date of diagnosis.

**Fig. 1: Pathways leading up to and beyond a Lung Cancer Diagnosis for patients.**

Relative risks (RR) were calculated for all the variables and were used to determine the important attributes and for categorisation. RR is the ratio of the incidence of an event (lung cancer) with an exposure (e.g., smoking) to the incidence of the same event without the exposure. For example, the relative risk of developing lung cancer in smokers (the exposed group) versus non-smokers (the non-exposed group) is the probability of developing lung cancer for smokers divided by the probability of developing lung cancer for non-smokers. All characteristics of the individual datasets, such as medications, events, tests, demographic qualities or wider determinants of health, were tested and risk-scored using this methodology. To reduce the number of categories, we collapsed these into meaningful groupings, informed by the higher relative risk of related variables. For instance, respiratory disorders such as COPD and asthma, each of which has numerous diagnosis codes, were built up into simple three-state options: Yes, No or Has Familial History. Other features with high dimensionality, such as smoking history and activity, were ranked into similar groups by creating scores.
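The relative risk definition above amounts to a ratio of two proportions. The short sketch below shows the calculation; the counts are illustrative only and are not taken from the KID or from Table 2.

```python
def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Incidence in the exposed group divided by incidence in the unexposed group."""
    risk_exposed = cases_exposed / n_exposed
    risk_unexposed = cases_unexposed / n_unexposed
    return risk_exposed / risk_unexposed


# Illustrative counts only: 30 cases among 1,000 smokers versus
# 10 cases among 2,000 non-smokers.
rr_smoking = relative_risk(cases_exposed=30, n_exposed=1000,
                           cases_unexposed=10, n_unexposed=2000)
print(round(rr_smoking, 2))
```

An RR above 1 indicates higher incidence in the exposed group, and an RR below 1 indicates lower incidence.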

Model development

We used feature encoding to reduce the number of states, simplify model development and enhance performance. One-hot encoding and standard scaling were used for the feature encoding [26]. Given the need for a scalar risk score to aid prioritisation of patients at greatest risk of developing lung cancer within a screening pool, logistic and other categorical models were ruled out. Traditional linear regression was selected as an initial candidate model to detect lung cancers early and thereby improve outcomes over and above the current screening protocol for lung cancer in the UK.
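As a rough illustration of the two encoding steps named above, the sketch below one-hot encodes a three-state attribute and standard-scales a numeric one using only the Python standard library. The helper names and sample values are hypothetical, and the use of the population standard deviation is an assumption; the paper does not state which implementation or variant was used.

```python
import statistics


def one_hot(values, categories):
    """Return one 0/1 indicator column per category for each value."""
    return [[1.0 if value == category else 0.0 for category in categories]
            for value in values]


def standard_scale(values):
    """Centre on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(value - mean) / sd for value in values]


# Three-state grouping as described for respiratory disorders.
states = ["Yes", "No", "Has Familial History"]
encoded = one_hot(["No", "Yes", "Has Familial History"], states)

# Made-up ages, scaled to zero mean and unit variance.
scaled_ages = standard_scale([40, 50, 60])
```

Note that keeping every indicator column alongside an intercept makes the columns perfectly collinear, so one category is commonly dropped before fitting an ordinary linear regression.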

Using a combination of methods, namely insights from the data, proposals from clinical experts and published literature [27, 28], 16 attributes were identified. We developed a loop within Python [30] to evaluate all the possible combinations of these 16 attributes for their ability to detect lung cancer. For each combination of n attributes, where n could be anywhere between 2 and 16, we took our entire population data and split it into 70% training and 30% validation datasets [29]. We then used the 70% dataset to build a linear regression model on these n attributes and applied this model to the 30% validation (test) dataset to obtain an output, namely the number of lung cancer cases detected. This was repeated one hundred times (Fig. 2) in order to create multiple outputs that could be averaged to test the models’ repeatability and for onward evaluation. We then employed bootstrapping [31] to test the ability of the model to generalise across randomised populations. In each run, both the 70% training set and the 30% validation set were again randomised to eliminate any potential biases or chance influences. This randomisation also aimed to provide comprehensive average performance statistics for all models. In each model run, the TLHC eligibility criteria were applied and the number of cancers counted. This was compared with the number found among the highest risk-scored patients identified by the prediction model, keeping both screening cohort sizes equal.
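The search procedure described above has three building blocks: enumerating attribute combinations, drawing a random 70/30 split, and fitting a linear regression. The sketch below shows one way these could look in plain Python. It is not the authors' code: the function names are hypothetical, the regression is solved with the normal equations for brevity, and the tiny data set is made up. It also shows that, if every combination of 2 to 16 attributes is tried, there are 65,519 combinations, which is consistent with the "tens of thousands" of model runs mentioned in the Abstract.

```python
import itertools
import math
import random


def attribute_subsets(attributes, min_size=2):
    """Yield every combination containing at least `min_size` attributes."""
    for size in range(min_size, len(attributes) + 1):
        yield from itertools.combinations(attributes, size)


def split_70_30(rows, rng):
    """Shuffle the rows and return (training 70%, validation 30%)."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = (7 * len(shuffled)) // 10
    return shuffled[:cut], shuffled[cut:]


def fit_linear_regression(X, y):
    """Least-squares coefficients (intercept first) via the normal equations.

    Assumes the columns of X are not collinear; otherwise a pivot is zero.
    """
    rows = [[1.0] + list(x) for x in X]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * target for r, target in zip(rows, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] for i in range(p)]


def predict(coefficients, x):
    """Continuous risk score for one individual."""
    return coefficients[0] + sum(c * v for c, v in zip(coefficients[1:], x))


# Number of combinations of 2..16 attributes drawn from 16 candidates.
n_subsets = sum(math.comb(16, k) for k in range(2, 17))

rng = random.Random(0)  # fixed seed so the illustration is repeatable
train, valid = split_70_30(range(100), rng)

# Toy fit on made-up data where y = 1 + 2x exactly.
coefficients = fit_linear_regression([[0], [1], [2], [3]], [1, 3, 5, 7])
```

In a full run, each combination would be refitted on a fresh random split many times and the validation results averaged, as the paragraph above describes.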

**Fig. 2: Steps involved from the beginning to the end of the study process.**

Model evaluation

The output of this model is not binary/logistical (with or without cancer) but a continuum of risk of developing the cancer. As we have stated within the dataset description section, the dataset we used also did not contain person-level information on all the variables included in the model. Hence, the traditional parameters to express the validity of a screening test such as sensitivity, specificity, positive predictive value, negative predictive value, area under the curve and likelihood ratios are not applicable. Instead, we rationalised that if the model is working most efficiently, we should be able to demonstrate more lung cancer cases being found within a screening pool in the population compared to that of the current screening pilots ongoing in England. In order to baseline our evaluation, therefore, we compared the output of the model against the current screening population for the TLHC [32] programme. Patients meeting the following three criteria will be invited for screening:

are over 55 but younger than 75 years old
are registered with a GP in the area where the scheme is operating
have ever smoked, and this is recorded with the GP.

The number of cases found under the TLHC criteria was then compared with the number of cases identified by the linear regression model using the top-performing combination of attributes.
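The evaluation logic described in this section, counting cancers in the rule-based screening pool and in an equally sized pool of the highest risk scores, can be sketched as below. The record fields, the sample people and the risk scores are all made up for illustration. The age bounds are an assumption: the criteria say "over 55 but younger than 75", while the Introduction says "aged between 55 and 74", and the sketch follows the latter.

```python
def tlhc_eligible(person):
    """Rule-based eligibility: age band, registered with a local GP, ever smoked."""
    return (55 <= person["age"] <= 74          # assumed inclusive bounds
            and person["registered_in_area"]
            and person["ever_smoked"])


def compare_yield(people, risk_scores):
    """Return (pool size, cases under the rule, cases in the top-scored pool)."""
    eligible = [p for p in people if tlhc_eligible(p)]
    pool_size = len(eligible)
    rule_cases = sum(p["lung_cancer"] for p in eligible)
    # Rank by score only, highest first, and keep a pool of the same size.
    ranked = sorted(zip(risk_scores, people),
                    key=lambda pair: pair[0], reverse=True)
    tool_cases = sum(p["lung_cancer"] for _, p in ranked[:pool_size])
    return pool_size, rule_cases, tool_cases


def person(age, ever_smoked, lung_cancer):
    """Build a made-up record; everyone is registered in the area."""
    return {"age": age, "ever_smoked": ever_smoked,
            "registered_in_area": True, "lung_cancer": lung_cancer}


people = [person(60, True, 0), person(70, True, 1), person(80, True, 1),
          person(50, False, 0), person(65, False, 0), person(58, True, 0)]
risk_scores = [0.2, 0.9, 0.8, 0.1, 0.3, 0.4]  # made-up model output

result = compare_yield(people, risk_scores)
print(result)  # (3, 1, 2)
```

In this toy example the rule misses the 80-year-old case because of the age cut-off, whereas the score-ranked pool of the same size includes it.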

Results

Selected characteristics of cohorts included in the study are shown in Table 1.

Relative risks for the attributes included in the model are presented in Table 2.

Table 2 Relative risks for the attributes and various levels of exposures included in the model.

The attribute concerning family history of cancer also includes lung cancer. Many attributes were associated with an increased risk of lung cancer and others with a lower risk. As expected, key attributes showing a higher risk included older age, lack of physical activity, COPD, hypertension, other cancers and family history of other cancers, TB and family history of TB, and financial status. Attributes associated with lower risk included intense physical activity, younger age, never having smoked and higher socioeconomic status. As the results are from univariate linear regression, the effect of confounding is apparent. For example, hypertension is associated with age.

Out of many thousands of combinations, the ten combinations of attributes that gave the best results in identifying lung cancers were selected (Table 3). The selected combinations contained between 7 and 11 attributes. The top-performing combination included the following attributes: age; activity score; smoking score; any respiratory illness; hypertension; cancer; and tuberculosis.

Table 3 Attributes included in the best performing models and cancer cases detected.

We needed to compare the performance of the 7-attribute combination, henceforth referred to as the Kent & Medway risk prediction tool, with that of the TLHC eligibility criteria. By applying the three TLHC criteria to the 30% test population, we identified on average 56,663 people (the screening cohort) who would be eligible under the TLHC criteria. Among these, 581 lung cancer cases were recorded. We then applied the Kent & Medway risk prediction tool to the same 30% test population, which predicted a lung cancer risk score for every individual. From this list, we identified the 56,663 people with the highest risk scores, and within this population 822 lung cancer cases were recorded. This was on average a benefit of 41.4% over and above the contemporaneous approach.
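The headline comparison can be reproduced approximately from the averaged counts reported above. The calculation below is only a check on the arithmetic. The whole-number case counts give a gain of about 41.5%, close to the reported 41.4%; the small difference presumably arises because the published figure was computed from unrounded averages across runs.

```python
cohort_size = 56_663   # average screening cohort under either approach
tlhc_cases = 581       # cases among people meeting the TLHC criteria
tool_cases = 822       # cases among the same number of top-scored people

extra_cases = tool_cases - tlhc_cases
relative_gain = extra_cases / tlhc_cases

# Cases found per 1,000 people screened under each approach.
tlhc_per_1000 = 1000 * tlhc_cases / cohort_size
tool_per_1000 = 1000 * tool_cases / cohort_size

print(f"extra cases: {extra_cases}")
print(f"relative gain: {relative_gain:.1%}")
print(f"yield per 1,000: {tlhc_per_1000:.1f} vs {tool_per_1000:.1f}")
```

Expressed as yield, this is roughly 10 cases per 1,000 people screened under the TLHC criteria against roughly 14.5 per 1,000 for the risk tool.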

Discussion

Our study is an attempt to develop a lung cancer risk prediction tool to identify sections of the population at a higher risk of developing lung cancer. We utilised data at both person and spatial level, covering social, demographic, lifestyle and clinical features, and used the power of ML to achieve our objective. We initially identified 16 attributes that could predict the population at a higher risk of lung cancer. Our objective was to increase the power of cancer detection in a defined population, as the current TLHC eligibility criteria [32] are too broad and blunt. By running simultaneous models using bootstrapping, we were able to test numerous combinations of attributes over tens of thousands of model runs, which provided us with the best model with 7 attributes. We adopted a linear regression model, unlike others in the lung cancer prediction literature who have employed a suite of models [33, 34]. This is because our objective was to identify a cohort of people at higher risk of lung cancer so that they can be targeted for screening. There is a linear association between many known attributes and the risk of lung cancer. Furthermore, the lung cancer risk score, which is our main outcome of interest, is a continuous variable and hence logistic regression is not applicable here. Use of ML has been proposed and adopted in reading computed tomography images [34]. However, in our study we used data points derived from routine linked administrative data sets, which contained information on every patient irrespective of their clinical characteristics, to predict their risk of lung cancer by exploiting the potential of ML. It may be surprising that the data on smoking status were almost complete; this is not usually the case, especially in Primary Care, although recording shows continuous improvement [35]. The potential reasons for such high completeness in our study include the following. First, the KID is a linked dataset, enabling smoking status to be captured from multiple points of care. Second, owing to its specialist nature, considerable effort and resources have been spent to retrospectively ensure that the data are as complete as possible so that epidemiological research can be undertaken at the population level [22].

Clinical utility of the work

The product of this work has immediate clinical implications and thus has the potential to improve patient care and resource utilisation. As the model outperforms the standard wider TLHC eligibility criteria, it would help us to detect up to 40% more cancers. Currently, we are exploring how best to incorporate it as a screening and early diagnosis intervention. There are two options under consideration: to provide a more comprehensive and refined screening model based on our risk tool than the TLHC eligibility criteria; or for the GP to calculate the risk score for each patient during a consultation, similar to the Framingham cardiac risk score [36], and use this for further action. Using the first option, we can further refine the risk group for screening, thereby increasing cancer detection and saving scarce cancer diagnostic and treatment resources. We intend to incorporate the tool into the management information system of the early cancer diagnosis team at the local hospital as a pilot and then to roll it out across a wider geographical area. The first author has already secured agreement in principle for this from the local cancer clinical and managerial leaders.

Strengths

We used a place-based linked data set entirely produced by a local health system whose primary use was for commissioning intelligence and health care planning purposes. It can paint the entire picture of the population, as it contains information from general practice, community health services, mental health services and hospital services. Furthermore, it included integrated spatial-level information on key socioeconomic factors and the extent of deprivation. This makes it a powerful repository for developing any risk prediction tool compared with tools that rely only on electronic patient clinical records [37]. Our data are complete compared with those of Callender et al. [38], where there is a large number of missing values. We generated relative risks at a very granular level of detail in order to develop our sixteen aggregated attributes. We established a powerful partnership of cancer clinicians, Public Health physicians, epidemiologists, ML experts and leaders from the cancer alliance, who were involved throughout from the inception of the project to its completion. This helped us to incorporate varying perspectives. Key stakeholders’ views were constantly sought and acted upon during this work. This included regular meetings with the early diagnosis team, digital cancer alliance board, shared health and care analytics board and regional applied research consortium digital innovation group. Patients and the public are represented in most of these in order to ensure that there is support for this initiative.

Limitations

A few limitations of our study need to be acknowledged. All seven variables included in the model had complete data, although this needs to be treated with the following caution. For the activity score, we used data at a population level, i.e. the lower layer super output area. This does not reflect the score for an individual patient. Data on socioeconomic status were also only available at a spatial level. Although using data at geographical/spatial level gives us the advantage of complete data with no missing values, one needs to be cognisant of the limitations of this approach and the well-documented ecological fallacy [39]. Four of the variables in the final model were purely clinical conditions. These are: Any Respiratory Illness, Hypertension, Cancer, and Tuberculosis. It is extremely unlikely that such an important diagnosis will be left uncoded in both primary and secondary care. It is generally agreed that if such a clinical diagnosis does not appear on the patient record, the patient does not have the condition, as it is not current practice to code that a patient does not have a condition. We recognise that this may not be universally true for all patients, but it is unlikely to have a significant impact upon our longitudinal study results. For both passive smoking and family history of cancer, we assumed that if this information is not coded then the individual does not have that exposure, although this may not always be accurate. As our analysis included over a million records, any under- or over-assumption is likely to be random and will not have a major impact on the results. Ethnicity was not included in the model because the data were incomplete. In the future, we will ensure that ethnicity is included in further work. Data included in the study are only up to 2019.

We wish to acknowledge that we have not used traditional parameters to express the validity of a screening test as this approach is not applicable as explained in the model evaluation section. We have used a different approach to evaluate the model. It is the authors’ belief that the approach adopted in this study still adds useful information to the literature as this method has been seldom applied. This needs to be borne in mind when interpreting the findings and developing any policy approach based upon our findings. Due to changes in commissioning arrangements, the KID was rendered static and data were not updated after 2019. We do not anticipate any weakening of the power of the prediction tool due to non-inclusion of more recent data. This study was undertaken in Kent & Medway in the southeast of England. Hence the question of generalisability across the United Kingdom needs to be considered. In our view, it is unlikely that the population and the strength of association between the attributes and lung cancer are so different elsewhere that the results will not be applicable. However, this may not be true for an international comparison. Another important limitation worthy of note is that applying similar machine-learning approaches using other databases with different characteristics may result in a less sensitive outcome. Hence, before our approach is adopted this needs to be tested on a much larger patient population under different settings.

Conclusion

In this paper, we have demonstrated the useful application of Machine Learning in developing a risk score for lung cancer using a large, place-based linked data set. We involved multidisciplinary stakeholders throughout this work, including patients and the public. Our risk prediction tool is superior to the eligibility criteria currently in use in the pilot sites for the TLHC Programme. This is a good example where local experts in fields as diverse as AI, ML, clinical oncology, Public Health and Epidemiology came together to produce an innovative solution to improve patient care and save scarce health care resources.

Data availability

The data are not publicly available as the KID contains pseudonymised person-level linked data. However, access to data can be requested via the SHcAB.

Acknowledgements

We are grateful to the SHcAB for granting us permission to access and use the data. We acknowledge the support of Kent & Medway Cancer Alliance. We sincerely thank Dr Anjan Ghosh, Director of Public Health, Kent County Council for his support and encouragement.

Funding

The first and second authors received funding from the Kent and Medway Cancer Alliance to undertake the analysis.

Author information

Authors and Affiliations

Quantum Analytica, Berkshire, UK
David Howell & Ross Buttery
Surrey Heartlands Integrated Care System, Surrey, UK
David Howell
Public Health Medicine, Kent County Council, Maidstone, England, UK
Padmanabhan Badrinath & Abraham George
University of Cambridge, Cambridge, UK
Padmanabhan Badrinath
Kent and Medway Medical School, Kent, UK
Abraham George
Vellore Institute of Technology, Vellore, Tamil Nadu, India
Rithvik Hariprasad
Thames Valley Cancer Alliance, Reading, UK
Ian Vousden
NHS England - South East, Southampton, UK
Ian Vousden
Kent & Medway Cancer Alliance, Maidstone, UK
Tina George
Targeted Lung Health Checks, Sussex, UK
Tina George
NHS Sussex Integrated Care Board, Worthing, England, UK
Tina George
Cancer Research UK GP, London, UK
Tina George
Early Cancer Diagnosis and Cancer Health Inequalities, Kent and Medway Cancer Alliance, Maidstone, UK
Cathy Finnis

Contributions

All authors contributed to the publication according to the ICMJE guidelines for authorship. All authors read and approved the submitted version of the manuscript. Each author has agreed both to be personally accountable for the author’s own contributions and to ensure that questions related to the accuracy or integrity of any part of the work, even ones in which the author was not personally involved, are appropriately investigated, resolved, and the resolution documented in the literature. Study concept and design: DH, RB, AG. Acquisition of the data: DH, RB, AG. Analysis and interpretation of data: DH, RB, AG, PB, RH, IV, CF, TG. Drafting of the manuscript: DH, RB, PB, AG, RH, IV, CF, TG. Statistical analysis: DH, RB, PB, AG, RH. Manuscript review and approval: DH, RB, PB, AG, RH, IV, CF, TG. Obtained funding: DH, IV.

Corresponding author

Correspondence to David Howell.

Ethics declarations

Competing interests

Two of the authors are directors of Quantum Analytica.

Ethics

Ethical approval was not required as this work was undertaken as part of the authors’ job role and as a service activity to inform health care planning and delivery.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Dr Tina George: The views expressed here are the professional views of the author and in no way represent the views of all the organisations this author has been associated with, at present or in the past.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Howell, D., Buttery, R., Badrinath, P. et al. Developing a risk prediction tool for lung cancer in Kent and Medway, England: cohort study using linked data. BJC Rep 1, 16 (2023). https://doi.org/10.1038/s44276-023-00019-5

Received: 23 June 2023
Revised: 08 September 2023
Accepted: 26 September 2023
Published: 17 October 2023
DOI: https://doi.org/10.1038/s44276-023-00019-5
