Introduction

Lung cancer is one of the major causes of death worldwide [1, 2]. In the UK, 48,500 new lung cancer cases are diagnosed and 34,800 people die from the disease every year, accounting for 21% of all cancer deaths during 2017–2019 [3]. An estimated 86% of lung cancer deaths in the UK are caused by tobacco smoking [4]. There is also an association with prolonged environmental exposure to pollutants such as sulphur dioxide, nitrogen oxides, nitrogen dioxide and arsenic. Hence, nations with greater pollution levels are likely to have higher incidences of lung cancer [5]. Until the advent of the Targeted Lung Health Check (TLHC) pilots, a diagnosis could only be made once a person started to exhibit symptoms of lung cancer. These symptoms can include coughing, shortness of breath, unexplained weight loss, wheezing, haemoptysis, chest discomfort, exhaustion and decreased appetite [6].

Lung cancer outcomes have improved only marginally over the last 40 years and remain poor in comparison to most other cancers: just 17.7% of women and 12.9% of men in the UK survive for 5 years or longer after diagnosis [7]. The lack of overt or specific symptoms in the early stages of lung cancer often leads to late presentation, resulting in delayed diagnosis and treatment [8]. However, early detection and diagnosis, followed by effective treatment, improves survival for nearly all cancer types. According to Cancer Research UK [9], “around 6 in 10 people with lung cancer survive their disease for 5 years or more, if diagnosed at the earliest stage. This falls to <1 in 10 people when lung cancer is diagnosed at the most advanced stage.”

When diagnosed early, more treatment options are available for lung cancer, including surgical resections. If operable, primary treatment costs are largely attributable to surgical removal procedures. However, as the disease advances to Stages 3 and 4, the expenses associated with surgical interventions tend to decrease, whilst the costs related to systemic therapies escalate significantly. This shift in treatment modalities is primarily due to the diminished feasibility of surgical removal as the cancer becomes more widespread. Instead, systemic therapies become more pivotal at advanced stages, aiming to control tumour growth, alleviate symptoms and potentially prolong survival. Consequently, timely identification and detection of lung cancer can significantly alleviate the financial burden on the state, the insurer, patients, and their families. This includes mitigating the expenses associated with advanced-stage treatments, extended hospital stays, intensive therapies, and palliative care services [10].

In the quest for earlier diagnosis of lung cancer, in June 2023, the UK government announced plans for a new national targeted lung cancer screening programme, based on learning from existing Targeted Lung Health Check (TLHC) pilot sites. The programme, which is supported by a recommendation from the UK National Screening Committee, will invite patients aged between 55 and 74 who are current or former smokers for a lung health check, which may include a low-dose CT scan. In areas where the TLHC programme has been operating, early data suggests that approximately 76% of lung cancers are diagnosed at stages 1 and 2, which is a substantial improvement compared with usual pathways of care [11].

Artificial Intelligence (AI) is a rapidly evolving field in which computers are trained to perform tasks that normally require human intelligence. Owing to its accuracy, precision and decision-support capabilities, AI has begun to be implemented in modern medicine. It is used in two ways: physical and virtual. Physical applications of AI include robots that are automated to perform tasks such as caring for the elderly or assisting in surgery. Machine learning (ML) is a subfield of AI that deals with the virtual aspect. ML models can be trained to detect or predict occurrences of a health condition [12]. AI is well suited to the medical field because, unlike doctors, it does not suffer from fatigue and can therefore process large numbers of images and large volumes of data at any given time [13]. This requires a well-designed prediction model, which in turn requires a large dataset for training. The bigger and more diverse the dataset, the better the results that can be expected from it [14]. However, researchers need to be aware that quality, curation and expert annotation are vitally important when considering what data to include.

With the help of AI, we can make accurate assessments of an individual’s risk of lung cancer. The detection or prediction of lung cancer is a prime example of where AI is particularly valuable, because lung cancer is a highly time-sensitive condition and early diagnosis can be the difference between life and death. Risk factors associated with lifestyle choices can be used to build profiles of potential risk. The objective of any risk prediction tool, such as the one described in this paper, is to identify a small fraction of the population in which a large proportion of the disease cases will occur [15].

The National Screening Committee has recommended population screening for lung cancer, as targeted lung cancer screening with low-dose Computerised Tomography is cost-effective at a threshold of £20,000 per QALY [16, 17]. Current attempts to improve early lung cancer diagnosis involve diagnostically evaluating large volumes of individuals, with cases successfully identified in less than 1% [18, 19]. The population of England is estimated to increase by 6% over the next decade [20]. Furthermore, there has been a 19% increase in the prevalence of cancer in England over the last decade, and published figures on the number of people waiting for a diagnosis or treatment for cancer show the huge challenge facing NHS cancer services, with tens of thousands of people waiting too long for diagnosis or vital treatment, especially since the start of the COVID-19 pandemic [21]. Hence, the NHS cannot afford to continue providing healthcare in the same way and will not have a sufficient workforce to do so. This challenge is not isolated to the UK but is a common issue worldwide.

Our study aims to address the challenge of delayed diagnosis of lung cancer by exploiting the processing power of AI. We developed a model for providing risk-based predictions of lung cancer based on an individual’s lifestyle choices, family history and other clinical data. We had access to a large dataset consisting of 1.25 million adult residents across the Kent and Medway region called the Kent Integrated Dataset (KID) [22]. We harnessed the capabilities of ML to train the model in making risk predictions by extracting patterns from data records of residents who had been diagnosed and treated for lung cancer. Our objective was to find the best performing model among a group of ML models that gave accurate predictions of the risk of lung cancer.

Methods

The County of Kent

Kent County Council covers the largest population footprint of any council in England, with a population of 1.6 million [23]. The county spans an exceptional range from affluence to extreme poverty. Before COVID, a life expectancy gap of almost 20 years already existed between the least and most deprived wards [24, 25].

Dataset description

Data for this study were taken from the KID [22], which contains a vast array of pseudonymised integrated health and care data derived from various sources. Nearly 40% of the data is from secondary care, over a quarter from primary care, and the rest from a range of sources including community and mental health trust providers and other publicly available data at a spatial level. The KID is overseen by a steering group known as the Kent & Medway Shared Health and Care Analytics Board (SHcAB), which includes representatives of Kent County Council, local health commissioners and information governance leads. The SHcAB considers issues such as information governance, development of the dataset and applications for use of the data. The Kent and Medway data warehouse team provides day-to-day administration and project management. Access was granted to the first author by the SHcAB for the study duration through established due process. Patients can opt out of contributing data to the KID by informing their GP surgery that they do not want their data to be shared with external organisations. The data are not in the public domain, and for most variables the KID is a pseudonymised person-level dataset. We established a project oversight group, supported by the Kent & Medway cancer alliance, which included cancer clinicians, service managers, Public Health physicians, epidemiologists and AI experts. Regular stakeholder engagement, involving patient and public representatives, took place throughout the study.

The KID contained a 6-year longitudinal record of health and care data for 2014–2019 covering 1,865,382 residents. We first excluded those aged under 18 years (n = 599,866), which reduced the cohort to 1,265,516. We then removed a further 10,532 patients (0.8% of the remaining adult cohort) because of incomplete or missing data (for example, smoking status), giving a final cohort of 1,254,984. Records were excluded using pre-determined criteria: a record was removed if data were missing for one or more key variables relevant to our analysis, namely Pseudonymised Unique Patient ID, Smoking Status, GP Practice of Registration, Age, Gender and valid Postcode. Given that recording of ethnicity is poor across the NHS, we did not use it as an exclusion criterion. Of the 1,254,984 patients in the final dataset, 6,053 were diagnosed with primary lung cancer during this period and were included within the scope of this investigation. The final dataset had no missing data on smoking status. The lung cancer cohort comprised only patients with primary malignant lung cancers, excluding benign tumours and metastases from other types of cancer. To ensure comprehensive capture of all patients meeting the criteria, we assessed both primary and secondary healthcare records using the relevant SNOMED or ICD-10 codes, respectively. The lung cancer cohort included all confirmed diagnoses regardless of the care setting at diagnosis, staging at the time of diagnosis, disease progression, or onward treatment options and outcomes. Core dimensions of data used within this study are shown below:

  • Patient Demographics

  • Primary Care (Events, Consultations, Long term condition registers, Medications, Deaths)

  • Secondary Care (A&E, Inpatient Spells and Outpatients, Critical Care Bed Days)

  • Mental Health (Inpatient and Outpatient History)

  • Community Care (Contacts, Appointments, Minor Injuries Units and Walk-in Centres)

  • Wider Health Determinants including Housing, Education, Employment, and Income.

  • Environmental Datasets—Pollution, Radon ground levels

We did not have information on all the above variables at an individual patient level. We had individual patient-level data on patient demographics and on primary care, secondary care, mental health and community clinical care activities. For the wider determinants of health, including environmental factors, we applied spatial-level data to the patient-level datasets at the level of the Lower layer Super Output Area (LSOA), a small geographical area in the UK with an average of 650 households.
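The way area-level values are attached to patient-level records can be illustrated with a minimal sketch. It assumes each patient record carries an LSOA code; the field names, LSOA codes and values below are hypothetical and are not taken from the KID.

```python
# Hypothetical area-level lookup keyed by LSOA code (illustrative values only).
lsoa_attributes = {
    "E01000001": {"deprivation_decile": 3, "activity_score": 2},
    "E01000002": {"deprivation_decile": 8, "activity_score": 4},
}

# Hypothetical pseudonymised patient-level rows.
patients = [
    {"patient_id": "p1", "age": 67, "lsoa": "E01000001"},
    {"patient_id": "p2", "age": 54, "lsoa": "E01000002"},
    {"patient_id": "p3", "age": 71, "lsoa": "E01000001"},
]


def attach_area_attributes(patient_rows, area_lookup):
    """Return new patient rows with the area-level values of their LSOA added."""
    enriched = []
    for row in patient_rows:
        # Patients in an unknown LSOA simply receive no area-level fields.
        area_values = area_lookup.get(row["lsoa"], {})
        enriched.append({**row, **area_values})
    return enriched


enriched_patients = attach_area_attributes(patients, lsoa_attributes)
```

Every patient living in the same LSOA receives identical area-level values, which is why such attributes describe the neighbourhood rather than the individual.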

Data access

All NHS organisations, including general practices, across Kent & Medway had entered into Joint Data Controller arrangements, which include a common process for safe, secure and lawful access to their data in the KID for population health analytics, including work such as ours. This process is administered by a system-wide oversight group representing the organisations, called the Kent & Medway Shared Health & Care Analytics Board. Patient-level consent does not apply in this context, as the dataset is historical, fully pseudonymised and de-identified. Because these arrangements cover access to the data in the KID, its analysis and the sharing of findings, no separate ethical approval was required.

Data pre-processing

The dataset contained missing values mainly in the attribute named ‘ethnicity’, as shown in Table 1, despite considerable work to capture ethnicity coding from various sources. We therefore excluded this attribute from the model, as we felt it was not appropriate to impute it using an average value or a synthetic data derivative, although this is common practice. The other attributes had no missing or outlier values, so no further transformations were made to the remainder of the dataset.

Table 1 Baseline characteristics of cohort groups.

The data attributes are grouped into life history, symptoms, diagnostics, treatment, and end-of-life care based on the stage at which the data are collected, as depicted in Fig. 1. To prepare the model for predicting patients’ risk ratios, we extracted only the essential attributes from the dataset. These columns were selected based on their potential to provide valuable predictive information. We specifically focused on data concerning the pathways leading to the diagnosis of lung cancer as it held valuable insights regarding the associated causes and symptoms. Attributes related to cancer diagnosis or data related to 2-week wait urgent referrals, appointments to see an oncologist, Chest X-Rays and Low Dose Computerised Tomography scans for confirming diagnosis, treatment options such as chemotherapy and radiotherapy and mortality were omitted. These attributes were excluded from the dataset because they were deemed as non-predictive elements that did not offer significant insights into the associated risks of a positive diagnosis of lung cancer. We excluded the above diagnostics and treatment elements up to 12 months before the date of diagnosis.

Fig. 1: Pathways leading up to and beyond a Lung Cancer Diagnosis for patients.

The model uses only life history and symptoms as predictive elements for a lung cancer diagnosis. Diagnostic elements, treatment and end of life care features were omitted.

Relative risks (RR) were calculated for all the variables and were used to determine the important attributes and for categorisation. RR is the ratio of the incidence of an event (lung cancer) with an exposure (e.g., smoking) to the incidence of the same event without the exposure. For example, the relative risk of developing lung cancer in smokers (the exposed group) versus non-smokers (the non-exposed group) is the probability of developing lung cancer for smokers divided by the probability of developing lung cancer for non-smokers. All characteristics in the individual datasets, such as medications, events, tests, demographic qualities and wider determinants of health, were tested and risk-scored using this methodology. To reduce the number of categories, we collapsed them into meaningful groupings, informed by the higher relative risk of related variables. For instance, respiratory disorders such as COPD and asthma, each of which has numerous diagnosis codes, were built up into a simple three-state option: Yes, No or Has Familial History. Other high-dimensional features, such as smoking history and activity, were ranked into similar groups by creating scores.
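The relative risk definition above can be written as RR = (cases among exposed / all exposed) / (cases among unexposed / all unexposed). A minimal sketch follows; the function name and the counts are illustrative assumptions and do not reproduce any value from Table 2.

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Incidence in the exposed group divided by incidence in the unexposed group."""
    if exposed_total <= 0 or unexposed_total <= 0:
        raise ValueError("group sizes must be positive")
    if unexposed_cases == 0:
        raise ValueError("relative risk is undefined with no unexposed cases")
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed


# Illustrative counts only: 90 cases in 10,000 smokers, 10 cases in 10,000 non-smokers.
rr_smoking = relative_risk(90, 10_000, 10, 10_000)  # roughly 9
```

An RR above 1 indicates higher incidence in the exposed group, and an RR below 1 indicates lower incidence.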

Model development

We used feature encoding to reduce the number of states, simplify model development and enhance performance. One-hot encoding and standard scaling were used for the feature encoding [26]. Given the need for a scalar risk score to aid prioritisation of the patients at greatest risk of developing lung cancer within a screening pool, logistic and other categorical models were ruled out. Traditional linear regression was selected as the initial candidate model, with the aim of detecting lung cancers early and thereby improving outcomes over and above the current screening protocol for lung cancer in the UK.
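As an illustration of the two encoding steps named above, the sketch below implements one-hot encoding and standard scaling with the Python standard library only. It is not the authors' pipeline: the helper names and example values are assumptions, and the population standard deviation is used because the text does not state which form was applied.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Map each categorical value to a 0/1 indicator row (columns in sorted order)."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, rows


def standard_scale(values):
    """Centre values on zero and scale them to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant column carries no information
    return [(value - mu) / sigma for value in values]


# Hypothetical three-state respiratory attribute and a numeric age column.
respiratory = ["Yes", "No", "Has Familial History", "No"]
categories, encoded = one_hot(respiratory)
scaled_ages = standard_scale([45, 60, 72, 63])
```

Each encoded row contains exactly one 1, and the scaled column has zero mean, so attributes measured on different scales contribute comparably to a linear model.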

Sixteen candidate attributes were identified using a combination of methods: analysis of the data, proposals from clinical experts and published literature [27, 28]. We developed a loop within Python [30] to evaluate all possible combinations of these 16 attributes, each containing n attributes where n ranged from 2 to 16, for their ability to detect lung cancer. For each combination, we split the entire population data into a 70% training dataset and a 30% validation (test) dataset [29], built a linear regression model on the n attributes using the 70% dataset, and applied it to the 30% dataset to obtain the output, namely the number of lung cancer cases detected. This was repeated one hundred times (Fig. 2) to create multiple outputs that could be averaged to test the models’ repeatability and for onward evaluation. We then employed bootstrapping [31] to test how well the model generalised across randomised populations. In each run, both the 70% training set and the 30% validation set were re-randomised to eliminate potential biases or chance influences and to provide comprehensive average performance statistics for all models. In each model run the TLHC eligibility criteria were applied and the number of cancers counted. This was compared with the number of cancers among the highest risk-scored patients identified by the prediction model, keeping both screening cohort sizes equal.
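The search over attribute combinations and the repeated 70/30 randomisation can be sketched as follows. This is a simplified illustration rather than the study code: the attribute names are placeholders, the `evaluate` callback stands in for fitting and scoring a linear regression model, and the seed is arbitrary.

```python
import itertools
import random

# Placeholder names; the 16 candidate attributes are not all listed in the text.
ATTRIBUTES = [f"attr_{i:02d}" for i in range(1, 17)]


def attribute_subsets(attributes, min_size=2):
    """Yield every combination containing min_size up to all of the attributes."""
    for size in range(min_size, len(attributes) + 1):
        yield from itertools.combinations(attributes, size)


def random_split(record_ids, train_fraction=0.7, rng=None):
    """Shuffle the ids and split them into training and hold-out lists."""
    if rng is None:
        rng = random.Random()
    shuffled = list(record_ids)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def mean_over_runs(evaluate, record_ids, n_runs=100, seed=0):
    """Average evaluate(train_ids, test_ids) over freshly randomised splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        train_ids, test_ids = random_split(record_ids, rng=rng)
        scores.append(evaluate(train_ids, test_ids))
    return sum(scores) / len(scores)


# Number of candidate combinations of 2 to 16 attributes: 2**16 - 1 - 16.
n_subsets = sum(1 for _ in attribute_subsets(ATTRIBUTES))
```

With 16 attributes and a minimum size of 2 there are 65,519 candidate combinations, so exhaustive enumeration remains feasible in a plain loop.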

Fig. 2: Steps involved from the beginning to the end of the study process.

This spans from extracting relevant data from the KID to comparing the number of lung cancer cases detected using the most successful model and the criteria used in the TLHC programme.

Model evaluation

The output of this model is not binary (with or without cancer) but a continuum of risk of developing the cancer. As stated in the dataset description section, the dataset also did not contain person-level information on all the variables included in the model. Hence, the traditional parameters used to express the validity of a screening test, such as sensitivity, specificity, positive predictive value, negative predictive value, area under the curve and likelihood ratios, are not applicable. Instead, we reasoned that if the model is working efficiently, it should find more lung cancer cases within a screening pool of a given size than the screening pilots currently ongoing in England. To baseline our evaluation, we therefore compared the output of the model against the current screening population for the TLHC programme [32]. Patients meeting the following three criteria will be invited for screening:

  • are over 55 but younger than 75 years old

  • are registered with a GP in the area where the scheme is operating

  • have ever smoked, and this is recorded with the GP.

The number of cases found using the TLHC criteria was then compared with the number of cases identified by the linear regression model using the top-performing combination of attributes.
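The like-for-like comparison described in this section can be sketched as below. The field names and toy records are hypothetical, and treating age 55 as eligible is an assumption, since the criterion is worded as "over 55" here but as "aged between 55 and 74" in the Introduction.

```python
def tlhc_eligible(person):
    """TLHC-style criteria: aged 55 to 74, ever smoked, registered with a GP in the area."""
    return 55 <= person["age"] <= 74 and person["ever_smoked"] and person["gp_in_area"]


def compare_yield(people):
    """Count cancers among TLHC-eligible people and in an equally sized top-risk group."""
    eligible = [p for p in people if tlhc_eligible(p)]
    cohort_size = len(eligible)
    # Rank everyone by model risk score and keep as many people as TLHC would screen.
    top_risk = sorted(people, key=lambda p: p["risk_score"], reverse=True)[:cohort_size]
    return {
        "cohort_size": cohort_size,
        "tlhc_cases": sum(p["lung_cancer"] for p in eligible),
        "model_cases": sum(p["lung_cancer"] for p in top_risk),
    }


# Toy records for illustration only.
people = [
    {"age": 60, "ever_smoked": True, "gp_in_area": True, "risk_score": 0.9, "lung_cancer": 1},
    {"age": 70, "ever_smoked": True, "gp_in_area": True, "risk_score": 0.2, "lung_cancer": 0},
    {"age": 80, "ever_smoked": True, "gp_in_area": True, "risk_score": 0.8, "lung_cancer": 1},
    {"age": 50, "ever_smoked": False, "gp_in_area": True, "risk_score": 0.1, "lung_cancer": 0},
]
summary = compare_yield(people)
```

Holding the cohort size fixed means any difference in case counts reflects who is selected, not how many people are screened.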

Results

Selected characteristics of cohorts included in the study are shown in Table 1.

Relative risks for the attributes included in the model are presented in Table 2.

Table 2 Relative risks for the attributes and various levels of exposures included in the model.

The family history of cancer attribute also includes lung cancer. Some attributes were associated with an increased risk of lung cancer and others with a lower risk. As expected, key attributes showing a higher risk included older age, lack of physical activity, COPD, hypertension, other cancers and family history of other cancers, TB and family history of TB, and financial status. Attributes associated with lower risk included intense physical activity, younger age, never having smoked and higher socioeconomic status. As the results are from univariate linear regression, the effect of confounding is apparent; for example, hypertension is associated with age.

The ten combinations of attributes that performed best in identifying lung cancers were selected from many thousands of combinations (Table 3). The selected combinations contained between 7 and 11 attributes. The top-performing combination included the following attributes: age; activity score; smoking score; any respiratory illness; hypertension; cancer; and tuberculosis.

Table 3 Attributes included in the best performing models and cancer cases detected.

We needed to test the performance of the 7-attribute combination, henceforth referred to as the Kent & Medway risk prediction tool, against the TLHC eligibility criteria. By applying these three criteria to the 30% test population, we identified on average 56,663 people (the screening cohort) who would be eligible under the TLHC criteria. Among these, 581 lung cancer cases were recorded. We then applied the Kent & Medway risk prediction tool to the same 30% test population, which produced a lung cancer risk score for every individual. From this list, we identified the 56,663 people with the highest risk scores, and within this population 822 lung cancer cases were recorded. On average, this was a benefit of 41.4% over and above the contemporaneous approach.
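The size of this benefit can be checked with simple arithmetic on the averaged counts reported above. The helper names in this sketch are illustrative. Dividing the two averaged counts gives an uplift of roughly 41.5%, marginally different from the reported 41.4%; one possible explanation, which is an assumption here, is that the reported figure averages the uplift across individual runs.

```python
# Averaged counts reported for the 30% test population.
cohort_size = 56_663
tlhc_cases = 581
model_cases = 822


def yield_per_1000(cases, screened):
    """Cancers found per 1,000 people screened."""
    return 1000 * cases / screened


def relative_uplift(new_cases, baseline_cases):
    """Additional cases found, as a fraction of the baseline count."""
    return (new_cases - baseline_cases) / baseline_cases


extra_cases = model_cases - tlhc_cases                  # 241 additional cases
tlhc_yield = yield_per_1000(tlhc_cases, cohort_size)    # about 10.3 per 1,000
model_yield = yield_per_1000(model_cases, cohort_size)  # about 14.5 per 1,000
uplift = relative_uplift(model_cases, tlhc_cases)       # about 0.415
```

Expressed as yield, the same screening capacity finds roughly four more cancers per 1,000 people screened.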

Discussion

Our study is an attempt to develop a lung cancer risk prediction tool that identifies sections of the population at higher risk of developing lung cancer. We utilised data at both person and spatial level, covering social, demographic, lifestyle and clinical features, and used ML to achieve our objective. We initially identified 16 attributes that could predict the population at higher risk of lung cancer. Our objective was to increase the power of cancer detection in a defined population, as the current TLHC eligibility criteria [32] are too broad and blunt. By running simultaneous models with bootstrapping, we were able to test numerous combinations of attributes over tens of thousands of model runs, which yielded the best model with 7 attributes. We adopted a linear regression model, unlike others in the lung cancer prediction literature who have employed a suite of models [33, 34]. This is because our objective was to identify a cohort of people at higher risk of lung cancer so that they can be targeted for screening. Many known attributes have a linear association with the risk of lung cancer. Furthermore, the lung cancer risk score, which is our main outcome of interest, is a continuous variable, and hence logistic regression is not applicable here. The use of ML has been proposed and adopted in reading computed tomography images [34]. In our study, however, we used data points derived from routine linked administrative datasets, which contained information on every patient irrespective of their clinical characteristics, to predict their risk of lung cancer. It may be surprising that the data on smoking status were almost complete, as this is not usually the case, especially in primary care, although recording shows continuous improvement [35]. There are two potential reasons for such high completeness in our study. First, the KID is a linked dataset, which enables smoking status to be captured from multiple points of care. Second, owing to its specialist nature, considerable effort and resources have been spent retrospectively ensuring that the data are as complete as possible so that epidemiological research can be undertaken at the population level [22].

Clinical utility of the work

The product of this work has immediate clinical implications and thus has the potential to improve patient care and resource utilisation. As the model outperforms the standard, wider TLHC eligibility criteria, it could help to detect up to 40% more cancers. We are currently exploring how best to incorporate it as a screening and early diagnosis intervention. Two options are under consideration: first, providing a more comprehensive and refined screening model based on our risk tool in place of the TLHC eligibility criteria; and second, having the GP calculate the risk score for each patient during a consultation, similar to the Framingham cardiac risk score [36], and use it to guide further action. Under the first option, we can further refine the risk group for screening, thereby increasing cancer detection and saving scarce cancer diagnostic and treatment resources. We intend to incorporate the tool into the management information system of the early cancer diagnosis team at the local hospital as a pilot and then to roll it out across a wider geographical area. The first author has already secured agreement in principle for this from the local cancer clinical and managerial leaders.

Strengths

We used a place-based linked dataset produced entirely by a local health system, whose primary use was for commissioning intelligence and health care planning. It can paint a picture of the entire population, as it contains information from general practice, community health services, mental health services and hospital services. Furthermore, it includes integrated spatial-level information on key socioeconomic factors and the extent of deprivation. This makes it a more powerful repository for developing a risk prediction tool than sources that rely only on electronic patient clinical records [37]. Our data are complete in comparison with those of Callender et al. [38], which contain a large number of missing values. We generated relative risks at a very granular level of detail in order to develop our sixteen aggregated attributes. We established a strong partnership of cancer clinicians, Public Health physicians, epidemiologists, ML experts and leaders from the cancer alliance, who were involved from the inception of the project to its completion. This helped us to incorporate varying perspectives. Key stakeholders’ views were constantly sought and acted upon during this work, including through regular meetings with the early diagnosis team, the digital cancer alliance board, the shared health and care analytics board and the regional applied research consortium digital innovation group. Patients and the public are represented in most of these groups in order to ensure that there is support for this initiative.

Limitations

A few limitations of our study need to be acknowledged. All seven variables included in the model had complete data, although this needs to be treated with caution. For the activity score, we used data at a population level, i.e. the lower layer super output area, which does not reflect the score for an individual patient. Data on socioeconomic status were also only available at a spatial level. Although using data at a geographical or spatial level gives us the advantage of complete data with no missing values, one needs to be cognisant of the limitations of this approach and the well-documented ecological fallacy [39]. Four of the variables in the final model were purely clinical conditions: Any Respiratory Illness, Hypertension, Cancer and Tuberculosis. It is extremely unlikely that such important diagnoses would be left uncoded in both primary and secondary care. It is generally agreed that if such a clinical diagnosis does not appear on the patient record, the patient does not have the condition, as it is not current practice to code the absence of a condition. We recognise that this may not be true for all patients, but it is unlikely to have a significant impact upon our longitudinal study results. For both passive smoking and family history of cancer, we assumed that if the information was not coded then the individual did not have that exposure, although this may not always be accurate. As our analysis included over a million records, any under- or over-estimation is likely to be random and is unlikely to have a major impact on the results. Ethnicity was not included in the model because the data were incomplete; we will ensure that ethnicity is included in future work. Data included in the study extend only up to 2019.

We wish to acknowledge that we have not used traditional parameters to express the validity of a screening test as this approach is not applicable as explained in the model evaluation section. We have used a different approach to evaluate the model. It is the authors’ belief that the approach adopted in this study still adds useful information to the literature as this method has been seldom applied. This needs to be borne in mind when interpreting the findings and developing any policy approach based upon our findings. Due to changes in commissioning arrangements, the KID was rendered static and data were not updated after 2019. We do not anticipate any weakening of the power of the prediction tool due to non-inclusion of more recent data. This study was undertaken in Kent & Medway in the southeast of England. Hence the question of generalisability across the United Kingdom needs to be considered. In our view, it is unlikely that the population and the strength of association between the attributes and lung cancer are so different elsewhere that the results will not be applicable. However, this may not be true for an international comparison. Another important limitation worthy of note is that applying similar machine-learning approaches using other databases with different characteristics may result in a less sensitive outcome. Hence, before our approach is adopted this needs to be tested on a much larger patient population under different settings.

Conclusion

In this paper, we have demonstrated the useful application of Machine Learning in developing a risk score for lung cancer using a large, place-based linked data set. We involved multidisciplinary stakeholders throughout this work, including patients and the public. Our risk prediction tool is superior to the eligibility criteria currently in use in the pilot sites for the TLHC Programme. This is a good example where local experts in fields as diverse as AI, ML, clinical oncology, Public Health and Epidemiology came together to produce an innovative solution to improve patient care and save scarce health care resources.