Artificial intelligence outperforms standard blood-based scores in identifying liver fibrosis patients in primary care

For years, hepatologists have sought non-invasive methods able to detect significant liver fibrosis. However, no previous algorithm using routine blood markers has proven clinically appropriate in primary care. We present a novel approach based on artificial intelligence, able to predict significant liver fibrosis in low-prevalence populations using routinely available patient data. We built six ensemble learning models (LiverAID) of different complexities using a prospective screening cohort of 3352 asymptomatic subjects. A total of 463 patients were at a significant risk that justified performing a liver biopsy. Using an unseen hold-out dataset, we conducted a head-to-head comparison with conventional methods: standard blood-based indices (FIB-4, Forns and APRI) and transient elastography (TE). LiverAID models appropriately identified patients with significant liver stiffness (> 8 kPa) (AUC of 0.86, 0.89, 0.91, 0.92, 0.92 and 0.94, and NPV ≥ 0.98), and had significantly better discriminative ability (p < 0.01) than conventional blood-based indices (AUC = 0.60–0.76). Compared to TE, LiverAID models showed a good ability to rule out significant biopsy-assessed fibrosis stages. Given the ready availability of the required data and the relatively high performance, our artificial intelligence-based models are valuable screening tools that could be used clinically for early identification of patients with asymptomatic chronic liver diseases in primary care.

The prevalence of fatty liver disease is increasing at the same rate as the epidemics of obesity and type 2 diabetes mellitus 1 . As the condition develops, inflammation and fibrosis of the liver appear and can progress to cirrhosis with associated morbidity and mortality. Early identification of asymptomatic patients can prevent undetected liver fibrosis from progressing slowly and silently to severe and life-threatening chronic liver disease. Various tools for the non-invasive assessment of fibrosis have been developed in recent decades. They fall into three categories: 1) image-based technologies, e.g. transient elastography (TE), which measures liver stiffness; 2) indirect markers, e.g. the Fibrosis 4 index (FIB-4); and 3) direct markers of fibrosis, e.g. the Enhanced Liver Fibrosis test (ELF). One of the main advantages of the indirect blood-based biomarkers and scores is that they can be used to evaluate liver fibrosis in most clinical settings, since they are easy to use, quick and inexpensive. While these tools were originally developed, largely fueled by the hepatitis C era, to diagnose significant fibrosis up to compensated advanced chronic liver disease, they have primarily been validated in secondary and tertiary healthcare. Identifying those with significant liver fibrosis in primary care is a clinical challenge, as the vast majority are asymptomatic and often have normal liver function tests. To date, no tools have been developed for a primary care population, and only a few have been assessed or validated in one.
Up to now, algorithms based on indirect blood markers have typically been developed using approaches that are standard in the epidemiological literature, such as parametric regression methods where the optimal set of predictors is identified by stepwise selection [2][3][4][5] . With the advent of computer systems and digitalization, large amounts of healthcare data are routinely collected and stored, but these data are frequently underused and undervalued. Artificial intelligence (AI) could help to improve the accuracy with which we can detect liver fibrosis by processing and automatically learning associations in these complex healthcare data. Among the different AI techniques, ensemble learning is of particular interest, because it trains multiple machine learning algorithms, determines the optimal weights for combining their predictions, and results in an ensemble model that often outperforms any single machine learning algorithm. Despite the potential benefits, no studies have explored the use of ensemble learning methodology to identify significant liver fibrosis.
In this study, we developed a set of AI algorithms with different complexities, based on ensemble learning methodology, able to predict clinically significant liver stiffness (a well-established surrogate of biopsy-assessed liver fibrosis), using patient data that can be available at a low cost in primary care. We assessed their diagnostic performance, and conducted a head-to-head comparison with standard indirect indices (FIB-4, Forns index (Forns) and AST to platelet ratio index (APRI)). We then evaluated the ability of the ensemble learning models to effectively reduce the number of patients that undergo unnecessary TE investigations. Finally, we addressed the question as to whether a negative ensemble learning model result, calculated based on affordable patient data, could be used with confidence to rule out significant biopsy-assessed fibrosis, and so its potential role in eliminating unnecessary liver biopsies.

Study population.
We performed a prospective cohort study including patients from the Region of Southern Denmark between 2013 and 2020. The study population (n = 3460) consisted of subjects at risk of NAFLD (~ 43% of the participants), subjects at risk of alcohol-related liver disease (ALD) (~ 35%), and subjects randomly selected from the general population (~ 22%). We recruited subjects via three main channels: 1) invitation letters sent by means of e-boks, the official digital communication route between public authorities and Danish citizens, to randomly selected Danes from the Odense University Hospital's catchment area; 2) three alcohol rehabilitation centers; and 3) in- and out-patients at the Department of Gastroenterology and Hepatology at the Odense University Hospital of Southern Denmark. The non-participation rate was approximately 77% among subjects approached using channel 1 (i.e. general population), and 35% among subjects approached via channels 2 and 3.
At recruitment, none of the subjects had known liver diseases, and they were asymptomatic. We have published detailed study methods elsewhere 6,7 . All methods were carried out in accordance with relevant ethical guidelines and regulations based on the Declaration of Helsinki. We obtained informed consent in writing from each patient. The ethics committee in the Region of Southern Denmark (S-20120071; S-20170087) approved the study.
Input and output variables. We collected data on demographics, physical examination, clinical and laboratory parameters, and questionnaires, alongside comorbidities and medications, resulting in 233 potential input variables (Appendix). Liver stiffness measurement (LSM), assessed by TE using FibroScan, was dichotomized at an 8 kPa threshold, in line with the definition of clinically significant fibrosis 8,9 .
The objective was to build a model that, when applied to undiagnosed patients that are new to the model (unseen data), correctly classifies them into two classes: patients whose liver stiffness is expected to be > 8 kPa (i.e. "clinically significant LSM") and patients with expected liver stiffness ≤ 8 kPa (i.e. "not clinically significant LSM").
Our model should ideally strike a balance between model complexity and performance. Therefore, we built six different ensemble learning models (LiverAID models) of increasing complexity in terms of the number of input variables required by the model (Fig. 1). LiverAID XXS relies exclusively on the 9 indirect blood-based biomarkers used in the calculation of standard indices (FIB-4, Forns, APRI and LiverTrail) 2-4,6 (Fig. 1).
The ensemble learning models. We generated high-performance classifiers following an ensemble learning strategy 10 . Based on De Lell et al. 11 , each of our proposed ensemble learning models was built to combine and optimize the output from four basic learning algorithms (random forest, elastic net, bagging classification trees and support vector machine) (Appendix).
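As a rough illustration of the ensemble idea described above (combining the outputs of several base learners with optimized weights), the following sketch searches a coarse grid of non-negative weights that sum to one and keeps the combination with the lowest Brier score. The function names, the grid search and the Brier-score criterion are assumptions made for illustration only; the text does not state how the weights were optimized, and the study's own analyses were run in R.

```python
from itertools import product

def brier(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def combine(base_probs, weights):
    """Weighted average of base-learner probabilities, one value per patient."""
    return [sum(w * p for w, p in zip(weights, row)) for row in zip(*base_probs)]

def fit_weights(base_probs, labels, steps=10):
    """Grid-search convex weights (hypothetical stand-in for the real optimizer)."""
    best_w, best_loss = None, float("inf")
    for combo in product(range(steps + 1), repeat=len(base_probs)):
        if sum(combo) != steps:
            continue  # keep only weight vectors that sum to one
        w = [c / steps for c in combo]
        loss = brier(combine(base_probs, w), labels)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss

# Toy validation-set predictions from four base learners (made-up numbers).
labels = [1, 0, 0, 1, 0, 0]
base_probs = [
    [0.9, 0.2, 0.1, 0.7, 0.3, 0.2],  # e.g. random forest
    [0.6, 0.4, 0.3, 0.6, 0.5, 0.4],  # e.g. elastic net
    [0.8, 0.1, 0.2, 0.9, 0.2, 0.1],  # e.g. bagged classification trees
    [0.7, 0.3, 0.4, 0.5, 0.4, 0.3],  # e.g. support vector machine
]
weights, loss = fit_weights(base_probs, labels)
print(weights, round(loss, 4))
```

Because every single-learner weighting is part of the grid, the combined model in this sketch can never score worse on the fitting data than the best individual learner, which is the intuition behind the claim that ensembles often outperform their components.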
With a view to reliably assessing the generalization error, we performed a series of data splits (Fig. 2). We randomly extracted a subset (10%) of the data (hold-out dataset), which was not used in any way for training, validation or testing. This unseen data was exclusively reserved to perform the final assessment of the model's predictive performance. We applied a repeated random subsampling (RRS) strategy in which the remaining dataset was repeatedly and randomly split (resulting in 5 repetitions) into training (60%), validation (20%) and testing (20%). Synthetic samples were generated, only in the training dataset, to compensate for class imbalance. The optimal values for the hyperparameters of each model were obtained using the training dataset (Fig. 2). The validation datasets were then used to optimize the probability threshold value that resulted in an NPV ≥ 98%, while the testing datasets were used for a preliminary estimation of the generalization error. Finally, the predictive performance of each model was evaluated on the completely unseen hold-out dataset.
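The splitting scheme above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names and seeds are hypothetical, and plain random duplication of minority-class cases is used as a simple stand-in for the synthetic-sample generation mentioned in the text, whose exact method is not specified here.

```python
import random

def split_indices(n, seed, holdout_frac=0.10, reps=5, fracs=(0.6, 0.2, 0.2)):
    """Return hold-out indices plus `reps` random train/validation/test splits."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_hold = round(n * holdout_frac)
    holdout, rest = idx[:n_hold], idx[n_hold:]  # hold-out is never reused
    splits = []
    for _ in range(reps):
        shuffled = rest[:]
        rng.shuffle(shuffled)
        n_train = round(len(shuffled) * fracs[0])
        n_val = round(len(shuffled) * fracs[1])
        splits.append((shuffled[:n_train],
                       shuffled[n_train:n_train + n_val],
                       shuffled[n_train + n_val:]))
    return holdout, splits

def oversample_minority(train_idx, labels, rng):
    """Balance classes in the training set only, by resampling with replacement.

    Assumption: random duplication replaces the paper's synthetic samples.
    """
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(train_idx)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(train_idx) + extra

# Toy cohort of 1000 subjects with roughly 13% positives (made-up labels).
rng = random.Random(0)
labels = [1 if rng.random() < 0.13 else 0 for _ in range(1000)]
holdout, splits = split_indices(len(labels), seed=1)
train, val, test = splits[0]
balanced_train = oversample_minority(train, labels, rng)
```

Oversampling is applied after the split and only to the training indices, so that validation, testing and hold-out data keep the original class balance.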
The ability of standard indices of liver fibrosis (APRI, FIB-4 or Forns index) to predict significant liver stiffness (i.e. LSM ≤ 8 kPa vs. LSM > 8 kPa) was evaluated by four logistic regression models built using the same training datasets as in the LiverAID models: three univariate models, using APRI, FIB-4 or Forns index, respectively, as the predictor; and one multivariate model with all three indices as predictors. We used the hold-out dataset to perform a head-to-head comparison of the predictive performance of each LiverAID model and that of the logistic regression models. We conducted all analyses in R.
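For reference, the three standard indices used as predictors above can be computed from routine blood values as in the sketch below. The formulas are the commonly published definitions and are not restated in this article, so they should be checked against the cited sources before use; the parameter names, units and the example patient are assumptions. The rule-out cut-offs are the ones quoted in the Results.

```python
import math

def fib4(age_years, ast_u_l, alt_u_l, platelets_1e9_l):
    """FIB-4 = age x AST / (platelets x sqrt(ALT))."""
    return (age_years * ast_u_l) / (platelets_1e9_l * math.sqrt(alt_u_l))

def apri(ast_u_l, ast_uln_u_l, platelets_1e9_l):
    """APRI = (AST / upper limit of normal) / platelets x 100."""
    return (ast_u_l / ast_uln_u_l) / platelets_1e9_l * 100

def forns(age_years, platelets_1e9_l, ggt_u_l, cholesterol_mg_dl):
    """Forns index, using the commonly published coefficients."""
    return (7.811
            - 3.131 * math.log(platelets_1e9_l)
            + 0.781 * math.log(ggt_u_l)
            + 3.467 * math.log(age_years)
            - 0.014 * cholesterol_mg_dl)

# Low (rule-out) cut-offs as quoted in this article.
LOW_CUTOFFS = {"FIB-4": 1.25, "Forns": 4.1, "APRI": 0.5}

# Hypothetical patient, values chosen only to exercise the functions.
scores = {
    "FIB-4": fib4(age_years=50, ast_u_l=40, alt_u_l=40, platelets_1e9_l=200),
    "Forns": forns(age_years=50, platelets_1e9_l=200, ggt_u_l=30,
                   cholesterol_mg_dl=200),
    "APRI": apri(ast_u_l=40, ast_uln_u_l=40, platelets_1e9_l=200),
}
# A score below the low cut-off is read as "significant fibrosis ruled out".
ruled_out = {name: value < LOW_CUTOFFS[name] for name, value in scores.items()}
```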

Liver biopsy.
A portion of the study population was at a significant risk of liver disease that justified performing a liver biopsy 6,7 . We performed a percutaneous liver biopsy on these patients 6,7 and obtained the Kleiner liver fibrosis stage (F0 to F4). We further evaluated the ability of 1) LSM; 2) the APRI, FIB-4 and Forns approaches; and 3) LiverAID models to rule out significant liver fibrosis (defined as Kleiner stages F2-F4) in the hold-out dataset, using biopsy-assessed fibrosis stage as the reference standard. In this evaluation we mapped liver stiffness ≤ 8 kPa to liver biopsy stages F0 and F1, and liver stiffness > 8 kPa to liver biopsy stages F2 to F4, where liver stiffness was measured in case (1) and predicted in cases (2) and (3).


Results
Study population. Subject demographics, LSM results, biopsy-assessed fibrosis stage and serum marker measurements are presented in Table 1.

Head-to-head comparison of LiverAID models vs standard blood-based indices in predicting significant liver stiffness (LSM > 8 kPa).
The accuracy, sensitivity, specificity, PPV and NPV obtained using 1) a cut-off approach for standard blood-based indices, 2) logistic regression models with standard blood-based indices as predictors, and 3) ensemble learning models are shown in Table 2. In the first approach, low cut-off values (i.e. those used to rule out the presence of significant fibrosis) of FIB-4 = 1.25 3 , Forns = 4.1 2 and APRI = 0.5 4 were used, which resulted in 9, 27 and 10 missed diagnoses (patients with significant LSM) for every 100 patients with FIB-4 < 1.25, Forns < 4.1 or APRI < 0.5, respectively. In the second and third approaches, probability thresholds for each model were optimized in the validation dataset and then applied directly to the hold-out dataset to evaluate the model's performance. Our goal, in all logistic and LiverAID models, was to determine probability cut-off values that result in NPV ≥ 0.98 in the validation dataset (i.e. ≤ 2 patients with missed diagnoses for every 100 patients with a negative test result). However, logistic regression models could not achieve this target NPV, and in these cases thresholds were chosen to obtain the maximum possible NPV. This resulted in an NPV of the logistic models in the hold-out dataset of 0.91, 0.88, 0.91 and 0.92, and a PPV of 0.31, 0.57, 0.36 and 0.33, when using as predictors […] S, M, L and 4XL, respectively) (Table 2). Diagnostic performance measures for the prediction of significant liver stiffness (LSM > 8 kPa) in the hold-out dataset, for each subpopulation (i.e. subjects at risk of NAFLD, subjects at risk of alcohol-related liver disease, and subjects randomly selected from the general population), are shown in Table 3.
The AUC in the hold-out dataset of univariate logistic models using standard indices as predictors was 0. […] Table 4). When comparing LiverAID models among themselves, the difference in classification performance between LiverAID XXS and XS was not statistically significant (p > 0.10). Similarly, LiverAID S, M and L did not significantly differ from each other (p > 0.10). The classification performance of LiverAID S, M and L was, in the great majority of the cases/repetitions (93%), significantly better than that of LiverAID XXS (p < 0.05), and in most cases/repetitions significantly better than that of LiverAID XS (p < 0.05), but this last result was not consistent throughout (Table 4 and Appendix). Finally, LiverAID 4XL clearly outperformed XXS and XS (p < 0.05), but it showed contrasting results when compared to LiverAID S, M and L (p = 0.014-0.219) (Table 4 and Appendix).
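The threshold rule described above (reach NPV ≥ 0.98 on validation data, otherwise fall back to the maximum attainable NPV) can be sketched as follows. The function names and toy data are hypothetical, and picking the highest qualifying threshold (so that as many patients as possible are ruled out) is an assumption; the text does not say how ties between qualifying thresholds were resolved.

```python
def npv_at(threshold, probs, labels):
    """NPV when patients with predicted probability < threshold test negative."""
    negatives = [y for p, y in zip(probs, labels) if p < threshold]
    if not negatives:
        return None  # nobody is called negative, so NPV is undefined
    return negatives.count(0) / len(negatives)

def choose_threshold(probs, labels, target_npv=0.98):
    """Return (threshold, target_met) chosen on validation data."""
    scored = [(t, npv_at(t, probs, labels)) for t in sorted(set(probs))]
    scored = [(t, v) for t, v in scored if v is not None]
    if not scored:
        raise ValueError("no threshold yields any negative prediction")
    meeting = [t for t, v in scored if v >= target_npv]
    if meeting:
        return max(meeting), True  # most rule-outs while meeting the target
    best_t, _ = max(scored, key=lambda tv: tv[1])
    return best_t, False  # fallback: maximum attainable NPV

# Toy validation predictions (made-up numbers); label 1 means LSM > 8 kPa.
val_probs = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.60, 0.80]
val_labels = [0, 0, 0, 0, 1, 0, 1, 1]
threshold, met = choose_threshold(val_probs, val_labels)

# A weaker model that misses a case at low probability cannot reach the target.
weak_labels = [1, 0, 0, 0, 1, 0, 1, 1]
weak_threshold, weak_met = choose_threshold(val_probs, weak_labels)
```

The chosen threshold is then applied unchanged to the hold-out data, which is why hold-out NPVs can differ slightly from the validation target.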
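For readers less familiar with the AUC values compared above, the sketch below computes the AUC directly from its rank interpretation: the probability that a randomly chosen patient with LSM > 8 kPa receives a higher score than a randomly chosen patient with LSM ≤ 8 kPa. The function name and the toy data are illustrative assumptions, and the sketch does not reproduce the statistical test behind the reported p-values.

```python
def auc(scores, labels):
    """AUC as P(positive score > negative score); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ranked correctly.
toy_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```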

Discussion
Our study demonstrates that ensemble learning models that use routinely available clinical data as inputs are able to appropriately detect clinically significant liver fibrosis in low-prevalence settings. In comparison to models using traditional regression techniques and standard blood-based indices, our ensemble learning strategy demonstrated significantly better diagnostic performance.
Model selection techniques should find an optimal trade-off between the ability of the model to fit data and the complexity the model requires to do so. In relation to this, much emphasis has been placed in the literature on obtaining predictive models for liver fibrosis that are "simple" 12 . Strictly speaking, complexity can be separated into a model-complexity dimension and an inputs-complexity dimension. Briefly, the model-complexity dimension pertains to how complex the model itself is, which affects, among other things, the prediction time. In practice, differences in prediction time are inconsequential: with the current computational power of personal computers and the availability of cloud computing, the predicted fibrosis status of each new patient can be available to the clinician at the click of a button in all approaches considered in this article. The inputs-complexity dimension pertains to how many input parameters the model requires for each new patient in order to make a prediction, and how costly and feasible it is to obtain these parameters in clinical settings. All our LiverAID models involve objective and readily available laboratory variables and non-invasive clinical information, and none of them requires information from invasive or resource-intensive procedures. In this regard, they are more advantageous than other methods such as ELF or TE. The question that remains is how the LiverAID models compare to each other and to traditional blood-based indices. Whilst acknowledging the very high performance of LiverAID 4XL (AUC = 0.94), this comprehensive model will most likely not be the "method of choice", since it requires the clinician to obtain and enter into the model a large amount of non-automated information from each investigated patient, which is very time-consuming and entails a high risk of introducing operator errors. Placing greater value on simplicity and usability, the remaining models (XXS to L) may then be considered. LiverAID L and M did not perform significantly better than LiverAID S, while LiverAID XS did not outperform LiverAID XXS. Therefore, two models stand out as being especially efficient in their ability to predict significant liver stiffness: LiverAID XXS (AUC = 0.86) and LiverAID S (AUC = 0.91). The main advantage of LiverAID XXS is that it relies exclusively on data from 9 objective laboratory markers obtained from routine blood tests. This could be particularly valuable when aiming to identify people with asymptomatic liver disease in the general population. No demographic or clinical information about the patient is required beyond these 9 serum markers, thereby releasing resources (i.e. the cost of history taking and physical examination) that could be used to increase the population undergoing laboratory testing for initial screening for liver fibrosis. In comparison, LiverAID S relies on only 5 serum markers, but it requires some basic demographic and clinical information about the patient. It should be highlighted, though, that LiverAID S performed significantly better than LiverAID XXS, while still affording a reasonable balance between goodness of fit and the number of parameters required by the model. As to the FIB-4, Forns and APRI regression models, these require data on 2-6 parameters. However, the performance of these traditional approaches was also markedly inferior (AUC = 0.60-0.76 vs. AUC = 0.86-0.91, p = 0.000-0.001).
Table 3. Diagnostic performance measures for the prediction of significant liver stiffness, defined as measured liver stiffness (LSM) > 8 kPa, evaluated using the hold-out (completely unseen) dataset, for each subpopulation: subjects at risk of NAFLD, subjects at risk of alcohol-related liver disease (ALD), and subjects randomly selected from the general population.
One of the most important aims of screening for liver fibrosis is to inform clinical decisions about further diagnostic tests and possibly treatments, by excluding subjects with little or no fibrosis. We addressed the question of how many subjects testing negative with each of the investigated methods nevertheless have significant liver stiffness. Standard blood-based indices, commonly used in clinical practice, performed poorly in identifying candidates at low risk of significant liver fibrosis: 9-27% of patients with scores below the proposed cut-off threshold, and 8-12% of patients with a negative logistic regression result based on FIB-4, Forns and/or APRI as predictors, had LSM > 8 kPa. In comparison, all LiverAID models could achieve an NPV ≥ 0.98. With a negative result overlooking significant LSM in only ≤ 2% of the patients, we could focus on what the test means in patients with positive test results. In a clinical context, a patient with a positive LiverAID test would be referred for further investigations (e.g. TE). A patient classified as positive using LiverAID XXS (i.e. predicted LSM > 8 kPa) has a 22% probability of subsequently obtaining an actual LSM > 8 kPa and a 78% probability of obtaining LSM ≤ 8 kPa (i.e. unnecessary TE examination or overtesting). In the case of LiverAID S, these probabilities are 30% and 70%, respectively. Consequently, using LiverAID S instead of LiverAID XXS translates into an estimated reduction of 8 percentage points (from 78% to 70%) in the proportion of test-positive patients undergoing unnecessary transient elastography. The decision to use LiverAID S vs. LiverAID XXS may thus have economic implications for the healthcare system. These are currently undetermined, since performing cost-effectiveness analyses of alternative health assessment strategies is beyond the scope of this paper. Finally, we evaluated the performance across different clinically relevant subgroups, i.e. subjects at risk of NAFLD, subjects at risk of alcohol-related liver disease (ALD), and subjects randomly selected from the general population. LiverAID models showed adequate performance in each of the three subgroups. In LiverAID XXS and LiverAID S, the NPVs were 0.98, 0.96 and 0.99 for the NAFLD, ALD and general population subgroups, respectively. In LiverAID XXS, the PPVs were 0.30, 0.30 and 0.19 for the NAFLD, ALD and general population subgroups, respectively, while in LiverAID S the PPVs were 0.31, 0.33 and 0.23.
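The overtesting figures quoted above follow directly from the PPV, since the share of test-positive patients whose subsequent TE shows LSM ≤ 8 kPa is 1 − PPV. The short sketch below reproduces that arithmetic per 1000 test-positive patients; the function name is hypothetical, and the comparison deliberately ignores that the two models may flag different numbers of patients as positive.

```python
def unnecessary_te_per_1000(ppv):
    """Expected unnecessary TE examinations per 1000 test-positive patients."""
    return round((1 - ppv) * 1000)

xxs = unnecessary_te_per_1000(0.22)  # LiverAID XXS, PPV from the text
s = unnecessary_te_per_1000(0.30)    # LiverAID S, PPV from the text
difference = xxs - s                 # fewer unnecessary TE per 1000 positives
```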
It is well established that LSM can be used as a surrogate of liver fibrosis, and that TE is a sensitive triage test before biopsy for identifying patients who could forgo biopsy. If the NPV of LiverAID models in predicting significant biopsy-assessed liver fibrosis were sufficiently high in comparison with the NPV of TE, then in clinical practice a negative LiverAID result (instead of a negative TE result) could be used to avoid the need for liver biopsy. In the primary care setting, which is the relevant implementation context for the LiverAID models, a slightly lower NPV of LiverAID compared to TE would also be clinically acceptable. This is because in primary care a clinical decision must be made as to whether a patient should be referred for further examinations or not (unlike in specialized departments, where a decision often has to be made regarding the need for a liver biopsy). Our study showed that, in the subgroup of patients who underwent a liver biopsy, 9% of the patients with a negative TE result had significant biopsy-assessed liver fibrosis (Kleiner stage F2-F4). In comparison, 10% and 15% of the patients with a negative LiverAID XXS and S result, respectively, had Kleiner stage F2-F4 liver fibrosis. Therefore, LiverAID XXS and S showed a relatively good ability to predict the absence of significant biopsy-assessed fibrosis (F2-F4) in patients who were assessed for suspected liver fibrosis using liver biopsy.
Table 4. P-values for the comparison of AUC between LiverAID models and standard blood-based indices in predicting significant liver stiffness (LSM > 8 kPa). Ranges indicate the minimum and maximum P-values from the repeated models (repetitions 1 to 5). The results for each repetition can be found in Appendix.
We must point out that, in our study, a liver biopsy was performed when a subject had an LSM > 8 kPa; therefore, the biopsy subgroup cannot be considered representative of a "low prevalence population". One main limitation of our study is that validation was performed internally, using a random sample of the study population. Although these data were not used at any stage during the model building process, further studies are needed to evaluate whether the models maintain their diagnostic performance in external populations. In the general population, the prevalence of significant liver stiffness (> 8 kPa) has been reported as 2-7.5% 13,14 , while prevalences of 34% among patients with type 2 diabetes and 18.3% among hazardous alcohol users have been reported 8 . The prevalence observed in our study (13%) accords with the fact that, in our cohort, subjects with known risk factors for liver fibrosis are somewhat overrepresented compared to the general population. This is both deliberate and clinically justified, since, in clinical practice, targeting patients with known risk factors is a more effective screening strategy for identifying patients with asymptomatic chronic liver disease 8 . It is worth noting that in a lower-prevalence setting (i.e. general population), the true ability of a negative LiverAID test to rule out significant fibrosis will increase, as will the number of unnecessary TE examinations. For future research, it would be desirable to validate the LiverAID models in a low prevalence cohort with liver biopsy as the reference standard.
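The prevalence effect noted above can be made concrete with Bayes' rule: for fixed sensitivity and specificity, lowering the prevalence raises the NPV and lowers the PPV. In the sketch below the sensitivity and specificity values are hypothetical placeholders, not results from this study; only the prevalences (13% in this cohort, and 5% as a value within the 2-7.5% range reported for the general population) come from the text.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)

SENS, SPEC = 0.90, 0.60  # hypothetical operating point, for illustration only

ppv_cohort, npv_cohort = predictive_values(SENS, SPEC, prevalence=0.13)
ppv_general, npv_general = predictive_values(SENS, SPEC, prevalence=0.05)
```

With these placeholder values, moving from 13% to 5% prevalence increases the NPV while the PPV drops, meaning a larger share of positive results would lead to TE examinations that turn out to be unnecessary.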
In conclusion, we present a set of AI-based models with different complexities, able to successfully predict clinically significant liver fibrosis, using patient data that can be available at a low cost in clinical settings. Two of our ensemble models stand out as being especially efficient, one requiring 9 routine serum markers, and another one requiring 5 routine serum markers and 8 basic demographic/clinical variables. Given the ready availability of the required data, along with the relatively high accuracy in separating patients' risks, our ensemble models seem to be valuable and practical tools that could be used clinically for early identification of patients with asymptomatic chronic liver diseases in primary care.