Application of information theoretic feature selection and machine learning methods for the development of genetic risk prediction models

Jalali-najafabadi, Farideh; Stadler, Michael; Dand, Nick; Jadon, Deepak; Soomro, Mehreen; Ho, Pauline; Marzo-Ortega, Helen; Helliwell, Philip; Korendowych, Eleanor; Simpson, Michael A.; Packham, Jonathan; Smith, Catherine H.; Barker, Jonathan N.; McHugh, Neil; Warren, Richard B.; Barton, Anne; Bowes, John

doi:10.1038/s41598-021-00854-x

Published: 02 December 2021

Farideh Jalali-najafabadi¹,
Michael Stadler¹,
Nick Dand²,
Deepak Jadon³,
Mehreen Soomro¹,
Pauline Ho^1,4,
Helen Marzo-Ortega⁵,
Philip Helliwell⁵,
Eleanor Korendowych⁶,
Michael A. Simpson²,
Jonathan Packham⁷,
Catherine H. Smith⁸,
Jonathan N. Barker⁹,
Neil McHugh⁶,
Richard B. Warren¹⁰,
Anne Barton^1,4,
John Bowes^1,4,
BADBIR Study Group &
BSTOP Study Group

Scientific Reports volume 11, Article number: 23335 (2021)

Abstract

In view of the growth of clinical risk prediction models using genetic data, there is an increasing need for studies that use appropriate methods to select the optimum number of features from a large number of genetic variants with a high degree of redundancy between features due to linkage disequilibrium (LD). Filter feature selection methods based on information theoretic criteria are well suited to this challenge and will identify a subset of the original variables that should result in more accurate prediction. However, data collected from cohort studies are often high-dimensional genetic data with potential confounders, presenting challenges to feature selection and risk prediction machine learning models. Patients with psoriasis are at high risk of developing a chronic arthritis known as psoriatic arthritis (PsA). The prevalence of PsA in this patient group can be up to 30%, and the identification of high risk patients represents an important clinical research question which would allow early intervention and a reduction of disability. This also provides us with an ideal scenario for the development of clinical risk prediction models and an opportunity to explore the application of information theoretic criteria methods. In this study, we developed feature selection and PsA risk prediction models that were applied to a cross-sectional genetic dataset of 1462 PsA cases and 1132 cutaneous-only psoriasis (PsC) cases using 2-digit HLA alleles imputed using the SNP2HLA algorithm. We also developed a stratification method to mitigate the impact of potential confounder features and illustrate that confounding features impact the feature selection. The mitigated dataset was used to train seven supervised machine learning methods: 80% of the data was randomly selected for training using stratified nested cross validation and 20% was selected randomly as a holdout set for internal validation.
The risk prediction models were then further validated in the UK Biobank dataset containing data on 1187 participants and a set of features overlapping with the training dataset. Performance of these methods was evaluated using the area under the curve (AUC), accuracy, precision, recall, F1 score and decision curve analysis (net benefit). The best model was selected based on three criteria: the ‘lowest number of feature subset’ with the ‘maximal average AUC over the nested cross validation’ and good generalisability to the UK Biobank dataset. In the original dataset, across 100 different bootstraps and seven feature selection (FS) methods, HLA_C_*06 was selected as the most informative genetic variant. When the dataset was mitigated, the single most important genetic feature based on rank was identified as HLA_B_*27 by the seven different feature selection methods, consistent with previous analyses of this data using regression based methods. However, the predictive accuracy of this single feature post mitigation was found to be moderate (AUC = 0.54 (internal cross validation), AUC = 0.53 (internal hold out set), AUC = 0.55 (external dataset)). Sequentially adding additional HLA features based on rank improved the performance of the Random Forest classification model, where 20 2-digit features selected by Interaction Capping (ICAP) achieved AUC = 0.61 (internal cross validation), AUC = 0.57 (internal hold out set) and AUC = 0.58 (external dataset). The stratification method for mitigation of confounding features and filter information theoretic feature selection can be applied to a high dimensional dataset with potential confounders.

Introduction

Precision medicine has the potential to have an enormous impact on healthcare; however, for this potential to be fully realised, we need to be able to accurately predict the outcome of patients in different clinical scenarios. The wealth of genetic and clinical data that is now available for medical research provides an unprecedented opportunity to explore machine learning (ML) approaches for the prediction of the clinical outcomes^1,2.

The use of genetic data in the development of risk prediction models presents a number of challenges mainly attributable to the large number of genetic variants available following imputation strategies and the high degree of redundancy between features due to linkage disequilibrium (LD). Many of these genetic variants may be completely irrelevant to the specific question being asked, or redundant in the context of other features. This may contribute to the increased computational burden of processing many similar features and the potential of overfitting to irrelevant aspects of the data. Therefore, it is important to identify a subset of the original variables (features) that enable more accurate prediction by the elimination of irrelevant or redundant information.

Filter methods based on information theoretic criteria are particularly suited to these challenges as they are computationally less intensive than other methods, less likely to overfit, and they evaluate the relationships of the features independently of any specific classifier³. In addition, information theory based on mutual information has the advantage of accounting for both linear and non-linear dependencies that exist between features, whereas some traditional statistical methods such as logistic and lasso regression assume an additive genetic model^4,5. This is of particular importance for many autoimmune diseases where genetic variants in the human leukocyte antigen (HLA) genes confer a substantial proportion of disease risk and studies have demonstrated highly significant non-additive effects⁵. In addition, the construction of a genetic prediction model may be confounded by issues such as population stratification, often represented by principal components, and ascertainment bias attributed to the method of sample collection. Here we explore the use of information theory based filter feature selection methods on HLA data to classify psoriatic arthritis (PsA)^6,7 from cutaneous-only psoriasis. This is a clinically important question as approximately 30 percent of patients with psoriasis may develop PsA, potentially leading to long-term disability and lower quality of life^8,9,10. The ability to predict which psoriasis patients have a higher risk of developing PsA could lead to intervention strategies that would limit disability. We have previously shown that ascertainment bias in this data, caused by the preferential collection of psoriasis cases with a young age of disease onset, leads to confounding¹¹, and here we illustrate the use of a stratification method to deal with such issues. Finally, we present an independently validated genetic prediction model based on information theoretic methods.

The following contributions are made:

The development of a stratification approach to mitigate confounding
We show that confounding features impact the feature selection and can be successfully mitigated by stratification
We demonstrate the utility of filter information theoretic methods for feature selection in highly complex genetic datasets such as the HLA region
We present an externally validated risk prediction model for PsA using HLA data

To our knowledge, no comparable techniques with stratification, information theoretic feature selection and external validation have been applied previously to the analysis of the HLA region in PsA and psoriasis.

Methods

Sample cohorts

Our training dataset consisted of 1462 PsA patients recruited from rheumatology centres within the UK. Classification of PsA was performed by a rheumatologist based on coexistence of psoriasis and inflammatory arthritis in accordance with the CASPAR (ClASsification criteria for Psoriatic ARthritis) classification system¹² where possible. Recruitment was performed with full written informed consent from the patients (UK PsA National Repository MREC 99/8/84), and all methods followed relevant guidelines and legislation. This study was approved by Central Manchester NHS Research Ethics Committee. Data on 1132 cutaneous-only psoriasis patients (PsC) were available through the Biomarkers of Systemic Treatment Outcomes in Psoriasis study (BSTOP)¹¹. Patients were recruited to BSTOP via the British Association of Dermatologists Biologics Interventions Registry (a UK pharmacovigilance registry, BADBIR.org.uk) from dermatology clinics within the UK. PsC classification in BSTOP is based on interview questioning about rheumatologist-diagnosed PsA at baseline and follow-up visits (twice annually for the first three years of follow-up, then annually).

Imputation of HLA alleles

Genotyping of DNA from PsA patients was performed with the Illumina Immunochip array as previously described¹³. Genotyping of DNA from PsC patients was performed at King’s College London using the Illumina HumanOmniExpressExome-8v1-2_A as previously described¹⁴. Quality control was consistent across both datasets, following conventional standards of data missingness (SNP and sample), SNP allele frequency, Hardy-Weinberg equilibrium and sample outliers based on relatedness and ancestry. Datasets were combined retaining an intersection of high quality SNPs. Quality control of UK Biobank genotype data is described in detail by Bycroft et al.¹⁵ Imputation of SNPs, amino acids and HLA alleles for the training and validation datasets was carried out using the SNP2HLA software package (version 1.0.3) with the T1DGC reference panel¹¹. Variants with an information score < 0.9 or a MAF < 0.01 were excluded and all analyses were conducted using imputed dosage. Following quality control, the training dataset consisted of 2093 patients with 172 HLA alleles (2-digit and 4-digit), 683 amino acids and 5862 SNPs. To ensure that our trained model is applicable to the validation data in UK Biobank, the data was filtered to only contain features that are shared between the internal training cohort and the external validation cohort. This analysis focuses on the 70 2-digit HLA features and three potential confounders: age of psoriasis onset (aao) and the top two principal components, ‘PC1’ and ‘PC2’, for mitigation of population stratification¹⁶. The four-stage research pipeline is illustrated in Fig. 1: data pre-processing, confounding mitigation, feature selection and model development for the subtype prediction.

Stratification development

To mitigate the effect of confounders in feature selection, a ‘stratification’ method¹⁷ was developed to control for the three confounders of concern, in which the association between each feature and the outcome is tested within different strata of the confounding feature. During stratification, individuals are divided into several strata on the basis of the confounder, where the number of individuals in each stratum may or may not be equal. Figure 2 illustrates the methodology of this approach, where ‘Feature 2’ has been assigned as the known confounder, age of psoriasis onset. The minimum and maximum values of ‘Feature 2’ were determined and the range was divided into narrow-width bins of ‘0-20’, ‘21-40’, ‘41-60’, ‘61-80’ and ‘81-100’ years. The frequency distribution of each target label (‘PsC’=‘0’, ‘PsA’=‘1’) was balanced within each age boundary by random sampling with replacement (bootstrap)¹⁸. For instance, in the ‘Stratification Unit’ of Fig. 2, the number of PsC patients in age boundary ‘61-80’ is ten fewer than the number of PsA patients, and the frequency distribution of the two classes was balanced by the inclusion of 10 random samples in target class 0. The same procedure was applied to ‘PC1’ and ‘PC2’, where patients (n) are divided into two strata, (−1,0) and [0,1).
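The balancing step described above can be sketched in Python with pandas. This is a minimal illustration rather than the authors' implementation: the function name `balance_within_strata`, the column names and the seed handling are hypothetical, and it assumes that the smaller class in each stratum is topped up by sampling its own rows with replacement.

```python
import pandas as pd

def balance_within_strata(df, confounder, target, bins, seed=0):
    """Bin `confounder`; inside each bin, upsample the smaller class with
    replacement until both classes have the same number of rows."""
    strata = pd.cut(df[confounder], bins=bins, include_lowest=True)
    parts = []
    for i, (_, stratum) in enumerate(df.groupby(strata, observed=True)):
        parts.append(stratum)
        counts = stratum[target].value_counts()
        if len(counts) < 2:
            continue  # only one class present, nothing to balance against
        deficit = int(counts.max() - counts.min())
        if deficit > 0:
            pool = stratum[stratum[target] == counts.idxmin()]
            # Bootstrap draw from the minority class of this stratum.
            parts.append(pool.sample(n=deficit, replace=True, random_state=seed + i))
    return pd.concat(parts, ignore_index=True)
```

For age of onset, `bins=[0, 20, 40, 60, 80, 100]` reproduces the five age boundaries; for ‘PC1’ and ‘PC2’ the call would be repeated with `bins=[-1, 0, 1]`, which only approximates the open and closed interval ends given in the text.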

Information theoretic feature selection

Filter methods select features based on a performance measure regardless of the employed data modelling algorithm, and separate the classification and feature selection components³. Filter methods are generally applied as pre-processing steps, with subset selection procedures that are independent of the learning algorithm. The defining component of filter based methods is the scoring criterion, which is often referred to as a ‘relevance index’. The relevance index denotes how useful each feature is likely to be for the ML classification methods. Although this leads to a faster learning process, it is possible for the criterion used in the pre-processing step to result in a subset that may not work very well downstream in the learning algorithm. Filter based methods fall into two categories: univariate and multivariate. In univariate methods, the scoring criterion only considers the relevancy of features while ignoring feature redundancy. Mutual information (Shannon, 1948)^19,20 is a univariate feature selection approach that measures the amount of information shared by an input feature X and the class label (target) Y. In (1), the lower case x and y denote possible values that the variables X and Y can adopt from their respective alphabets. To obtain this, we need to estimate the distributions $p(x)$, $p(y)$ and $p(xy)$.

$$\begin{aligned} I(X;Y)= \sum _{x\in \mathcal{X}}\sum _{y \in \mathcal{Y}} p(xy)\log \frac{ p(xy)}{p(x)p(y)} \end{aligned}$$

(1)
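As an illustration of equation (1), the following sketch computes a plug-in estimate of mutual information for two already discretised variables by counting joint frequencies. The function name `mutual_information` is hypothetical, and the natural logarithm is an assumption because the text does not state a base.

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in nats for two discrete 1-D arrays."""
    x = np.asarray(x).ravel()
    y = np.asarray(y).ravel()
    # Map the observed values to integer codes 0..n-1.
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    # Joint histogram, normalised to the joint distribution p(xy).
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)  # marginal p(x), shape (nx, 1)
    py = joint.sum(axis=0, keepdims=True)  # marginal p(y), shape (1, ny)
    nz = joint > 0                         # 0 * log 0 is taken as 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px * py)[nz])))
```

Two identical balanced binary variables share log 2 nats, while two independent ones share none.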

The Mutual Information Maximization (MIM) method given by (2) examines the mutual information between the class label Y and each feature $\hbox {X}_{{\mathrm{k}}}$, and the K top-scoring features are selected²¹. MIM assumes that all the features are independent and does not account for any dependencies between the features.

$$\begin{aligned} J_{\mathrm{{mim}}} (X_{{\mathrm{k}}})= I(X_{{\mathrm{k}}};Y) \end{aligned}$$

(2)

Multivariate methods investigate the multivariate interactions within features, and the scoring criterion is a weighted sum of feature relevancy and redundancy. The multivariate feature selection methods are described as follows. Joint Mutual Information (JMI), given by (3), was proposed by Yang and Moody (1999)^22,23. JMI is the information between the target and a joint random variable defined by pairing the candidate $\hbox {X}_{{\mathrm{k}}}$ with each currently selected feature $\hbox {X}_{{\mathrm{j}}}$ in the set S. The redundancy term is fully captured by JMI.

$$\begin{aligned} J_{{\mathrm{jmi}}}(X_{{\mathrm{k}}})= \sum _{X_{{\mathrm{j}}}\in S}I(X_{{\mathrm{k}}}X_{{\mathrm{j}}};Y) \end{aligned}$$

(3)

Minimal-Redundancy-Maximal-Relevance (mRMR), given by (4), was proposed by Peng et al.²⁴ It takes the mean of the redundancy term and eliminates the conditional term. In equation (4), |S| is the size of the selected feature set.

$$\begin{aligned} J_{{\mathrm{mrmr}}}(X_{{\mathrm{k}}})= I(X_{{\mathrm{k}}};Y) - \frac{1}{|S|}\sum _{X_{{\mathrm{j}}}\in S} I(X_{{\mathrm{k}}};X_{{\mathrm{j}}}) \end{aligned}$$

(4)

Conditional Mutual Information Maximization (CMIM), given by (5), was proposed by Fleuret (2004)²⁵ and is probably the most well-known recent criterion. CMIM measures the information between a feature and the target, conditioned on each currently selected feature. The interaction information is the term in square brackets, which can be either negative or positive. A negative value indicates that the shared information between $\hbox {X}_{{\mathrm{k}}}$ and Y has decreased as the result of including $\hbox {X}_{{\mathrm{j}}}$.

$$\begin{aligned} J_{{\mathrm{cmim}}}(X_{{\mathrm{k}}})=I(X_{{\mathrm{k}}};Y)- \underset{{X_{{\mathrm{j}}}}\in S}{\max }[I(X_{{\mathrm{k}}};X_{{\mathrm{j}}})-I(X_{{\mathrm{k}}}; X_{{\mathrm{j}}}|Y)] \end{aligned}$$

(5)

Mutual information feature selection (MIFS), given by (6), does not consider conditional redundancy ($\gamma$ = 0), but it does incorporate the redundancy penalty, weighted by $\beta$ (Brown et al., 2012).

$$\begin{aligned} J_{{\mathrm{mifs}}}(X_{{\mathrm{k}}})=I(X_{{\mathrm{k}}};Y)-\beta \sum _{X_{{\mathrm{j}}}\in S}I(X_{{\mathrm{k}}}; X_{{\mathrm{j}}}) \end{aligned}$$

(6)

Double input symmetrical relevance (DISR), given by (7), aims to better include such complementary features by expanding JMI²⁶. DISR normalises the information provided by a feature by how well the given feature complements the other features.

$$\begin{aligned} J_{{\mathrm{disr}}}(X_{{\mathrm{k}}})=\sum _{X_{{\mathrm{j}}}\in S}\frac{I(X_{{\mathrm{k}}}X_{{\mathrm{j}}};Y)}{H(X_{{\mathrm{k}}}X_{{\mathrm{j}}}Y)} \end{aligned}$$

(7)

Interaction capping (ICAP) is approximated by the following equation.

$$\begin{aligned} J_{{\mathrm{icap}}}(X_{{\mathrm{k}}})={I(X_{{\mathrm{k}}};Y)}-\sum _{X_{{\mathrm{j}}}\in S} \max [0,\{I(X_{{\mathrm{k}}};X_{{\mathrm{j}}})-I(X_{{\mathrm{k}}}; X_{{\mathrm{j}}}|Y)\}] \end{aligned}$$

(8)
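To make the greedy use of these criteria concrete, the sketch below ranks features with the ICAP score of equation (8) on discretised data. It is an illustrative re-implementation, not the skfeature code used in the study; the helper names `mi`, `cmi` and `icap_rank`, and the tie-breaking by lowest feature index, are assumptions.

```python
import numpy as np

def mi(x, y):
    """Plug-in mutual information (nats) between two discrete arrays."""
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= joint.sum()
    outer = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / outer[nz])))

def cmi(x, z, y):
    """Conditional mutual information I(X;Z|Y) = sum_y p(y) I(X;Z | Y=y)."""
    total = 0.0
    for value in np.unique(y):
        mask = y == value
        total += mask.mean() * mi(x[mask], z[mask])
    return total

def icap_rank(X, y, k):
    """Greedy forward selection of k columns of X using the ICAP score."""
    n_features = X.shape[1]
    relevance = [mi(X[:, j], y) for j in range(n_features)]
    penalty = np.zeros(n_features)  # running sum of capped redundancy terms
    remaining = set(range(n_features))
    selected = []
    while remaining and len(selected) < k:
        best = max(sorted(remaining), key=lambda j: relevance[j] - penalty[j])
        selected.append(best)
        remaining.remove(best)
        for j in remaining:
            redundancy = mi(X[:, j], X[:, best]) - cmi(X[:, j], X[:, best], y)
            penalty[j] += max(0.0, redundancy)
    return selected
```

In a toy dataset where the second column duplicates the first and the third carries independent information about the target, the duplicate is pushed to the end of the ranking, which is the behaviour the redundancy term is meant to produce.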

We focus on seven filter feature selection (FS) methods: mutual information feature selection (MIFS), mutual information maximisation (MIM), joint mutual information (JMI), minimal-Redundancy-Maximal-Relevance (mRMR), conditional mutual information maximisation (CMIM), Interaction Capping (ICAP) and Double Input Symmetrical Relevance (DISR)²⁷. We selected these methods based on computational efficiency, popularity in the literature and publicly available implementations, which increases their usability. The description of each feature selection method can be found in supplementary section 1. Each feature is assigned a rank in order of its FS score and the top K features are selected. Predefined requirements for a certain number of features or other stopping criteria can inform the value of K^21,28.

Figure 3 shows our methodology for feature selection. We created 100 random samples with replacement (bootstraps)¹⁸ from the original data, and the top subset of features was obtained for each FS method. For the information theoretic criteria we estimated the necessary distributions using histogram estimators, and features were discretised independently²¹. The HLA alleles were discretised to (0,1), [1,2) and PC1, PC2 to (−1,0), [0,1). In the ‘Feature Selection Aggregator’, the outputs vote ‘V’ and rank ‘R’ are generated. The vote ‘V’ for a feature is the number of bootstraps, out of 100, in which the feature was selected by a given criterion. The rank ‘R’ is the average rank over 100 bootstraps, as the rank of each feature in the top selected features can vary in each bootstrap.

Once this ranking had been computed, a subset composed of the best-ranked features was created. For instance, Feature 1 with vote ‘V=90’ and rank ‘R=3.5’ over 100 bootstraps is selected as the top ‘1’ feature subset. We incrementally selected top features ranging from (n=1,10, … 70) using each of the seven feature selection methods. This subset of selected features was then used as an input to each of the seven supervised machine learning (ML) algorithms.

The top selected feature subset may vary with respect to each FS criterion. We therefore proposed an overall ranking (Fig. 3) with the ‘Technique Vote’, the ‘Average Bootstraps Vote’ (ABV) and the ‘Average Bootstrap Rank’ (ABR), which explore the rank of features across the seven different FS techniques. The ‘Technique Vote’ records the selection of a feature by the FS criteria. The ABV and ABR are defined by equations (9) and (10), respectively.

$$\begin{aligned} \text{ABV}= \frac{\text{Sum of the votes (V) of a feature across the FS criteria}}{\text{Number of FS techniques}} \end{aligned}$$

(9)

$$\begin{aligned} \text{ABR}= \frac{\text{Sum of the ranks (R) of a feature across the FS criteria}}{\text{Number of FS techniques}} \end{aligned}$$

(10)
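The aggregation of votes and ranks can be sketched as follows. This is a hedged reading of the description above, not the authors' code: the names `aggregate_bootstraps`, `abv` and `abr` are hypothetical, a vote is counted whenever a feature appears in the top `top_k` of a bootstrap, and the rank is averaged only over the bootstraps in which the feature appears.

```python
import numpy as np

def aggregate_bootstraps(rankings, n_features, top_k):
    """rankings: one feature ordering (best first) per bootstrap sample.
    Returns the vote V and the mean rank R of every feature for one criterion."""
    votes = np.zeros(n_features, dtype=int)
    rank_sum = np.zeros(n_features)
    for order in rankings:
        for position, feature in enumerate(order[:top_k], start=1):
            votes[feature] += 1
            rank_sum[feature] += position
    # Features that were never selected get NaN instead of a rank.
    mean_rank = np.divide(rank_sum, votes,
                          out=np.full(n_features, np.nan), where=votes > 0)
    return votes, mean_rank

def abv(votes_by_technique):
    """Equation (9): rows are FS techniques, columns are features."""
    return np.asarray(votes_by_technique, dtype=float).mean(axis=0)

def abr(ranks_by_technique):
    """Equation (10): assumes every technique supplied a rank for each feature."""
    return np.asarray(ranks_by_technique, dtype=float).mean(axis=0)
```

With seven criteria, `votes_by_technique` and `ranks_by_technique` would each have seven rows, one per FS method.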

All these feature selection methods are compared with the case in which all of the 70 2-digit HLA features are fed into the classifier for prediction. All feature selection methods are publicly available in the open access skfeature package (version 1.00) for the Python programming language (version 3.6.10), which provides an individual ranking for each feature in the dataset.

Supervised risk prediction model development and internal validation

In this study, risk prediction models for PsA were developed using seven supervised ML algorithms^29,30,31: Logistic Regression (LR), AdaBoost, XGBoost, Random Forest (RF), K-nearest neighbour classifier (KNNC), Decision Tree (DT) and Gaussian naive Bayes (NB)²⁹. For each feature subset the response is PsA (class=1) or PsC (class=0). The original set of examples provides the training data, and each learning algorithm was trained and validated using stratified nested cross-validation. Many of the machine learning algorithms employed have one or more hyperparameters that must be selected to optimise model performance. The hyperparameters for each ML model were tuned within a 5-2 fold nested cross-validation stage. The purpose of our nested cross-validation was to obtain an unbiased view of the overall expected performance of each model, so the hyperparameter tuning in this step only helps to find the most accurate version of each algorithm within each nested cross-validation fold. The aim at this stage is only to evaluate and compare the learning power of each algorithm using different folds of the data while removing any bias in the process. To obtain the final hyperparameters for each ML model, we then re-trained all the algorithms and performed hyperparameter tuning using the entire training dataset. These hyperparameters were then used to evaluate the performance of each model on the test dataset.
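The nested cross-validation procedure can be sketched with scikit-learn as below. The sketch makes several assumptions that are not taken from the paper: ‘5-2 fold’ is read as five outer and two inner folds, the data are synthetic (`make_classification`), the hyperparameter grid is arbitrary, and the 80/20 split is stratified by class.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score, train_test_split)

# Synthetic stand-in for the selected HLA features and the PsA/PsC labels.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

inner = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)  # tuning
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # evaluation
grid = {"n_estimators": [50, 100], "max_depth": [3, None]}         # illustrative
search = GridSearchCV(RandomForestClassifier(random_state=0), grid,
                      scoring="roc_auc", cv=inner)

# Outer loop: each fold re-runs the inner search, so the reported AUC is not
# biased by hyperparameter selection.
nested_auc = cross_val_score(search, X_train, y_train,
                             scoring="roc_auc", cv=outer)

# Final tuning on the entire training set, then a single hold-out evaluation.
search.fit(X_train, y_train)
holdout_auc = roc_auc_score(y_hold, search.predict_proba(X_hold)[:, 1])
print(nested_auc.mean(), holdout_auc)
```

Passing the `GridSearchCV` object itself to `cross_val_score` is what makes the procedure nested: tuning happens only on the training portion of each outer fold.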

External validation of risk prediction models

Ultimately, fully independent external validation with data available at the time of PsA prediction development is important. Here we use data from UK Biobank for external validation to test the generalisability of the seven developed ML classifiers on PsA data. The assessment is for reproducibility rather than transportability, as the external data is very similar to the PsA-MD dataset³². Figure 4 presents our pipeline for internal model development and external validation. 80% of the data was randomly selected for training of the ML classifiers using 5-2 fold stratified nested cross validation, and 20% was selected randomly as a holdout set for internal validation.

There are 448 different post-mitigation models, trained using combinations of ‘number of features (1,10,...70)’, ‘7 feature selection methods’ and ‘7 ML model types’. We used the whole data and the optimal hyperparameters to test the best generated models in the UK Biobank dataset. Therefore, for each ML model, 448 (all models)/7 (ML models) = 64 different combinations were generated, and the model with the maximal average AUC in nested cross validation was selected as the best model and tested in the external validation.

All machine learning analyses were performed in Python using the numpy, pandas, sklearn, matplotlib and XGBOOST packages, which provide a user-friendly interface to many machine learning algorithms. We used AUC, the precision-recall curve, precision (positive predictive value (PPV)), recall (true positive rate or sensitivity) and F1 score to evaluate the performance of the ML classifiers.

Results

We developed a stratification method to control for known confounding by population stratification (two principal components) and age of psoriasis onset. We then used a range of information theory feature selection methods and supervised ML classification methods to develop a risk prediction model for classifying PsA from cutaneous-only psoriasis. The best model was then externally validated in an independent dataset from UK Biobank to assess the generalisability of the predictive performance.

Impact of confounding on feature selection

We investigated the impact of confounding on feature selection pre and post-mitigation by stratification using seven FS information theoretic criteria methods MIFS, MIM, JMI, mRMR, CMIM, ICAP and DISR. The FS information theoretic methods were applied to a dataset of 1462 PsA cases and 1132 cutaneous-only psoriasis cases¹³ using 2-digit 70 HLA alleles. Figure 5a,b illustrate the top 10 selected features for the seven FS criteria and its vote over 100 bootstraps¹⁸ pre and post-mitigation for the three potential confounders.

HLA_C_*06 is the genetic variant that makes the largest contribution to psoriasis susceptibility and is known to be highly correlated with age of psoriasis onset¹¹. Over 100 different bootstraps and seven FS methods, HLA_C_*06 was selected as the most informative genetic variant in the original dataset (Fig. 5a). After the mitigation of the three potential confounders, HLA_B_*27 had vote ‘94’ for ‘DISR’, followed by HLA_B_*07 with the vote ‘81’ in ‘ICAP’ and HLA_DRB _01 with the vote ‘83’ (Fig. 5b). PC1 and PC2 were not observed in any of the top 10 feature subsets, and the majority vote over 100 bootstraps for ‘age onset’ dropped dramatically after the stratification mitigation (Fig. 5b). The results demonstrate how confounding impacts the selected features and that stratification mitigates this impact; this is clearly illustrated by the absence of HLA_C_*06 following the mitigation.

Impact of confounding on classification and internal validation

For ML classification we used dynamic and fixed numbers of features, where the main aim was to identify a subset of features that maximises risk prediction model performance in the internal dataset and generalisability to UK Biobank as the external validation set. In order to compare the performance of different models on a varying number of features with feature steps (1, 10,...70), the top feature subsets were consecutively incorporated into each model, both pre mitigation with no confounders and post mitigation of the three confounding features.

Figure 6 shows the AUC on the hold out set with ICAP feature selection for all seven models, ‘pre mitigation with no confounders’ and ‘post mitigation with three mitigated confounders’. It can be observed that, for ICAP feature selection, all classifiers show similar predictive performance. There is a $\approx$10% drop in classification performance post mitigation of the confounding features. Pre mitigation, the highest performance, an AUC of around 0.63, was obtained for LR, RF, AdaBoost and Gaussian NB when the top 20 features selected by ICAP were incorporated into these models. Post mitigation, the AUC was 0.53 for the top 1 HLA feature, which improved by $\approx$20% for 20 features in KNNC, and there is a drop in the AUC for the other classifiers post mitigation. In conclusion, adding more features to the models did not improve the AUC dramatically pre or post mitigation (except for KNNC, which showed different behaviour). Supplementary Figures 2, 3, 4 and 5 were generated for all other feature selection methods pre and post mitigation, with the AUC obtained in nested cross validation and on the hold out set. All the feature selection methods and classifiers show similar behaviour to ‘ICAP’. The results of the overall ranking of feature selection can be found in Supplementary Figures 2, 3, 4 and 5.

External validation of classification models

The risk prediction models were then further validated in the UK Biobank dataset containing data on 1187 participants and a set of features overlapping with the training dataset. The validation dataset had similar characteristics, although a lower proportion of admissions from the patients with PsA and PsC. Table 1 presents the comparison of AUC, precision and recall for the best generated models out of the 448 different models. The best model numbers are 402 (LR), 303 (AdaBoost), 416 (DT), 398 (XGBoost), 232 (KNNC), 39 (Gaussian NB) and 184 (RF). The performance depends on the feature selection method, the number of selected features and the prediction model. The results for accuracy and F1 score of the best models can be found in Supplementary Table 1.

Table 1 The best generated models out of 448 generated models.

All models for predicting risk of PsA demonstrated moderate predictive performance. Post mitigation, an RF model with 20 features selected by ICAP performed as the final best overall model. RF has good generalisability and robustness with respect to internal cross validation (AUC = 0.61, Precision = 0.59, Recall = 0.54), the internal hold out set (AUC = 0.58, Precision = 0.54, Recall = 0.45) and external validation (AUC = 0.58, Precision = 0.58, Recall = 0.59). Amongst all models, KNNC (model number = 232, DISR, 60 features) is overfitted, with internal AUC (cross validation = 0.73, hold out = 0.76) well above its external AUC in UK Biobank (0.53). Of note, Gaussian NB (model number = 39, MIM, 10 features) and DT (model number = 416, DISR, 10 features) have very similar AUC to RF but lower precision and recall. Each combination of feature selection method and classification model behaves differently under the mitigation technique. Variability in the classification models and feature selection methods was the main factor in the performance variation. Overall, classification AUCs at the model development stage were comparable to the AUCs obtained when those models were used to predict labels in UK Biobank, as shown in Fig. 7. Nested cross validation was sufficient to control overfitting and produced results which generalised well to the independent test sample. The hyperparameters for each classifier are shown in Table 2.

Table 2 Machine Learning Algorithms and their Corresponding Hyperparameters.

Supplementary Figures 8 and 9 were generated for the accuracy, precision, recall and F1 score of the 448 generated models and of each model, respectively. Figure 8 shows the receiver operating characteristic (ROC) curve and precision-recall (PR) curve for predicting PsA. The ROC curves (a) and PR curves (b) for internal validation (cross validation, hold out set) and external validation illustrate that RF is the best classifier. The ROC and PR curves for the other classifiers are given in Supplementary Figs. 6 and 7, respectively. Figure 9a,b shows the positive net benefit for each model within a specific threshold range in cross validation and external validation. The probability threshold $\hbox {p}_{{\mathrm{t}}}$ in the studied population is between 25% and 75%. KNNC has a positive net benefit between 25% and 75% in cross validation, but its performance, measured by AUC, drops in external validation, as shown in Table 1. For our best model, ‘Random Forest’, we observe a positive net benefit between 45% and 60% threshold probability. The decision curve for each model in cross validation and the external dataset is shown in Supplementary Fig. 10.
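For readers unfamiliar with decision curve analysis, the sketch below computes net benefit at a given threshold probability using the standard definition, true positives per patient minus false positives per patient weighted by the odds of the threshold. The paper does not state its formula, so this definition and the function name `net_benefit` are assumptions.

```python
import numpy as np

def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    predicted_positive = y_prob >= threshold
    n = y_true.size
    true_pos = np.sum(predicted_positive & (y_true == 1))
    false_pos = np.sum(predicted_positive & (y_true == 0))
    # False positives are weighted by the odds of the threshold probability.
    return true_pos / n - false_pos / n * threshold / (1 - threshold)
```

Evaluating this function over a grid of thresholds, for example from 0.25 to 0.75, and plotting the result against the threshold gives a decision curve of the kind shown in Fig. 9.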

Discussion

Our results clearly demonstrate the impact of confounding on both feature selection and model generalisability and how stratification can mitigate this effect. This is well illustrated by considering the results for HLA_C_*06, which is a known risk factor for early onset psoriasis (type I): we have previously shown the preferential collection of type I psoriasis in genetic studies can lead to ascertainment bias when compared with PsA. Our stratification approach mitigated this effect leading to the expected identification of HLA_B_*B27 as the predominant risk factor for PsA in psoriasis.
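The effect of a confounder on an information theoretic relevance score can be illustrated with a small simulation. The Python sketch below is not the stratification procedure used in this study: it simply simulates a hypothetical binary confounder (for example early onset disease) that influences both carriage of an allele and the sampled outcome label, and compares the marginal mutual information with the mutual information computed within strata of the confounder. All probabilities and variable names are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats for two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def conditional_mutual_information(xs, ys, zs):
    """I(X;Y|Z): stratum-size-weighted average of I(X;Y) within each stratum of Z."""
    n = len(xs)
    total = 0.0
    for z in set(zs):
        idx = [i for i in range(n) if zs[i] == z]
        total += (len(idx) / n) * mutual_information(
            [xs[i] for i in idx], [ys[i] for i in idx])
    return total


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 20000
# Hypothetical confounder Z; the allele X and the label Y each depend on Z only.
z = [int(rng.random() < 0.5) for _ in range(n)]
x = [int(rng.random() < (0.6 if zi else 0.1)) for zi in z]
y = [int(rng.random() < (0.2 if zi else 0.7)) for zi in z]

marginal = mutual_information(x, y)
stratified = conditional_mutual_information(x, y, z)
print(f"I(X;Y) = {marginal:.4f} nats, I(X;Y|Z) = {stratified:.4f} nats")
```

In this simulation the allele carries no information about the outcome once the confounder is fixed, yet its marginal score is clearly non-zero, so an unstratified filter method would rank it as relevant.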

The issue of selection bias and confounding is increasingly being recognised as an important issue in the statistical methodology literature^33,34. The main aim of this work was to demonstrate the application of information theory methods to genetics, in particular for feature selection in complex regions of the genome, namely the MHC, and in the presence of confounding. The results from our external validation confirm the generalisability and reliability of the AUC values obtained in the training data following mitigation by stratification. In addition, high dimensional data with many redundant features, such as found with genetic datasets, are a significant challenge for Machine Learning³⁵. Our results demonstrate the utility of information theory feature selection methods in complex genetic datasets, such as the HLA region, which, coupled with the fact that they are less prone to overfitting and computationally efficient, makes them attractive options for feature selection in genetic datasets.
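To make the role of redundancy concrete, the following Python sketch contrasts two of the classic criteria, MIM (rank by relevance only) and mRMR (relevance minus mean redundancy with the already selected features), using a greedy forward search. It is a simplified illustration rather than the implementation used in this study; the toy dataset, in which feature 1 is an exact copy of feature 0 (mimicking two variants in complete LD), and all function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats for two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def greedy_select(features, y, k, beta):
    """Greedy forward selection with score
    I(X_i;Y) - beta(|S|) * sum over selected j of I(X_j;X_i)."""
    selected, remaining = [], list(range(len(features)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = mutual_information(features[i], y)
            redundancy = sum(mutual_information(features[j], features[i])
                             for j in selected)
            return relevance - beta(len(selected)) * redundancy
        best = max(remaining, key=score)  # ties resolve to the lowest index
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: y depends strongly on a and weakly on b; a and b are independent.
rows = []
for a, b, n_pos in [(0, 0, 1), (0, 1, 3), (1, 0, 7), (1, 1, 9)]:
    rows += [(a, b, 1)] * n_pos + [(a, b, 0)] * (10 - n_pos)
f0 = [r[0] for r in rows]
f1 = list(f0)               # exact duplicate of f0 (fully redundant)
f2 = [r[1] for r in rows]   # weaker but non-redundant feature
y = [r[2] for r in rows]

mim = greedy_select([f0, f1, f2], y, k=2, beta=lambda s: 0.0)
mrmr = greedy_select([f0, f1, f2], y, k=2, beta=lambda s: 1.0 / s if s else 0.0)
print("MIM:", mim, "mRMR:", mrmr)
```

MIM selects the duplicated pair because it ignores redundancy, whereas mRMR replaces the duplicate with the weaker but complementary feature.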

Our study has several strengths. Firstly, we aimed to avoid overfitting in two stages: initially at the internal stage, where we used stratified nested cross validation and tested each model on unseen data (hold-out set), and subsequently by externally validating the best models on a completely independent dataset. Secondly, our feature selection method is independent of classification methods and does not assume an additive or linear relationship between the features and the outcome. Many prediction models and risk scores have been developed with feature selection methods based on traditional statistical approaches such as logistic and lasso regression. The traditional methods will fit better if the data is linearly separable^36,37. If such a linear relationship does not exist then the model may oversimplify complex relationships among features with non-linear interactions, leading to the potential loss of significant relevant information^38,39, which is likely to be the case in the HLA region for many autoimmune diseases where non-additive effects have been reported^5,40.
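A minimal example of a purely non-additive effect is an outcome determined by the exclusive-or of two binary features. The Python sketch below, which uses hypothetical toy data and is not drawn from the study, shows that each feature alone carries zero mutual information with the outcome, while the pair considered jointly carries the full ln 2 nats.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats for two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


# All four combinations of two binary features, equally frequent.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [a ^ b for a, b in zip(x1, x2)]  # outcome is the XOR of the two features

single_1 = mutual_information(x1, y)
single_2 = mutual_information(x2, y)
joint = mutual_information(list(zip(x1, x2)), y)  # treat the pair as one variable
print(single_1, single_2, joint)
```

An additive model built on the individual effects would find nothing here, which is why criteria that evaluate features jointly with those already selected are of interest when interactions are expected.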

The moderate performance of our prediction model in internal and external validation could be explained by the fact that the imputed HLA alleles are not sufficient to differentiate PsA from cutaneous-only psoriasis despite this being the major PsA genetic risk factor⁴¹.

In addition, the sample size and cross-sectional nature of the dataset may also be a limitation for training machine learning algorithms, since risk prediction models based on ML classifiers perform better when the number of training samples is large⁴². This research examined genetic variants in the MHC region only, so genetic variants outside the MHC region may improve the prediction models' performance. Combining clinical and genetic data in a longitudinal fashion could also improve performance.

We used classic information theoretic methods; using two state-of-the-art information theoretic methods may improve the performance of the models and the selection of informative features.

Finally, whilst we did our utmost to ensure that the cutaneous-only psoriasis reference groups were free from PsA, there is the potential for phenotype misclassification where a proportion of these participants have gone on or will go on to develop PsA. The BSTOP patients are screened for PsA with a questionnaire, which is not as effective as screening by a rheumatologist. In this study, we can therefore assume a certain level of undiagnosed PsA in the PsC group, which will impact the classification accuracy^43,44. In general, larger numbers of clearly characterised participants are important, both for PsC and PsA. In PsC, area of involvement, as well as overall Psoriasis Area and Severity Index (PASI) and nail disease, should be taken into consideration. This would impact both model training and external validation.

An ML algorithm is considered non-generalisable and unstable if a small change in the training set causes a large change in the performance of the algorithm⁴⁵. The more stable an algorithm, the more reliable its results and the greater the confidence in them. It is not adequate for an ML algorithm to perform well on a hold-out test dataset; ideally it must also be stable and generalisable to external datasets. To our knowledge this is the first study to explore the application of information theoretic feature selection methods to genetic data. A recent study exploring machine learning methods for the prediction of PsA reported an AUC of 0.58 in cross-validation and 0.54 on the training dataset using five HLA variants. We have used the established 'classic information theoretic methods', for which libraries are currently available. Two state-of-the-art information theoretic methods, 'Feature selection considering Uncertainty Change Ratio of the Class Label' and 'Feature redundancy term variation for mutual information-based feature selection', may improve the performance of the prediction models⁴⁶. In conclusion, our study demonstrates the ability of a stratification approach to mitigate the impact of confounding, and we present an externally validated model based on data from the HLA genes for predicting risk of PsA in patients with psoriasis.

Conclusion and future work

This study showed the ability of stratification, filter feature selection methods and machine learning to identify risk factors and predict outcome across genetic data, which should lead to greater insights into disease risk factors with no prior assumption of causality. To our knowledge this is the first study to assess the impact of confounders on feature selection using information theoretic methods and to characterise the risk of developing PsA using machine learning algorithms in a UK psoriasis population. Further validation of the developed methods with different clinical outcomes, different biomarkers, a wider spectrum of genetic variables and different PsA cohorts could provide better insights into their applicability. Future research in the area should move towards combining clinical data and genetics in a longitudinal manner for better prediction of the outcome. MIFS, MIM, JMI, mRMR, CMIM, ICAP and DISR are the seven classic feature selection approaches employed in the proposed methodology. As future work, the proposed stratification and machine learning methods should be compared with two state-of-the-art methods: 'Feature selection considering Uncertainty Change Ratio of the Class Label'⁴⁷ and 'Feature redundancy term variation for mutual information-based feature selection'⁴⁸.

Code availability

Source codes of the programmes and algorithms used for this study are available from the corresponding author upon reasonable request.

References

1. Shamout, F., Zhu, T. & Clifton, D. A. Machine learning for clinical outcome prediction. IEEE Reviews in Biomedical Engineering (2020).
2. Savage, N. Better medicine through machine learning. Commun. ACM 55, 17–19 (2012).
3. Guyon, I. & Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003).
4. Davis, J. V., Kulis, B., Jain, P., Sra, S. & Dhillon, I. S. Information-theoretic metric learning. In Proceedings of the 24th international conference on Machine learning 209–216 (2007).
5. Lenz, T. L. et al. Widespread non-additive and interaction effects within HLA loci modulate the risk of autoimmune diseases. Nat. Genet. 47, 1085–1090 (2015).
6. Bowcock, A. M. & Cookson, W. O. The genetics of psoriasis, psoriatic arthritis and atopic dermatitis. Hum. Mol. Genet. 13, R43–R55 (2004).
7. Liu, Y. et al. A genome-wide association study of psoriasis and psoriatic arthritis identifies new disease loci. PLoS Genet.
8. Ibrahim, G., Waxman, R. & Helliwell, P. The prevalence of psoriatic arthritis in people with psoriasis. Arthritis Care Res. 61, 1373–1378 (2009).
9. Ritchlin, C. T., Colbert, R. A. & Gladman, D. D. Psoriatic arthritis. New Engl. J. Med. 376, 957–970 (2017).
10. Alinaghi, F. et al. Prevalence of psoriatic arthritis in patients with psoriasis: a systematic review and meta-analysis of observational and clinical studies. J. Am. Acad. Dermatol. 80, 251–265 (2019).
11. Bowes, J. et al. Cross-phenotype association mapping of the MHC identifies genetic variants that differentiate psoriatic arthritis from psoriasis. Ann. Rheum. Dis. 76, 1774–1779 (2017).
12. Taylor, W. et al. Classification criteria for psoriatic arthritis: development of new criteria from a large international study. Arthritis Rheum. Off. J. Am. College Rheumatol. 54, 2665–2673 (2006).
13. Bowes, J. et al. Dense genotyping of immune-related susceptibility loci reveals new insights into the genetics of psoriatic arthritis. Nat. Commun. 6, 1–11 (2015).
14. Dand, N. et al. HLA-C*06:02 genotype is a predictive biomarker of biologic treatment response in psoriasis. J. Allergy Clin. Immunol. 143, 2120–2130 (2019).
15. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
16. Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
17. Jager, K., Zoccali, C., Macleod, A. & Dekker, F. Confounding: what it is and how to deal with it. Kidney Int. 73, 256–260 (2008).
18. Davison, A. C. & Hinkley, D. V. Bootstrap Methods and their Application. 1 (Cambridge University Press, 1997).
19. Shannon, C. E. A mathematical theory of communication. Bell Syst. Techn. J. 27, 379–423 (1948).
20. Verdu, S. Fifty years of Shannon theory. IEEE Trans. Inf. Theory 44, 2057–2078 (1998).
21. Brown, G., Pocock, A., Zhao, M.-J. & Luján, M. Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. J. Mach. Learn. Res. 13, 27–66 (2012).
22. Yang, H. & Moody, J. Feature selection based on joint mutual information. In Proceedings of international ICSC symposium on advances in intelligent data analysis, vol. 1999, 22–25 (Citeseer, 1999).
23. Brown, G. A new perspective for information theoretic feature selection. In Artificial intelligence and statistics, 49–56 (PMLR, 2009).
24. Peng, H., Long, F. & Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1226–1238 (2005).
25. Fleuret, F. Fast binary feature selection with conditional mutual information. J. Mach. Learn. Res. 5, 1531–1555 (2004).
26. Bennasar, M., Setchi, R. & Hicks, Y. Feature interaction maximisation. Pattern Recogn. Lett. 34, 1630–1635 (2013).
27. Vergara, J. R. & Estévez, P. A. A review of feature selection methods based on mutual information. Neural Comput. Appl. 24, 175–186 (2014).
28. Duch, W. Filter methods. In Feature Extraction, 89–117 (Springer, 2006).
29. Kotsiantis, S. B., Zaharakis, I. & Pintelas, P. Supervised machine learning: A review of classification techniques. Emerg. Artif. Intell. Appl. Comput. Eng. 160, 3–24.
30. Jalalinajafabadi, F. Computerised GRBAS Assessement of Voice Quality. Ph.D. thesis, The University of Manchester (United Kingdom) (2016).
31. Mohri, M., Rostamizadeh, A. & Talwalkar, A. Foundations of machine learning. ch. 1, 1–3 (2012).
32. Justice, A. C., Covinsky, K. E. & Berlin, J. A. Assessing the generalizability of prognostic information. Ann. Internal Med. 130, 515–524 (1999).
33. Choi, H. K., Nguyen, U.-S., Niu, J., Danaei, G. & Zhang, Y. Selection bias in rheumatic disease research. Nat. Rev. Rheumatol. 10, 403 (2014).
34. Yaghootkar, H. et al. Quantifying the extent to which index event biases influence large genetic association studies. Hum. Mol. Genet. 26, 1018–1030 (2017).
35. Bolón-Canedo, V., Sánchez-Marono, N., Alonso-Betanzos, A., Benítez, J. M. & Herrera, F. A review of microarray datasets and applied feature selection methods. Inf. Sci. 282, 111–135 (2014).
36. Wu, X., Zhu, X., Wu, G.-Q. & Ding, W. Data mining with big data. IEEE Trans. Knowl. Data Eng. 26, 97–107 (2013).
37. Hengl, S., Kreutz, C., Timmer, J. & Maiwald, T. Data-based identifiability analysis of non-linear dynamical models. Bioinformatics 23, 2612–2618 (2007).
38. Obermeyer, Z. & Emanuel, E. J. Predicting the future–big data, machine learning, and clinical medicine. New Engl. J. Med. 375, 1216 (2016).
39. Harrell, F. E. Jr. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis (Springer, 2015).
40. Deutsch, A. J. Widespread Non-Additive and Interaction Effects Within Human Leukocyte Antigen Loci Modulate the Risk of Autoimmune Diseases. Ph.D. thesis (2017).
41. Ho, P. Y. et al. Investigating the role of the HLA-Cw*06 and HLA-DRB1 genes in susceptibility to psoriatic arthritis: comparison with psoriasis and undifferentiated inflammatory arthritis. Ann. Rheumatic Dis. 67, 677–682 (2008).
42. Zacksenhouse, M., Braun, S., Feldman, M. & Sidahmed, M. Toward helicopter gearbox diagnostics from a small number of examples. Mech. Syst. Signal Process. 14, 523–543 (2000).
43. Mease, P. J. et al. Prevalence of rheumatologist-diagnosed psoriatic arthritis in patients with psoriasis in European/North American dermatology clinics. J. Am. Acad. Dermatol. 69, 729–735 (2013).
44. Villani, A. P. et al. Prevalence of undiagnosed psoriatic arthritis among psoriasis patients: systematic review and meta-analysis. J. Am. Acad. Dermatol. 73, 242–248 (2015).
45. Roelofs, R. Measuring Generalization and overfitting in Machine learning. Ph.D. thesis, UC Berkeley (2019).
46. Patrick, M. T. et al. Genetic signature to provide robust risk assessment of psoriatic arthritis development in psoriasis patients. Nat. Commun. 9, 1–10 (2018).
47. Zhang, P. & Gao, W. Feature selection considering uncertainty change ratio of the class label. Appl. Soft Comput. 95, 106537 (2020).
48. Gao, W., Hu, L. & Zhang, P. Feature redundancy term variation for mutual information-based feature selection. Appl. Intell. 50, 1272–1288 (2020).


Acknowledgements

This work was supported by Versus Arthritis (grant number 21173, grant number 21754 and grant number 21755). FJ is supported by an MRC/University of Manchester Skills Development Fellowship (grant number MR/R016615). RBW is supported by the Manchester NIHR Biomedical Research Centre. H.M-O is supported by the National Institute for Health Research (NIHR) Leeds Biomedical Research Centre (LBRC). This research has been conducted using the UK Biobank Resource (approved research ID 7996, Principal Investigator: Dr Suzanne Verstappen). SV is supported by Versus Arthritis (grant numbers 20385, 20380) and the NIHR Manchester Biomedical Research Centre. The authors would like to acknowledge the assistance given by IT Services and the use of the Computational Shared Facility at The University of Manchester. This work was part-funded by the NIHR Manchester BRC. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health. The authors acknowledge the substantial contribution of the BADBIR team to the administration of the project. BADBIR acknowledges the support of the National Institute for Health Research (NIHR) through the clinical research networks and its contribution in facilitating recruitment into the registry. This research was funded/supported by the National Institute for Health Research (NIHR) Biomedical Research Centre based at Guy’s and St Thomas’ NHS Foundation Trust and King’s College London. The views and opinions expressed therein are those of the authors and do not necessarily reflect those of the BADBIR, NIHR, NHS or the Department of Health. 
The authors are grateful to the members of the Data Monitoring Committee (DMC): Dr Robert Chalmers, Dr Carsten Flohr (Chair), Dr Karen Watson and David Prieto-Merino and the BADBIR Steering Committee (in alphabetical order): Oras Alabas, Prof Jonathan Barker, Gabrielle Becher, Anthony Bewley, David Burden, Simon Morrison (CEO of BAD), Prof Phil Laws (Chair), Mr Ian Evans, Prof Christopher Griffiths, Shehnaz Ahmed, Dr Brian Kirby, Elise Kleyn, Ms Linda Lawson, Teena Mackenzie, Tess McPherson, Dr Kathleen McElhone, Dr Ruth Murphy, Prof Anthony Ormerod, Dr Caroline Owen, Prof Nick Reynolds, Amir Rashid, Prof Catherine Smith and Dr Richard Warren. The research was funded/supported by the National Institute for Health Research (NIHR) Biomedical Research Centre based at Guy’s and St Thomas’ NHS Foundation Trust and King’s College London. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health. The authors thank all the patient participants and acknowledge the enthusiastic collaboration of all clinicians and research teams in the United Kingdom and the Republic of Ireland who recruited for this study. This study is supported by the Psoriasis Association and the National Institute of Health and Research Biomedical Research Centre at King’s College London/Guy’s and St Thomas’ National Health Service Foundation Trust. The authors are grateful to the members of the BSTOP Steering Committee (Prof David Burden (Chair), Prof Catherine Smith, Prof Stefan Siebert, Prof Sara Brown, Helen McAteer, Dr Julia Schofield and Dr Nick Dand) for their valuable role in oversight of study delivery. 
This work was supported by Psoriasis Stratification to Optimise Relevant Therapy (PSORT), which is in turn funded by a Medical Research Council Stratified Medicine award (MR/L011808/1), the Psoriasis Association (RG2/10), the National Institute of Health and Research Biomedical Research Centre at King’s College London/Guy’s and St Thomas’ National Health Service Foundation Trust, the National Institute of Health and Research Manchester Biomedical Research Centre , and the National Institute of Health and Research Newcastle Biomedical Research Centre . TT is supported by an MRC Clinical Research Training Fellowship (MR/R001839/1). ND is supported by Health Data Research UK (MR/S003126/1). The British Association of Dermatologists Biologics and Immunomodulators Register is coordinated by the University of Manchester and funded by the British Association of Dermatologists. Finally, we acknowledge the enthusiastic collaboration of all of the dermatologists and specialist nurses in the U.K. and the Republic of Ireland who provide the BADBIR data. The principal investigators at the participating sites are listed at the following website: http://www.badbir.org/Clinicians/.

Author information

A comprehensive list of consortium members appears at the end of the paper.

Authors and Affiliations

Centre for Genetics and Genomics Versus Arthritis, Centre for Musculoskeletal Research, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre, The University of Manchester, Manchester, M13 9PT, UK
Farideh Jalali-najafabadi, Michael Stadler, Mehreen Soomro, Pauline Ho, Anne Barton & John Bowes
Department of Medical and Molecular Genetics, Faculty of Life Sciences and Medicine, King’s College London, London, UK
Nick Dand & Michael A. Simpson
Department of Medicine, University of Cambridge, Cambridge, UK
Deepak Jadon
NIHR Manchester Musculoskeletal Biomedical Research Unit, Central Manchester NHS Foundation Trust, Manchester Academic Health Science Centre, Manchester, UK
Pauline Ho, Anne Barton & John Bowes
NIHR Leeds Biomedical Research Centre, Leeds Teaching Hospitals Trust and Leeds Institute of Rheumatic and Musculoskeletal Disease, University of Leeds, Manchester, UK
Helen Marzo-Ortega & Philip Helliwell
Royal National Hospital for Rheumatic Diseases and Dept Pharmacy and Pharmacology, University of Bath, Bath, UK
Eleanor Korendowych & Neil McHugh
Division of Epidemiology and Public Health, University of Nottingham, Nottingham, UK
Jonathan Packham
St John’s Institute of Dermatology, Guys and St Thomas’ Foundation Trust, London, UK
Catherine H. Smith
St John’s Institute of Dermatology, Faculty of Life Sciences and Medicine, King’s College London, London, UK
Jonathan N. Barker
Dermatology Centre, Salford Royal NHS Foundation Trust, University of Manchester, Manchester, UK
Richard B. Warren & Catherine H. Smith

Authors

Farideh Jalali-najafabadi
Michael Stadler
Nick Dand
Deepak Jadon
Mehreen Soomro
Pauline Ho
Helen Marzo-Ortega
Philip Helliwell
Eleanor Korendowych
Michael A. Simpson
Jonathan Packham
Catherine H. Smith
Jonathan N. Barker
Neil McHugh
Richard B. Warren
Anne Barton
John Bowes

Consortia

BADBIR Study Group

Catherine H. Smith, Jonathan N. Barker & Richard B. Warren

BSTOP Study Group

Nick Dand & Catherine H. Smith

Contributions

J.B. and A.B. contributed to the conception and study design and provided expert guidance, and reviewed the manuscript. F.J. designed the machine learning experiments, carried out the experiments on the data and wrote the paper. M.S., N.D., D.J., M.S., P.H.O., H.M.-O., P.H., E.K., M.S., J.P., C.S., J.B., N.M., R.B.W. contributed to the data collection, and/or Q.C. and imputation and reviewed the manuscript.

Corresponding author

Correspondence to Farideh Jalali-najafabadi.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information 1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Jalali-najafabadi, F., Stadler, M., Dand, N. et al. Application of information theoretic feature selection and machine learning methods for the development of genetic risk prediction models. Sci Rep 11, 23335 (2021). https://doi.org/10.1038/s41598-021-00854-x


Received: 03 June 2021
Accepted: 27 September 2021
Published: 02 December 2021
DOI: https://doi.org/10.1038/s41598-021-00854-x
