Comparative performances of machine learning methods for classifying Crohn Disease patients using genome-wide genotyping data

Crohn Disease (CD) is a complex genetic disorder for which more than 140 genes have been identified using genome wide association studies (GWAS). However, the genetic architecture of the trait remains largely unknown. The recent development of machine learning (ML) approaches prompted us to apply them to classify healthy and diseased people according to their genomic information. The Immunochip dataset, containing 18,227 CD patients and 34,050 healthy controls enrolled and genotyped by the International Inflammatory Bowel Disease Genetics Consortium (IIBDGC), was re-analyzed using a set of ML methods: penalized logistic regression (LR), gradient boosted trees (GBT) and artificial neural networks (NN). The main score used to compare the methods was the Area Under the ROC Curve (AUC) statistic. Analysis of the impact of quality control (QC), imputation and coding methods on LR results showed that QC methods and imputation of missing genotypes may artificially increase the scores. By contrast, neither the patient/control ratio nor marker preselection or coding strategies significantly affected the results. LR methods, including Lasso, Ridge and ElasticNet, provided similar results with a maximum AUC of 0.80. GBT methods such as XGBoost, LightGBM and CatBoost, together with dense NN with one or more hidden layers, provided similar AUC values, suggesting limited epistatic effects in the genetic architecture of the trait. Among the best predictors, ML methods detected nearly all the genetic variants previously identified by GWAS, plus additional predictors with lower effects. The robustness and complementarity of the different methods are also studied. Compared to LR, non-linear models such as GBT or NN may provide robust complementary approaches to identify and classify genetic markers.


Details about hyper-parameter optimization, settings and architectures
The hyper-parameters of the different models were optimized through standard 10-fold cross-validation (CV) on the Train set. To control overfitting, each model was trained for different sets of parameters, and the optimal values were defined as those that maximize the mean AUC over the validation folds. These optimal values were then kept for the final models, trained on the whole Train set and tested on the Test set.
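The protocol above can be sketched as follows. This is a minimal illustration on synthetic data, not the study's pipeline: the model, grid values and sample sizes are placeholders, but the logic (10-fold CV maximizing mean AUC, refit on the whole Train set, single evaluation on the Test set) matches the text.

```python
# Sketch of the CV protocol: choose hyper-parameters by mean AUC over
# 10 validation folds, refit on the whole Train set, score on the Test set.
# Synthetic data and an illustrative grid stand in for the real study.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# scoring='roc_auc' makes GridSearchCV select the parameters that
# maximize the mean AUC over the 10 validation folds.
search = GridSearchCV(
    LogisticRegression(penalty="l2", solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=10,
)
search.fit(X_train, y_train)  # refit=True: retrains on the whole Train set

test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
```

`search.best_params_` then holds the values kept for the final model, and `test_auc` is the single held-out Test score reported for it.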

Logistic Regression
We used the Scikit-Learn [1] classifier sklearn.linear_model.LogisticRegression and fixed by CV the hyper-parameter C, the inverse of the regularization strength, separately for the l1 and l2 penalizations. For the ElasticNet case, we used sklearn.linear_model.SGDClassifier and ran CV over the hyper-parameters alpha, the constant that multiplies the regularization term, and l1_ratio, the ElasticNet mixing parameter.

Dense NN with different numbers of fully connected hidden layers, all composed of 64 neurons
We used an architecture built by stacking the fully connected hidden layers in a loop over j, with the number of hidden layers as a design parameter. We optimized under CV the number E of epochs used for training.
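The original code listing is not reproduced in the text, so the sketch below uses scikit-learn's MLPClassifier as a stand-in: j hidden layers of 64 neurons each (here j = 2), with the number of epochs E as the tuned hyper-parameter. The study's own NN framework and training settings are not shown here.

```python
# Hedged stand-in for the dense architectures described above:
# n_hidden fully connected hidden layers of 64 neurons, with the
# number of training epochs E treated as the CV-tuned hyper-parameter.
import numpy as np
from sklearn.neural_network import MLPClassifier

n_hidden = 2  # two hidden layers, as one instance of the loop over j
E = 50        # number of epochs, optimized under CV in the study

model = MLPClassifier(hidden_layer_sizes=tuple(64 for _ in range(n_hidden)),
                      max_iter=E, random_state=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model.fit(X, y)

# One weight matrix per layer transition: input->h1, h1->h2, h2->output.
print(len(model.coefs_))
```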

Gradient Boosting Trees
We used the classifiers of XGBoost [4] ('objective':'binary:logistic'), LightGBM [5] ('objective':'binary', 'metric':'binary_logloss') and CatBoost [6] (CatBoostClassifier, loss_function='Logloss'). All these models have tens of parameters to tune, but we restricted CV to those known to affect the final score the most, leaving the others at their default values. We list below the hyper-parameter values that gave the best mean AUC score under CV.

Complementary results on feature importance selection
In the main text, we have studied the importance of the features selected by the different models when the same criterion (permutation feature importance score) is used or when other ranking scores (weights or gain) are considered. In this section, we show some complementary results concerning the problem of determining the most important features for the Crohn case-control study.

Dependence on the number of permutations
The permutation feature importance score is obtained by randomly permuting the values of each feature, one at a time, on the test set. The larger the deviation from the original AUC, the higher the rank importance of the feature. However, this result is not very reliable at the level of a single permutation, because random effects related to the limited sample size can affect the score. The final score for each feature is therefore more consistent when obtained by averaging the scores given by several different permutations. In Fig. S2 we show the results of the Spearman rank test on logistic regression with Lasso regularization, LightGBM and the ResDN3 neural network, respectively for the first 50 and 400 loci, as the number of permutations per feature over which we averaged is varied. One can see that around N = 10 permutations the results converge to stable values.

Fig. S2. We show the Spearman Rank test coefficient r_s respectively for the best 50 (panel A) and 400 (panel B) loci, as a function of the number N of permutations per feature. The feature ranks are given after averaging over the N permutation feature importance scores. We show the results for logistic regression with Lasso regularization (LR), LightGBM (LGBM), and a dense residual neural network with 3 hidden layers (ResDN3). We consider the intersection between the same kind of models, when trained on two different subsets of the data. Solid lines represent the mean values over all the couples of subsets (10 subsets, for a total of 45 couples), while shaded regions represent the 1 standard deviation confidence intervals.
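The averaged permutation importance described above can be sketched as follows, on synthetic data with an illustrative model. Scikit-learn also offers this as `sklearn.inspection.permutation_importance`; the explicit loop here just mirrors the description in the text.

```python
# Averaged permutation feature importance: shuffle each feature N times
# on the test set and record the mean drop in AUC from the baseline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
base_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

rng = np.random.default_rng(0)
N = 10  # around N = 10 the rankings stabilize in Fig. S2
importances = np.zeros(X.shape[1])
for j in range(X.shape[1]):
    drops = []
    for _ in range(N):
        X_perm = X_te.copy()
        rng.shuffle(X_perm[:, j])        # permute this feature only
        auc = roc_auc_score(y_te, model.predict_proba(X_perm)[:, 1])
        drops.append(base_auc - auc)     # larger drop = more important
    importances[j] = np.mean(drops)      # average over the N permutations

ranking = np.argsort(importances)[::-1]  # most important features first
```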

Spearman Rank Test
In the main text, we compared the stability of the feature importance scores for the different models under study in terms of the robustness R. In Fig. S3 we show the same results in terms of the Spearman rank test coefficient r_s, as a function of the first x best loci. In Fig. S3A, for a given model, we computed r_s for the ranks given by the same criterion applied to two different subsets of training data. To compare the results for two different criteria when the same model is trained on a given subset of data, Fig. S3B shows the r_s coefficients comparing LR with PFI and weight, and LGBM with PFI and gain.
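The consistency check itself reduces to a Spearman rank correlation between two importance vectors over the same loci. The sketch below uses synthetic stand-in scores for two training subsets; only the use of `scipy.stats.spearmanr` mirrors the analysis.

```python
# Spearman r_s between the importance ranks assigned to the same loci by
# two training subsets (or two criteria). Scores here are synthetic.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
true_effect = rng.exponential(size=400)  # 400 loci, as in the text

# Two noisy measurements of the same underlying importances,
# standing in for the scores from two subsets of the training data.
scores_a = true_effect + rng.normal(scale=0.1, size=400)
scores_b = true_effect + rng.normal(scale=0.1, size=400)

r_s, pvalue = spearmanr(scores_a, scores_b)  # rank correlation in [-1, 1]
```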

[Figure: Spearman coefficients as a function of the number of permutations, panels A and B.]
In Fig. S3C we studied the consistency of the rankings given by two models (with a given criterion) on the same subset of training data. Finally, in Fig. S3D we evaluated the consistency of combined models on couples of subsets of training data, as defined in the main text. All in all, the results shown are qualitatively the same as those of Fig. 5 in the main text, but on a different scale, indicating the interchangeability of the robustness R and the Spearman r_s coefficient as measures of the consistency of the feature importance ranking.

Fig. S3. Comparison between permutation feature importance and other ranking scores.
We show the Spearman Rank test coefficient r_s as a function of the first x best loci. In panel A we consider the robustness of a given model/criterion when trained on two different subsets of the data. In panel B we show the robustness of the same model when two different criteria are considered on the same subset of the dataset. In panel C we compare two different models/criteria on the same subset of the dataset. Finally, in panel D we show the same analysis as panel A for combinations of models. Solid and dotted lines represent the mean values of the robustness distributions, in panels A and D over all the couples of subsets (10 subsets, for a total of 45 couples), and in panels B and C over all the subsets (10 subsets, for a total of 10 comparisons). Shaded regions represent the 1 standard deviation confidence intervals.
We summarized the results given by the Spearman Rank Test for the 400 best loci in the heat-maps of Fig. S4. In particular, we separated the results given by different models with the same feature selection criterion (Fig. S4A) from the mixed scenarios (Fig. S4B). Once again, the results were very similar to those obtained in terms of robustness R in Fig. 3A.

Table S1. Lists of the 400 most important loci.
We list here the 400 most important loci selected, averaged over 10 different folds of the dataset, respectively by Lasso regularization with the weight criterion (LR weight), Lasso regularization with the permutation feature importance criterion (LR PFI), LightGBM with the gain criterion (LGBM gain), LightGBM with the permutation feature importance criterion (LGBM PFI), and a dense residual neural network with 3 hidden layers with the permutation feature importance criterion (ResDN3 PFI). For each locus, we indicate the chromosome and the center of the corresponding 500kb window (250kb on each side), in hg19 coordinates.