Likelihood contrasts: a machine learning algorithm for binary classification of longitudinal data

Machine learning methods have gained increased popularity in biomedical research during the recent years. However, very few of them support the analysis of longitudinal data, where several samples are collected from an individual over time. Additionally, most of the available longitudinal machine learning methods assume that the measurements are aligned in time, which is often not the case in real data. Here, we introduce a robust longitudinal machine learning method, named likelihood contrasts (LC), which supports study designs with unaligned time points. Our LC method is a binary classifier, which uses linear mixed models for modelling and log-likelihood for decision making. To demonstrate the benefits of our approach, we compared it with existing methods in four simulated and three real data sets. In each simulated data set, LC was the most accurate method, while the real data sets further supported the robust performance of the method. LC is also computationally efficient and easy to use.

1 0 is comparable to the linear predictor (i.e., the linear combination of covariates) found in logit models. In logit models, the linear predictor is mapped to probability score through the inverse of the logistic link function. Adopting this approach, we have It is quite naturally possible to extend LC into a multinomial classifier. In that case, where L is the number of classes. However, we only use binary classification in this paper. Implementation with mixed models. Above, z i denotes any data, encompassing n i measurements for individual i. Here, we use LC in combination with LMEs, which are versatile tools for modelling jointly the effect of covariates, confounders and sampling artefacts. In matrix notation, an LME can be defined as where y is the vector of responses, X is the matrix of covariates, β is the vector of regression coefficients, and  is the error term. A X ( ) is the covariance matrix of the measurement errors which may depend on X. An LME differs from the usual linear model because A X ( ) is not a diagonal matrix. The off-diagonal terms of A X ( ) represent correlations between the samples, and are considered as the 'mixed effects' . They arise as a result of study design or spatio-temporal vicinity of the samples.
In practice, we assume that the data for each individual i involve a disease-specific longitudinal marker y ij and a number of covariates denoted by x ij . In the sequel, we denote the j th measurement of a marker for individual i as y ij , and we model the time course of this marker as  denotes a measurement error. Here, μ σ NID( , ) 2 means normal, independent and identically distributed with mean μ and variance σ 2 . There are two models, as there are two groups of patients (δ = 1 i and δ = 0 i ). The purpose of LC is to distinguish between these groups. An LME-based LC algorithm is implemented in R and is publicly available at https://elolab.utu.fi/software/.
Connection to statistical paradigms. The decision rule of LC resembles the likelihood-ratio (LR) test and Bayesian posterior probabilities but differs from both. In this subsection, we discuss these similarities and differences. Generally speaking, as compared to LR test, our method is not based on nested models, and is thus more general. As compared to Bayesian inference, our method does not require numerical integration schemes, and is thus more efficient. Moreover, our method is suited for situations where the likelihood contribution of each individual observation cannot be calculated, as it uses the change of log-likelihood as a proxy for likelihood contribution.
In more detail, the LR test statistic is defined as This definition is very standard and can be found from statistics text books. It is based on the fact that the likelihood ratio thus defined has favourable distributional properties known as the Wilks' theorem. The LR test, however, is based on nested models. For our method, the implied test statistic is and it can be calculated for disjoint models. In parallel with this, Bayesian posterior probabilities typically concern the whole data, as the purpose is to choose the optimal model. Contrary to this, LC concerns each observation separately. To see the connection to Bayesian inference, assume that the information in − y i 0 and − y i 1 is so great that it essentially fixes the values of θ 0 and θ 1 . Following this,  18 .) Moreover, let us assume that the observation units are independent, and thus, the likelihood is separable as Thus, it follows that Now, if one gives equal prior weights for both models, i.e. π π = M M ( ), ( ) 1/2 0 1 , it follows that π π π π π π π | = | i.e. the posterior probability of M 1 for y i coincides with our probability score, see Eq. (2). Finally, someone might ask why we use the likelihood contrasts d 0 and d 1 to classify individual i, and not just the likelihood contributions 1 . This is because in complex models, such as LME, it is not possible to calculate likelihood contributions as such. Thus, we use the likelihood contrast Comparison with other methods. We compared the performance of LC to a number of statistical and machine learning methods, including LME, linear feature extraction (LF), logit mixed-effects regression (implemented as the function GLMER in the R package lme4 19 ), Lasso, random forests (RF), support vector machines (SVM), and neural networks (NN). Among the compared methods, LME and GLMER represent statistical methods, while Lasso, RF, SVM and NN are widely used machine learning algorithms. LF uses a strategy to account for the longitudinal dimension, but relies on a standard statistical technique, logistic regression. Below, we briefly outline these methods using the notations given above. In all analyses, the task was to predict δ i on the basis of the covariates x ij and the longitudinal marker y ij . We denote the averaged value of the longitudinal marker over time by y i and averaged covariates by x i .
In LME, the marker was first modelled as 2 denotes an individual-specific random effect, and  σ NID(0, ) ij 2 2 denotes a measurement error. Then, we averaged the estimated values ŷ ij over time = … j n 1, , i and ran a logit regression of δ i on the averages. We fitted the LME models using the R package nlme version 3.1-131.1 12 .
In LF, we first ran a linear regression of y ij on t ij within each individual i to obtain individual-specific slopes and intercepts, denoted by b i and a i , respectively, www.nature.com/scientificreports www.nature.com/scientificreports/ The GLMER model was constructed on δ x y t ( ; , , ) i ij ij ij using the R package lme4 version 1.1-15 19 . The RF model was constructed on δ x y t ( ; , , ) i i i i using the R package randomForest version 4.6-12 21 . An SVM (more precisely, epsilon regression) was constructed on δ x y t ( ; , , ) i i i i using the R package e1071 version 1.6-8 22 .
An artificial neural network was constructed with one hidden layer on δ x y t ( ; , , ) i i i i using the R package nnet version 7.3-12 23 .
Methods LME, Lasso, RF, SVM and NN involved averaging of values before modelling. We also implemented the methods without averaging by considering each time point as a separate measurement. We denote these methods by LME2, Lasso2, RF2, SVM2 and NN2.
All machine learning models were built using default parameters. Internal cross validation was used to determine coeffiecients for the logistic model and the penalty factor in Lasso. RF implemented Breiman's random forest algorithm using 500 trees with sample replacement. In SVM, support vectors were defined using epsilon regression with ε = . 0 1. NN used one hidden layer and the number of units in the hidden layer was determined to be half of the number of variables.
Note that methods LC, LF, LME, LME2, Lasso2, RF2, SVM2, NN2 and GLMER use information for each time point, while methods Lasso, RF, SVM and NN use information averaged over time points per subject. Out of the compared methods, only LC, LF and GLMER directly make prediction for a new subject, while the other methods create a prediction for a single time point. For these methods, we made predictions for each time point for each subject, and averaged the predictions to obtain a single prediction for each subject.
Model evaluations. We compared the performance of the different methods on the basis of their binary predictions for test data, using cross validations as explained in the sequel. We truncated the probability scores given by the different models into binary predictions by using 0.50 probability as the cut-off and then assessed the performance of the binary predictions by calculating sensitivity and specificity. We considered 0.50 as the baseline value of sensitivity and specificity, assuming that a completely uninformative classifier is equally likely to classify the subjects as cases or controls. We used Wilcoxon's rank sum test to compare the sensitivity and specificity obtained from each method to the baseline values. Different methods were compared using paired Wilcoxon's rank sum test. To account for multiple testing, we applied Benjamini-Hochberg false discovery rate (FDR) correction 24 to the Wilcoxon's rank sum test P-values.
Additionally, we also present the F1 scores, accuracies and receiver operating characteristic (ROC) curves for each method and data set in Supplementary material.

Materials
To evaluate LC along with existing predictive methods we used simulated and real data.
Simulated data. In the simulated data, we considered one static covariate, the 'treatment' denoted by x i , and one longitudinal marker denoted by y ij . We assumed here that the distributional form of the marker differed between cases and controls, and thus, y ij was informative regarding the case-control label δ i . Altogether, we considered four different scenarios described in detail below.
In each scenario, the individuals were equally likely to be cases or controls. We assumed four time points per individual ( = n 4 i ) and we assumed x i to be Bernoulli distributed with parameter value 0.5 and t ij to be uniformly distributed on the interval − ( 1, 1), i.e. we assumed that the treatment was allocated randomly and the time was measured relative to the event. In each scenario, 1,000 replicate data sets were generated to control for the sampling variation.
In Scenario 1, we assumed that the cases and controls reacted differently to the treatment and also that the natural course of the marker was different between the groups. Thus, we specified the model as In Scenario 2, we assumed that the distribution of y ij was more similar between the cases and controls than in Scenario 1. We assumed that the controls did not react to the treatment and the natural course of the marker was similar between the groups, albeit milder in controls. Thus, we specified the model as  www.nature.com/scientificreports www.nature.com/scientificreports/ Real data. We used two clinical data sets and one high-throughput molecular data set. In each of these real data sets, we used 2/3 of the data set as training data and predicted the labels in the remaining 1/3 to assess the performance of the different methods. To control for sampling variation in model evaluation, we repeated the process 1,000 times which diminished the standard errors more than sufficiently (<0.01 for variables on a scale of 0-1).
Clinical data sets. The two clinical data sets (Pbc2 and Prothro) were chosen from Rizopoulos 14 , distributed in the R package JM (version 1.4-7) 15 . The sample sizes of the real data sets are summarised in Table 1. In both clinical data sets, δ = 1 i means death and δ = 0 i staying alive. The Pbc2 data set was from a study on primary biliary cirrhosis 25 . The longitudinal marker in this data set was logarithm of blood bilirubin (mg/dl) over time and we used the drug (placebo or D-penicillamine) as a static covariate. Time (years) and bilirubin values were scalar numbers and drug status had a binary value.
The Prothro data set was from a study of liver cirrhosis 26 . The longitudinal marker was prothropin level and we used the treatment (placebo or prednisone) as a static covariate. Time (years) and prothrombin levels were scalar numbers, and treatment status was a binary variable.
For Pbc2 and Prothro, we used the LME regression (in methods LC and LME) motivated by Rizopoulos 14 as High-throughput molecular data set. The high-throughtput molecular data set was from the Finnish Type 1 Diabetes Prediction and Prevention (DIPP) study 16 and involved preprocessed mRNA expression levels measured on Affymetrix Human Genome U219 microarray 17 . There was a total of 49,386 probes corresponding to different genes in the data that were measured over time. The data were downloaded from The National Center for Biotechnology Information webpage (https://www.ncbi.nlm.nih.gov/) using Gene Expression Omnibus identifier GSE30211. Each probe was z-scored. The mRNA data consisted of two separate data sets: seroconverted children and progressors. The data set of seroconverted children contained samples from subjects close to the time when diabetes-related autoantibodies developed. Samples of the progressors' data set were concentrated close to the diagnosis of diabetes. Here, we focused on the clinical phenotype, i.e. progressors, and defined the case-control label as diagnosed (δ = 1 i ) or not diagnosed (δ = 0 i ) with Type 1 diabetes. To select the probes to be used as the longitudinal covariates, we first used the data from seroconverted children. For each seroconverted child, the first four follow-up samples were selected. Probes with median expression lower than the median of median expressions (5.47) were excluded. The remaining 24,693 probes were ranked in two ways: 1., a ranking based on Wilcoxon's rank sum test P-value between cases and controls for all samples, and 2., a ranking based on Wilcoxon's rank sum test P-value between cases and controls for subjectwise median values. The two rankings were combined by taking the average rank and top five probes were selected. The selected probes were 11751509_a_at, 11723996_a_at, 11759536_a_at, 11733701_a_at and 11748922_x_at, and they mapped to genes RCN1, GLCCI1, TTC17, FKBP11 and NSMF, respectively. For simplicity, we will refer to the probes by using their gene symbols. No static covariates were used.
To construct the predictive models we used the data set progressors. For methods LC and LME, we used age as marker y ij and top 5 probes as longitudinal covariates x ij as For the other methods, we trained the models by using averaged information from the 5 top probes and age.

Results
In this section, we represent the results for four scenarios of simulated data and three real data sets. In the simulated data, we considered one static covariate and one longitudinal marker. The four simulation scenarios differed in the distributions of the markers and in sample sizes. The real data sets contained two clinical data sets and one high-throughput molecular data set. We emphasise that we used multiple simulation replicates for simulated data, and exhaustive cross validation for real data. We compared the performance of LC to a number of statistical and machine learning methods, including LF, LME, GLMER, Lasso, RF, SVM and NN, in terms of their sensitivity and specificity. Results for the methods LME2, Lasso2, RF2, SVM2 and NN2 are collected in the Supplementary material. Additionally, we present the F1 scores, accuracies and the receiver operating characteristic (ROC) curves for each method in the Supplementary material. www.nature.com/scientificreports www.nature.com/scientificreports/ Simulated data. In the simulated data, all methods had fairly good performance (Fig. 1, Supplementary Figs. 1-3, and Supplementary Table 1). All other combinations of methods and scenarios had highly significant sensitivity (P < 1.0 × 10 −6 ) and specificity (P < 1.0 × 10 −6 ) compared to the baseline value 0.5, except for sensitivity of Lasso in Scenario 2 (P = 0.90). Thus, it seems that Lasso was not able to distinguish between the cases and controls on the basis of the overlapping distributions of the temporal averages. In all scenarios, LC was the best method in terms of both sensitivity (in each pairwise comparison P < 1.0 × 10 −6 ) and specificity (in each pairwise comparison P < 1.0 × 10 −6 ), closely followed by LF. The F1 scores, accuracies and ROC curves supported similar conclusions (see Supplementary Table 1, and Supplementary Figs. 2 and 3).

Number of controls
The simulated scenarios differed from each other so that Scenarios 1 and 3 had greater distinction between cases and controls than Scenarios 2 and 4. On the other hand, Scenarios 3 and 4 had more samples than Scenarios 1 and 2. As expected, most methods achieved the best results in Scenarios 3 and 4, which had higher numbers of training samples (Fig. 1, Supplementary Fig. 1). However, for some methods (LC and LME), the difference between Scenarios 1 and 3, i.e. a difference attributable to sample size, was very small. Regarding the effect of the sample distributions, better results were achieved in Scenario 1 than in Scenario 2, as Scenario 2 had a smaller difference between the sample groups. A similar observation holds for Scenarios 3 and 4, as expected (Fig. 1, Supplementary  Fig. 1). Methods LME2, Lasso2 and RF2 had similar performance compared to the corresponding averaged methods LME, Lasso and RF. In Scenarios 1 and 3, methods SVM2 and NN2 outperformed SVM and NN.
To conclude, the results obtained from the simulated data sets demonstrated that all methods could deliver meaningful results, and LC had a very good performance, as compared to the twelve other contemporary approaches tested.
Real data. Although all the methods tested here performed well in the simulated data sets, this pattern changed when we moved to the real data sets (Fig. 2, Supplementary Figs. 4-6, and Supplementary Table 2). While LC and RF had both specificity and sensitivity highly significantly over 0.50 in all three real data sets (P < 10 −6 ), this was not the case for any of the other methods. Instead, all the other methods had either sensitivity or specificity below 0.55 in at least one data set. The Prothro and DIPP data sets turned out to be the hardest to predict in terms of sensitivity and specificity. In the Prothro data, LC achieved sensitivity of 0.65 and specificity of 0.70, and in the DIPP data, sensitivity of 0.84 and specificity of 0.63. The relative difficulty of the real data sets was also seen in the F1 values, accuracies and ROC curves (Supplementary Table 2, Supplementary Figs. 5 and 6). LC was the only method that obtained accuracy and F1 value higher than 0.7 in all real data sets (Supplementary  Table 2). In all real data sets, methods LME2, Lasso2, RF2, SVM2 and NN2 were slightly outperformed by the corresponding averaged methods.
Based on these results, one may conclude that LC was the only robust method among the tested thirteen, as it did not fail in any data set analysed in this study.

Discussion and Conclusions
In this study, we examined thirteen methods for binary classification of longitudinal data with non-aligned time points, which is a common scenario in biomedical studies. (Most of these methods needed to be adjusted on an ad-hoc basis to acknowledge for the longitudinal nature of the data, i.e. we used temporal averages of covariates. However, this was not the case for our method.) We introduced the method of likelihood contrasts (LC) and compared its performance to the twelve other approaches, using simulated data and three real data sets. In the simulated data sets, LC clearly outperformed the other methods in terms of sensitivity and specificity (P < 1.0 × 10 −6 ). In the real data sets, the performance of all methods was lower than in the simulated data. However, unlike most of the other methods, LC provided reasonable classification performance also in all the real data sets (specificity and sensitivity significantly over 0.50, P < 10 −6 ), demonstrating its robustness over the other methods.
Another benefit of LC is that it can be generalised to any analysis scenario consisting of two models and a measure of model fit. For example, we used the difference of log-likelihood to measure the agreement between a new observation and the pre-existing data. In a non-parametric setting, one could use some other measure of model fit, such as the difference of mean squared errors.
Presently, we have used LC in combination with two LMEs. This derives from previous modelling tradition for longitudinal data 6,7 , but could also be changed. For example, the log-likelihood could be derived from a non-linear regression with time. However, a benefit of the LME framework is that it is fairly general, and the theory of these models is well-known. Moreover, the LME framework can easily be extended to allow for a wider range of applications. For example, it is possible to use penalised random-effects models for automated model choice 4 .
There are multiple studies comparing different binary classifiers for biomedical single time-point data. For example, Khondoker et al. 27 compared four classification methods using simulated and real data sets. They concluded that linear discriminant analysis gave the best results for small data sets and SVM for data sets with more than 20 samples. Johnson et al. 28 compared six classifiers in RNA-seq data. They found random forest to be the www.nature.com/scientificreports www.nature.com/scientificreports/ best method, and transcript-level data to be better suited for classification than gene-level data. Babu et al. 29 studied the effect of feature selection for classification methods in cancer. They found that feature selection substantially improved the performance of classification, and in their study, SVM was one of the best methods together with Relief-F 30 and information gain 31 .
Given the prevalence of longitudinal data sets in biomedicine, it is surprising that there are so few longitudinal binary classifiers. Longitudinal time-series experiments using DNA microarrays have already been performed for more than a decade [32][33][34] . In line with this, various statistical methods have been developed to detect the differentially expressed genes between experimental groups in the longitudinal data. For example, MaSigPro 32 can be used to analyse inter-group differences by fitting polynomials of various degrees to expression data. The moderated F-test in limma 35 can be used to discover differentially expressed genes by considering intergroup contrasts at different time points. Approaches relying on Bayesian statistics for detecting longitudinal differential gene expression have also been developed 33,34 . However, none of these methods directly addresses the question of classifying the longitudinal samples in two distinct groups, e.g. in patients and healthy controls. This is a shortcoming which we try to address by the proposed LC method.
Regarding binary classification of longitudinal data, there are some methods which operate on aligned time points 4 . However, a fully flexible model such as LC has not been developed before. Consequently, it is difficult to judge the performance of LC against a pre-existing baseline. Furthermore, regarding binary classification in static settings, earlier studies have not been able to highlight a single generally best method. For example, Pirooznia et al. 36 used eight machine learning algorithms in eight microarray gene expression data sets, and they found SVM to have the best performance. In line with this, Castillo et al. 37 analysed RNA-sequencing and microarray data sets, finding SVM to be more accurate than RF or nearest-neighbour classification. However, Bienkowska et al. 38 used SVM and RF in combination with iterated feature selection in gene expression data. They found RF to outperform SVM in three out of four cases, in terms of area under the ROC curve. Such comparisons are numerous and can be found in many application areas 39,40 .
The two real clinical data sets used in our study were taken from Rizopoulos 14 . As our primary purpose was to compare the performance of the different classifiers, we used the same covariates and markers as the earlier studies 14 . In the Type 1 Diabetes data 17 , the choice of the predictive markers was less obvious. In these high-throughput molecular data, there were initially 49,386 measured probes. Following observations from previous studies, we filtered these features. For example, Pirooznia et al. 36 have reported that up to 10% losses in predictive accuracy can be expected, if all features are used in place of an optimal feature set.
To conclude, results obtained in this study suggest that LC can be used as an accurate binary classifier in longitudinal data. LC outperformed the conventional machine learning methods in the simulated data. Although the three real data sets proved to be more difficult to predict correctly than the simulated data, LC was able to deliver statistically significant predictions in all data sets.

Data availability
The data sets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.