A personalised approach for identifying disease-relevant pathways in heterogeneous diseases

Numerous time-course gene expression datasets have been generated for studying the biological dynamics that drive disease progression; and nearly as many methods have been proposed to analyse them. However, barely any method exists that can appropriately model time-course data while accounting for heterogeneity that entails many complex diseases. Most methods manage to fulfil either one of those qualities, but not both. The lack of appropriate methods hinders our capability of understanding the disease process and pursuing preventive treatments. We present a method that models time-course data in a personalised manner using Gaussian processes in order to identify differentially expressed genes (DEGs); and combines the DEG lists on a pathway-level using a permutation-based empirical hypothesis testing in order to overcome gene-level variability and inconsistencies prevalent to datasets from heterogenous diseases. Our method can be applied to study the time-course dynamics, as well as specific time-windows of heterogeneous diseases. We apply our personalised approach on three longitudinal type 1 diabetes (T1D) datasets, where the first two are used to determine perturbations taking place during early prognosis of the disease, as well as in time-windows before autoantibody positivity and T1D diagnosis; and the third is used to assess the generalisability of our method. By comparing to non-personalised methods, we demonstrate that our approach is biologically motivated and can reveal more insights into progression of heterogeneous diseases. With its robust capabilities of identifying disease-relevant pathways, our approach could be useful for predicting events in the progression of heterogeneous diseases and even for biomarker identification.

Dataset from Ferreira et al. [2014] For assessing generalisability, we performed time-course analysis using our personalised approach on a different T1D dataset comprising of infant samples published by Ferreira et al. [2014]. Using this dataset, we aimed to identify disrupted pathways during the early progression of T1D, which is one of the analyses performed on Dataset 1 from Kallionpää et al. [2014]. Specifically, we analysed time-course data from 18 T1D susceptible participants (9 cases and 9 controls) of the BABYDIET study [Hummel et al., 2011]. Ferreira et al. [2014] published data from 109 participants of the BABYDIET study of which 87 were autoantibody negative (Aab-) and 22 were autoantibody positive (Aab+). From the 22 Aab+ individuals, we chose 9 individuals as cases who were sampled before and after serconversion, and were sampled at least at three time points each. As per the reasoning provided for Kallionpää et al. [2014] dataset, for each case we found a matched control from the 87 Aab-individuals based on their time of birth, gender and sampling ages. Matching could not be done based on HLA-DQB1-conferred genetic risk class as well, since that information was not available for this dataset. Total mRNA was extracted from the peripheral blood mononuclear cells (PBMCs) isolated from venous blood samples taken at each time point from each individual and hybridised on Affymetrix Human Gene 1.1 ST exon arrays. The raw data (accession code: E-MTAB-1724) was downloaded from ArrayExpress and preprocessed using oligopackage in R. RMA normalisation was performed at a transcript level using only the core probe-sets. At the transcript level, information is abstracted away from the probe-set level into something that more resembles genes. Transcripts were then annotated with gene names using Ingenuity Pathway Analysis (IPA) before differential expression analysis.

Robustness analysis of the Gaussian process models
Gaussian processes (GPs) are known to robustly estimate missing or unobserved values as they provide confidence intervals along the estimated curves of gene expression [Kalaitzis and Lawrence, 2011]. For GP regression with the Gaussian likelihood model, the predictive distribution is also Gaussian, which provides a point prediction through its mean and uncertainty quantification through its associated variance. GPs can model the true underlying signal while being robust to outliers as well as noisy training data. We demonstrate the effectiveness of our GP modelling in predicting unobserved values (and hence the timecourse behaviour) and robustly estimating the dynamics of time-course data in the presence of missing data points. We also demonstrate the robustness of our personalised approach to noise in the data while inferring disrupted pathways.

Leave-one-out cross-validation analysis
In order to illustrate that our GP model can robustly model the dynamics of the time-course data and effectively predict the time-course behaviour, we performed a leave-one-out cross-validation on our GP model fitting using data from case-control Pair 9. Concretely, we randomly removed one data point (either from the case or control) for each probe-set of Pair 9 in Dataset 1. We then performed time-course analysis by fitting the GP models and computing the BF-scores as described in Section 3.4. Subsequently, we compared the BF-scores obtained from leave-one-out analysis to the the BF-scores obtained for case-control Pair 9 from the original analysis by computing Pearson's linear and Spearman's rank correlation values, which were 0.9712 and 0.9398, respectively. Originally, 747 genes (after mapping probe-sets to genes) were found differentially expressed (DE) (BF-score > 4) for this pair, whereas after the leave-one-out analysis, 739 genes were found DE of which 91% overlapped with the original list. This shows that the list of DE genes were similar even after randomly removing one data point from the analysis of each probe-set and demonstrates the robustness of our GP model in inferring the dynamics of the gene expression time course data. Furthermore, we used the estimated gene expression models (i.e. fitted GP model) to predict the randomly removed data point. In other words, we used the better fitting GP model (i.e. separate or joint model) depending on the computed BF-score to infer whether the original value lies in the 95% confidence interval inferred by the predictive distribution of the GP model. This can be seen as a form of cross-validation. From this analysis, we found that the original expression values of 87% of the randomly removed data points were in the inferred 95% confidence interval of the predictive distribution.

Measurement noise analysis
We simulated the effect of additional measurement noise in the gene expression data of Dataset 1 by adding random noise sampled from N (0, σ 2 ) to each probe-set expression measurement from all time points of the 6 case-control pairs, i.e. additive random noise was sampled (and added) for each value in the whole data matrix. After adding the noise, we computed the BF-scores for each probe-set and case-control pair as in Section 3.4 and determined the significance levels of each pathway using our personalised approach for time-course analysis as in Section 3.6. We chose three different variance values, σ 2 , for the noise distribution motivated by the variance observed in the data. In the original gene expression data, approximately 99.7% of the probe-sets had sample variance less than 1 across all time point measurements from all case-control pairs. Supplementary Figure 5 shows a histogram of the estimated probe-set sample variances that are below 1. We performed three separate pathway-level analyses using σ 2 values of 0.04, 0.09 and 0.25 for the noise distribution, showed in red lines in Supplementary Figure 5. These σ 2 values were chosen strategically to represent a range of noise distributions. In Dataset 1, noise variances of 0.04, 0.09 and 0.25 correspond to 12%, 57% and 95% quantiles, respectively.
After performing the pathway analyses on each noisy dataset, we compared the FDR values of all pathways obtained after adding each level of noise with different σ 2 value, to the original FDR values of the pathways using Spearman's rank correlation test in order to understand the effect of adding noise to the gene expression data. We found that the Spearman's rank correlation values were between 0.31 and 0.37, which were found to be highly statistically significant with p-value ≈ 0 (for all three added noise levels). Additionally, we calculated the Spearman's rank correlation on a shorter list of 32 disease-relevant pathways that were found enriched in the original TC analysis (highlighted in colours in Supplementary table 1). The rank correlation values were between 0.58 and 0.77 (decreased as added noise distribution's variance increased), which were also found highly statistically significant with p-values < 10 −3 . Most importantly, many relevant pathways, such as those related to T1D, antigen processing and presentation, interferon gamma signalling, etc. were found significant even after adding noise with σ 2 = 0.25. These results confirm our methods robustness to added measurement noise.

Computational complexity
Inference for GPs involves the inversion of the kernel matrix. Hence, the time complexity scales as O(N 3 ), where N is the number of time points for a probe-set. This does not pose a problem in our setting (and for most time-course gene expression datasets) as the number of measured time points for a probe-set per case-control pair is usually quite small. However, a significant amount of research has already been done to bring down the time complexity of GPs. Some popular approaches are the inducing point methods [Snelson and Ghahramani, 2006;Quinonero-Candela et al., 2007;Hensman et al., 2013], structure exploiting methods such as Kronecker or Toeplitz methods [Saatçi, 2012;Cunningham et al., 2008], and even a unifying framework called structured kernel interpolation (SKI) [Wilson and Nickisch, 2015]. Our proposed personalised approach can be trivially adapted in the GP modelling stage to accommodate these methods for scenarios where a large number of data points are present. However, in scenarios where the number of data points is not very large (i.e. the time complexity is not a significant problem) these approximations are unnecessary.
We ran our method on a compute node with an Intel Xeon X5650 2.67 GHz processor. In terms of compute time, our method requires ∼3 hours to calculate the differential expression scores for all the probe-sets and ∼8 hours to generate the permutation distribution for pathway-level inference. Rest of the computations are trivial and do not take a significant amount of compute time. The amount of computation time required can be significantly brought down by reducing the number of pathway permutations (in our case 100,000).  Supplementary Figure 4: A comparative visualisation of the DEGs between the two approaches for the reactome interferon gamma signalling pathway pathway using Dataset 2, prefixed with 'T1D-', in the WT1D analysis. A coloured dot signifies that the gene is DE in the corresponding case-control pair. MHC classes I and II are independently marked as differentially expressed for each case-control pair if at least one gene in the respective group of HLA genes is detected as differentially expressed.