Uncovering pseudotemporal trajectories with covariates from single cell and bulk expression data

Pseudotime algorithms can be employed to extract latent temporal information from cross-sectional data sets allowing dynamic biological processes to be studied in situations where the collection of time series data is challenging or prohibitive. Computational techniques have arisen from single-cell ‘omics and cancer modelling where pseudotime can be used to learn about cellular differentiation or tumour progression. However, methods to date typically implicitly assume homogeneous genetic, phenotypic or environmental backgrounds, which becomes limiting as data sets grow in size and complexity. We describe a novel statistical framework that learns how pseudotime trajectories can be modulated through covariates that encode such factors. We apply this model to both single-cell and bulk gene expression data sets and show that the approach can recover known and novel covariate-pseudotime interaction effects. This hybrid regression-latent variable model framework extends pseudotemporal modelling from its most prevalent area of single cell genomics to wider applications.


Data retrieval and preprocessing

Single cell analysis of primary mouse bone-marrow-derived dendritic cells
Single cell gene expression data for primary mouse bone-marrow-derived dendritic cells [1] were downloaded as both raw counts and TPM values from the conquer (consistent quantification of external RNA-seq data) project [2], which uses Salmon [3] for quantification. Cells exhibiting a low number of mapped reads or expressed genes (Supplementary Fig. 1) were removed.
In the original publication [1], a number of cluster disrupted cells that exhibit high expression of Serpinb6b and low expression of Lyz1 were identified. We therefore removed any cells whose log2(TPM + 1) expression of Serpinb6b exceeded 2.5 and in which Lyz1 was not expressed (Supplementary Fig. 2).
Finally, there was large variation in the number of genes expressed in each cell (Supplementary Fig. 3a), which correlated highly with the first principal component of the data (Supplementary Fig. 3b). Such an effect is known to be caused by underlying technical factors in single-cell RNA-seq data [4]. We therefore removed it using the normaliseExprs function in the R package Scater [5]. For the final analysis we retained 7,500 highly variable genes.

The Cancer Genome Atlas
For both COAD and BRCA studies, TPM matrices were retrieved from a recent transcript-level quantification of the entire TCGA study [6]. Clinical metadata, including the phenotypic covariates used in PhenoPath, were retrieved using the RTCGA R package [7]. Transcript level expression estimates were combined to gene level expression estimates using Scater [5].
A PCA visualisation of the COAD dataset (Supplementary Fig. 4a) showed two distinct clusters based on the plate of sequencing. Rather than attempt to correct such a large batch effect, we retained samples with a PC1 score of less than 0 and a PC3 score greater than -10, and removed any samples of "normal" tissue type. For input to PhenoPath we used the 4,801 genes whose median absolute deviation in log(TPM + 1) expression was greater than 1/2. A PCA visualisation of the BRCA dataset (Supplementary Fig. 4b) showed a loosely dispersed outlier population that separated on the first and third principal components. We performed Gaussian mixture model clustering using the R package mclust [8], and removed samples designated as cluster 2 in Supplementary Fig. 4b, giving 1,135 samples for analysis. For input to PhenoPath we used the 4,579 genes whose variance in log(TPM + 1) expression was greater than 1 and whose median absolute deviation was greater than 0.

Model Specification
The overall generative model for PhenoPath is given by the following Bayesian hierarchical construction, where the default values of the free hyperparameters are shown in Supplementary Table 2.
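Concretely, writing y_ng for the expression of gene g in sample n with covariates x_n, the hierarchical construction can be sketched as follows, consistent with the mean function and conditional distributions derived below (the hyperparameter names τ_α, a_β, and b_β follow the robustness analysis later in this document; the Gaussian priors on μ_g and λ_g are omitted for brevity):

```latex
\begin{align*}
y_{ng} \mid z_n, \alpha, \beta, \lambda, \mu, \tau
  &\sim \mathcal{N}\left(\mu_{ng},\, \tau_g^{-1}\right),\\
\mu_{ng} &= \mu_g + \sum_{p=1}^{P} \alpha_{pg} x_{np}
  + z_n \left(\lambda_g + \sum_{p=1}^{P} \beta_{pg} x_{np}\right),\\
z_n &\sim \mathcal{N}(0, 1), \qquad
\alpha_{pg} \sim \mathcal{N}(0, \tau_\alpha^{-1}),\\
\beta_{pg} &\sim \mathcal{N}(0, \chi_{pg}^{-1}), \qquad
\chi_{pg} \sim \mathrm{Gamma}(a_\beta, b_\beta), \qquad
\tau_g \sim \mathrm{Gamma}(a, b).
\end{align*}
```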

Co-ordinate ascent variational inference
We perform co-ordinate ascent mean-field variational inference with a fully factorised approximating distribution Q(Θ). Due to the model's conjugacy, the optimal update for each parameter θ_j given all other parameters θ_−j can easily be computed via

q*(θ_j) ∝ exp( E[ log p(θ_j | θ_−j, Y) ] ),

where the expectation is taken with respect to the variational density over θ_−j. We must therefore calculate each conditional distribution followed by its expectation.
The conditional distributions are given below (where θ | · denotes the conditional distribution of the variable θ given all other variables and the data). For simplicity we assume the range of each summation is clear from its index (i.e. Σ_p ≡ Σ_{p=1}^P, etc.).

Conditional distribution of z
where k_ng = λ_g + Σ_p β_pg x_np.

Conditional distribution of α pg
where μ̃_ng^(α_p) = μ_g + z_n (λ_g + Σ_p β_pg x_np) + Σ_{p′≠p} α_{p′g} x_{np′} (6), in which Σ_{p′≠p} denotes the summation over 1 to P excluding p.
where μ̃_ng^(β_p) is defined analogously to μ̃_ng^(α_p) above.

Conditional distribution of χ_pg

Conditional expectation of z

Conditional expectation of α pg
Conditional expectation of χ_pg

To assess convergence of the CAVI algorithm we calculate the evidence lower bound (ELBO) at every iteration (or every i-th iteration). The ELBO is given by

ELBO = E[log p(Y|Θ)] + E[log p(Θ)] − E[log Q(Θ)],

where all expectations are taken with respect to the approximating distribution Q(·) and Θ denotes the full parameter set. Note that we are implicitly conditioning on the data wherever appropriate, so p(Y|Θ) ≡ p(Y|Θ, X).
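As an illustration of how conjugate co-ordinate updates of this kind behave in practice, the sketch below runs CAVI on a toy Normal-Gamma model rather than the full PhenoPath model, and monitors parameter change rather than the ELBO for brevity; the data, priors, and variable names are illustrative only.

```python
def cavi_normal_gamma(x, mu0=0.0, lam0=0.01, a0=1.0, b0=1.0,
                      tol=1e-8, max_iter=100):
    """CAVI for x_i ~ N(mu, 1/tau) with conjugate priors
    mu | tau ~ N(mu0, 1/(lam0 * tau)) and tau ~ Gamma(a0, b0),
    under the factorised approximation q(mu, tau) = q(mu) q(tau)."""
    n = len(x)
    xbar = sum(x) / n
    e_tau = a0 / b0                               # current estimate of E[tau]
    a_n = a0 + (n + 1) / 2.0                      # fixed across iterations
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)   # also fixed
    for _ in range(max_iter):
        # update q(mu) = N(mu_n, 1/lam_n): precision scales with E[tau]
        lam_n = (lam0 + n) * e_tau
        # update q(tau) = Gamma(a_n, b_n) using expectations under q(mu)
        sq = sum((xi - mu_n) ** 2 for xi in x) + n / lam_n
        b_n = b0 + 0.5 * (sq + lam0 * ((mu_n - mu0) ** 2 + 1.0 / lam_n))
        e_tau_new = a_n / b_n
        if abs(e_tau_new - e_tau) < tol:
            e_tau = e_tau_new
            break
        e_tau = e_tau_new
    return mu_n, lam_n, a_n, b_n
```

Here mu_n does not depend on E[tau], so the co-ordinate scheme converges in a handful of iterations; the PhenoPath implementation instead monitors the percentage change in the ELBO.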

Derivation of E[log p(Y|Θ)]
Writing μ_ng = μ_g + Σ_p α_pg x_np + z_n (λ_g + Σ_p β_pg x_np), and with f_ng defined as above, we drop additive constant terms since we are only concerned with changes in the ELBO.

Derivation of E[log p(Θ)]
We consider E[log p(z_n)], which generalises to all parameters with Gaussian priors, and E[log p(τ_g)], which generalises to all parameters with Gamma priors; summing the resulting expressions across all parameters, up to a constant, gives E[log p(Θ)]. Similarly, we consider E[log q_z(z_n)], which naturally generalises to all parameters whose approximating distributions are Gaussian, and E[log q_τ(τ_g)], which generalises to all parameters whose approximating distribution is Gamma; summing across all parameters gives E[log Q(Θ)].

Simulation Setup
We sought to quantify the extent to which having a joint model of pseudotimes and interactions aids identification of the interactions, and the extent to which such interactions confound trajectory inference using traditional methods. Mathematically, PhenoPath infers the posterior distribution p(z, β | Y) of the pseudotimes z and interaction parameters β given the data Y. The two-step procedure we compare against is analogous to first inferring an estimate of the pseudotimes, ẑ | Y, before inferring an estimate of the interaction parameters given the fixed pseudotimes, β̂ | ẑ, Y.
To do this we simulated RNA-seq data where a certain proportion of the genes exhibited different behaviour over pseudotime depending on the external covariate status. We then re-inferred the pseudotimes using a variety of algorithms (including PhenoPath), and performed post-hoc differential expression (DE) analysis testing for interaction effects between the trajectory and the covariate using common DE algorithms. In-depth details about the simulation, inference, and results are detailed below.

Mean function
The first consideration is how we expect expression to change over pseudotime. We model this using non-linear sigmoid functions, which have previously been used successfully to model single-cell RNA-seq expression along trajectories (see e.g. [9]). Here, the mean expression of gene g in cell n at time t_n is modelled as

μ(t_n; μ^(0)_g, k_g, δ_g) = 2 μ^(0)_g / (1 + exp(−k_g (t_n − δ_g))),

where μ^(0), k, and δ are gene-specific parameters corresponding to the half-peak expression, the rate of increase (or decrease) in expression over (pseudo-)time, and the point in pseudotime at which the rate of change is maximal, respectively. Crucially, these parameters may be the same for all cells or depend on the external covariate status of a cell, leading to interactions and ultimately different behaviour over pseudotime dependent on covariate status (Supplementary Fig. 5a). For each simulation, a certain proportion of genes were modelled as having different behaviour over pseudotime given the covariate status (the interaction). This mean function represents the behaviour of log-counts.
Noise model

To simulate RNA-seq counts we used the negative binomial (NB) distribution, the favoured noise model for RNA-seq counts (see e.g. [10,11]). Here, the counts y_ng in cell n and gene g are drawn from an NB distribution with mean μ_ng and size parameter φ, which relates the variance to the mean via Var[y_ng] = μ_ng + μ_ng²/φ. For our simulations we created both a low-noise model where φ = μ_ng/3 + 1 (corresponding to near-Poissonian noise, suggested by the authors of Polyester [12] for low-noise simulations) and a high-noise model where φ = 1, corresponding to overdispersed NB noise. The mean used for each gene and cell was 2^μ(t_n; μ^(0)_g, k_g, δ_g) − 1, where the function μ is defined above.
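A minimal sketch of this simulation scheme (the parameter values are hypothetical, and numpy's size/probability parameterisation of the NB distribution is used):

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid_mean(t, mu0, k, delta):
    """Sigmoidal mean on the log2 scale: half-peak expression mu0,
    rate k, inflection point delta."""
    return 2.0 * mu0 / (1.0 + np.exp(-k * (t - delta)))

def simulate_counts(t, mu0, k, delta, phi=1.0):
    """Draw NB counts with mean 2**mu - 1 and size phi, so that
    Var[y] = mu + mu**2 / phi."""
    log2_mu = sigmoid_mean(t, mu0, k, delta)
    mu = 2.0 ** log2_mu - 1.0
    # numpy's parameterisation: n = size, p = size / (size + mean)
    p = phi / (phi + mu)
    return rng.negative_binomial(phi, p)

t = rng.uniform(0.0, 1.0, size=200)   # pseudotimes for 200 cells
counts = simulate_counts(t, mu0=6.0, k=10.0, delta=0.5, phi=1.0)
```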
It is worth emphasising just how misspecified this simulation is with respect to the PhenoPath model:

1. The gene-specific parameters central to PhenoPath (λ, α, β) do not exist.
2. The change in expression over time is modelled by non-linear sigmoid functions.
3. The noise model is negative binomial, giving discrete, non-negative counts rather than the Gaussian noise assumed by PhenoPath.

Pseudotime inference

Pseudotime inference was performed with PhenoPath, Monocle 2, DPT, TSCAN, and Wishbone. All algorithms were run with default parameter choices, with the exception of Monocle 2 and Wishbone. In Monocle 2 we set norm_method = "none" to ensure that differing normalisation methods were not responsible for different results. In the case of Wishbone, we found poor performance using the default parameters on our simulated data. After several attempts to get the best fit, we found the best-performing settings were: the number of nearest neighbours in the diffusion map calculation set to 25% of the total number of cells, the number of nearest neighbours in the trajectory analysis set to 20% of the number of cells, the number of waypoints set to 20% of the number of cells, and using the first two diffusion components.
In all cases input data were the log-transformed counts normalised by total library size, with a pseudocount of 1 added to avoid divergence issues when x → 0.
Differential expression

Differential expression was performed with PhenoPath, limma voom [13], DESeq2, and MAST [14]. Raw counts were used for limma voom and DESeq2 (as recommended), while log-normalised values were provided to MAST, which is designed to work with log(TPM + 1). In all cases (with the exception of PhenoPath, where differential expression testing is implicit in model inference), the model expression ~ x * pseudotime (interaction) was compared to the nested model expression ~ x + pseudotime (no interaction). Default parameters were used for all methods.
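The nested-model comparison can be illustrated with an ordinary least-squares F statistic; this is a Gaussian sketch of the interaction test, not the actual limma/DESeq2/MAST machinery, and the data and effect sizes below are hypothetical.

```python
import numpy as np

def interaction_fstat(y, x, t):
    """F statistic comparing the full model y ~ x + t + x:t against
    the nested model y ~ x + t via ordinary least squares."""
    n = len(y)
    ones = np.ones(n)
    X0 = np.column_stack([ones, x, t])           # no interaction
    X1 = np.column_stack([ones, x, t, x * t])    # with interaction
    rss0 = np.sum((y - X0 @ np.linalg.lstsq(X0, y, rcond=None)[0]) ** 2)
    rss1 = np.sum((y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]) ** 2)
    df2 = n - X1.shape[1]
    # one extra parameter in the full model, so numerator df is 1
    return (rss0 - rss1) / (rss1 / df2)

rng = np.random.default_rng(0)
t = rng.uniform(size=100)
x = np.repeat([0.0, 1.0], 50)
y_int = 2.0 * t * (2.0 * x - 1.0) + rng.normal(scale=0.1, size=100)
y_null = 2.0 * t + x + rng.normal(scale=0.1, size=100)
```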

AUC calculation
Areas under the receiver-operator curves (AUCs) were calculated using the AUC R package. For differential expression analyses that report p-values, 1 − p-value was used to rank genes by the likelihood of exhibiting a covariate interaction. For PhenoPath, the probability of observing 0 under the approximating distribution q(β_pg) = N(m_βpg, s²_βpg) inversely ranks genes for covariate interactions, so 1 − this value was used as input to the AUC function.
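The AUC itself reduces to the probability that a randomly chosen interacting gene is ranked above a randomly chosen non-interacting one; a minimal rank-based implementation:

```python
def auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen positive
    outranks a randomly chosen negative (ties count one half); this is
    the Mann-Whitney statistic underlying the ROC curve."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```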
Simulation parameters

The simulation parameters are described in Supplementary Table 3. In total, 6 different percentages of genes showing interactions were simulated across 2 noise regimes and 2 different cell numbers, with 40 replicates in each condition. Pseudotime inference using 4 different algorithms was then performed on these data, with differential expression then performed using 4 different algorithms on the non-PhenoPath pseudotimes. In total this gives a large number of distinct differential expression workflows, representing a comprehensive benchmarking of PhenoPath. The overall simulation and inference workflow can be seen in Supplementary Fig. 5b.

Modelling zero-inflation
In the simulations where we modelled zero inflation, the overall strategy was to set a proportion p ∈ {0.05, 0.1, 0.2, 0.5, 0.8, 0.9} of the non-zero counts to zero. In every simulated dataset (of raw counts), there is an existing proportion b of counts that are already zero due to the negative-binomial likelihood. If there are NG counts in total (for N cells and G genes), the original number of zero counts was bNG, while the new number of zero counts is bNG + p(1 − b)NG. Therefore, we choose p(1 − b)NG of the (1 − b)NG non-zero counts to be set to zero.
Zero inflation in scRNA-seq data is a non-random process that depends on the latent true expression: as expression increases, the probability of observing a zero falls. This process can be well modelled using logistic regression [15]. Using an unpublished scRNA-seq dataset of thymic epithelial cells, we computed the empirical dropout probability of each gene (the proportion of zero reads) along with its sample mean expression, to which we fitted a logistic regression model in R, giving an intercept of 1.75 and an expression coefficient of −0.04. Therefore, when choosing the p(1 − b)NG counts to set to zero, the probability that a particular count at expression level x is selected is given by

P(set to zero | x) = 1 / (1 + exp(−(1.75 − 0.04 x))).
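A sketch of this weighted dropout scheme (the intercept and coefficient are taken from the fit described above; the sampling trick and helper names are our own):

```python
import math
import random

def dropout_prob(x, intercept=1.75, coef=-0.04):
    """Logistic dropout model: the probability of a zero falls as the
    expression level x rises (coefficients from the fit described above)."""
    return 1.0 / (1.0 + math.exp(-(intercept + coef * x)))

def add_dropout(counts, p, rng):
    """Set a proportion p of the non-zero counts to zero, sampled
    without replacement with weights given by the logistic model
    (exponential-key trick for weighted sampling)."""
    nonzero = [i for i, c in enumerate(counts) if c > 0]
    k = round(p * len(nonzero))
    keyed = sorted(nonzero,
                   key=lambda i: rng.expovariate(1.0) / dropout_prob(counts[i]))
    out = list(counts)
    for i in keyed[:k]:
        out[i] = 0
    return out
```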

Simulation study
We first compared the accuracy of pseudotime inference across the range of conditions and pseudotime algorithms. The summarised results for 200 cells in the low-noise case can be seen in Supplementary Fig. 5c. For a low fraction of genes exhibiting interactions between the covariate and latent space (5-10%), existing algorithms such as Monocle 2 perform comparably to PhenoPath. However, as the fraction of genes exhibiting interactions exceeds 10%, the ability of PhenoPath to model covariate interactions clearly becomes beneficial. The accuracy of PhenoPath's pseudotime inference is independent of the underlying fraction of genes exhibiting interactions between the covariate and latent space, while the performance of all other algorithms degrades drastically as this fraction increases. This pattern is evident across all benchmarking configurations studied (Supplementary Fig. 6). Since this fraction is unknown a priori, we suggest it is necessary to use PhenoPath or a similar covariate-adjusted latent variable model to perform inference in such cases.
We then compared the ability of each workflow to detect covariate-trajectory interactions (i.e. does the status of the covariate affect how the gene changes over pseudotime?). For each dataset generated and each pseudotime fit for each dataset we performed differential expression analyses to detect covariate-trajectory interactions. For PhenoPath pseudotime we used PhenoPath's estimate of interaction significance as this is jointly estimated along with the pseudotimes. The accuracy of each workflow's ability to detect covariate-trajectory interactions was assessed using the area under the receiver-operator curve (AUC).
The results for 200 cells under the low-noise regime, with differential expression performed using limma voom, can be seen in Supplementary Fig. 5d. On this simulated dataset (where the interactions have large effect sizes) PhenoPath, and differential expression using Monocle 2 and TSCAN pseudotimes, perform with near-perfect AUCs (median AUC across 40 replicates = 1). However, as the fraction of genes exhibiting covariate-trajectory interactions increases, this performance degrades drastically for all workflows other than PhenoPath. Indeed, this pattern is replicated across all experiments and differential expression algorithms with both high noise (Supplementary Fig. 7) and low noise (Supplementary Fig. 8). As before, given that the underlying fraction of genes exhibiting covariate-trajectory interactions is unknown a priori, we suggest it is necessary to use PhenoPath for such analyses.
We were puzzled to discover some AUCs less than 0.5, apparently showing workflows performing worse than random. We were initially worried that this was due to an error in implementation (though the fact that most workflows have AUCs near 1 when the pseudotimes are approximately correct acts as a positive control), and so investigated one of the worst-performing cases, where pseudotime was inferred with Monocle 2 and differential expression performed with MAST.
In this dataset we examined one gene that had no interaction simulated but which the workflow identified as showing an interaction. The expression is shown against the true and inferred pseudotimes in Supplementary Fig. 9a and b respectively. It demonstrates that the inferred pseudotime in fact runs from samples with one covariate status into those of another. The fact that Monocle 2 attempts to find a trajectory with a degree of smoothness means these are joined at a "peak" in expression, resulting in the gradient of change over pseudotime of the gene being different for each covariate status, and thus an interaction being inferred. Essentially, Monocle 2 has taken the pseudotime for one covariate status, flipped it, and rejoined the two segments where the expression is most similar.
We can then examine what happens to genes with covariate-dependent regulation along the trajectory, an example of which is shown in Supplementary Fig. 9c. The consequence of Monocle 2 reversing the pseudotimes for samples corresponding to one covariate status is that samples for both covariate statuses now have identical gradients over pseudotime, meaning no interaction is picked up using standard differential expression analyses. Therefore, we see that performing separate pseudotime-then-DE workflows for covariate-trajectory interactions not only reduces the power to detect interactions, it can in fact perform worse than random guessing due to the characteristics of the trajectories inferred.
It is also worth emphasising the ease with which each analysis is performed. PhenoPath performs trajectory inference and detection of covariate-trajectory interactions with a call to a single R function. All other workflows we tested required the implementation of a bespoke bioinformatics pipeline.

Robustness to initialisation and hyperparameters
We sought to identify how robust PhenoPath is to the choice of hyperparameters and the initialisation of the latent space z. An exhaustive analysis of all configurations is impractical, so we varied four different quantities and compared the correlation of the inferred pseudotimes to those obtained using the default PhenoPath values (section 1.2) for the mouse dendritic cell dataset [1]. Note that none of the values tested coincides with the defaults. The four quantities were:

1. The percentage change in the ELBO for optimisation to be considered converged, ranged across the values 10^-3, 10^-4, 10^-6. Note that since PhenoPath implements co-ordinate ascent variational inference with exact updates for each parameter, this is the only free choice with respect to the variational inference procedure.
2. The hyperparameter τ_α, with values 0.1, 2, 5.
3. The initialisation of z, taken to be either the capture times (scaled to have mean 0 and standard deviation 1), the initialisation from a different pseudotime algorithm (Monocle 2), or the first principal component of the data corrupted by additive N(0, 1) noise.
4. The quantity a_β/b_β, which controls the mean of the Gamma prior on χ, which in turn controls the shrinkage on the interaction coefficients β. The variance of this distribution was fixed to 10.
In total this gives 3^4 = 81 hyperparameter/initialisation conditions that were used to fit the mouse dendritic cells dataset [1]. The pseudotimes reported by these fits were correlated with the pseudotimes using the default PhenoPath values, the results of which can be seen in Supplementary Fig. 10. Across all parameter combinations, the minimum correlation to the default values is 0.9993, suggesting PhenoPath is highly robust to the initialisation of the latent space, the hyperparameters, and the variational inference algorithm.

Single-cell pseudotime is approximately linear
One common theme in many pseudotime inference algorithms is the emphasis on learning a non-linear manifold embedded in the high dimensional space (see e.g. [16,17]), which could in theory decrease the accuracy of trajectories inferred with PhenoPath, which assumes linear changes in expression over (pseudo-)time. Since the pseudotimes are always unobserved, it is impossible to precisely quantify the true non-linearity of the "pseudotemporal manifold". However, we can fit trajectories and compare the results with those inferred using the "non-linear" algorithms.
In the majority of cases the correlation of the inferred pseudotime to the first principal component of the dataset exceeds 0.8, with the minimum value still greater than 0.5. We further fitted a linear model (using lm in R) for each gene as a function of the pseudotime derived from each algorithm in each dataset, and examined the proportion of the transcriptome showing a significant linear trend (where significant is defined as a Benjamini-Hochberg corrected p-value for the linear coefficient of less than 0.05). Supplementary Fig. 11b shows that for all datasets and algorithms at least half the transcriptome exhibits an approximately linear trend.
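The Benjamini-Hochberg adjustment used for the per-gene linear trends can be sketched as follows (a minimal implementation of the standard step-up procedure):

```python
def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values: scale each sorted p-value
    by m/rank, then enforce monotonicity from the largest rank down."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Genes with an adjusted p-value below 0.05 then count towards the proportion of the transcriptome with a significant linear trend.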

PhenoPath is robust to highly variable gene selection
The trajectory inferred by pseudotime algorithms is obviously dependent on the genes used. A common approach is to select a subset of highly variable genes (HVGs) [23], though the precise number of HVGs to use is of course subjective and will affect the inferred pseudotime.
We tested the extent to which PhenoPath is robust to the number of HVGs used as input to the algorithm. For the three datasets we studied, we varied the number of HVGs used and examined the correlation to the latent space derived using 1,000 HVGs, the results of which can be seen in Supplementary Fig. 12.
Across each dataset PhenoPath was generally robust to the number of HVGs selected, with the lowest reported correlation being 0.45. We repeated this exercise using one of the leading pseudotime methods (Monocle 2 [16]) to understand the extent to which such variability is inherent to all pseudotime algorithms (note that the correlation reported in this case is to the Monocle 2 pseudotime on 1,000 HVGs). We found that Monocle 2 was also sensitive to the set of HVGs included in the analysis.

Demonstration of PhenoPath with categorical variables
One of PhenoPath's strengths is its ability to work with arbitrary design matrices incorporating binary, categorical, or continuous covariates. For this we simply construct a design matrix using the model.matrix command in R (with no intercept, as this is handled by the λ parameter) and pass this in as x to the model; e.g. for three categories, the design matrix contains one indicator column per category. We performed a small simulation study on toy data to demonstrate PhenoPath's ability to handle such input matrices. We simulated data for 80 genes and 100 cells from the PhenoPath mean function, where each cell belonged to one of three different "types", and added N(0, 1) noise. PCA plots of the expression data coloured by pseudotime and covariate status may be seen in Supplementary Fig. 13. We subsequently re-inferred the model using the phenopath() function in R, passing in a design matrix as described above with all other parameters left as default.
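For illustration, an intercept-free one-hot design matrix of the kind produced by model.matrix can be constructed as follows (a language-agnostic sketch; PhenoPath itself takes the R matrix directly):

```python
def one_hot_design(categories):
    """Intercept-free one-hot design matrix: one indicator column per
    level, analogous to an R design built without an intercept."""
    levels = sorted(set(categories))
    return [[1.0 if c == lev else 0.0 for lev in levels]
            for c in categories]

X = one_hot_design(["A", "B", "C", "A", "B"])
```

Each row contains exactly one 1, identifying the category of that cell.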
The results can be seen in Supplementary Fig. 14. Supplementary Fig. 14a shows that inference of the latent z values is highly accurate when compared to the true z values. We further examined the estimated β values compared to the true simulated values, which exhibits high correlation with slight shrinkage due to the automatic relevance determination (ARD) prior. However, the significance test for this is well calibrated, exhibiting high sensitivity and specificity.
Note that categorical covariates must be converted to a one-hot encoded matrix, which can substantially increase the number of parameters to be inferred: if a categorical variable has M levels then M × G additional parameters are introduced for G genes.

Comparison to fitting separate pseudotimes for each covariate status
An alternative approach to that of the PhenoPath model is to split the samples based on covariate status and perform pseudotime inference separately on each before combining the results post-hoc. There are several downsides to such an approach. Examining each set of cells separately leads to smaller numbers of samples per test and therefore a reduction in power to detect interactions. Furthermore, if the covariate is continuous one would have to resort to arbitrary binning of samples to perform such an analysis. This becomes even more burdensome if multiple covariates are present. For example, if x_1 has levels A and B and x_2 has levels C and D, we would have to consider four different groups of samples (AC, AD, BC, BD) and fit the pseudotimes separately for each. The number of groups grows exponentially with the number of factors, meaning many groups will have few or no samples in them. PhenoPath circumvents all of these issues by default as part of its integrated model.
Further subtle issues arise with respect to biological interpretation. Upon splitting the samples, how do we know the inferred pseudotimes correspond to the same biological process? A recent study (preprinted after our original submission [24]) has looked into this, using dynamic time warping to "align" multiple trajectories. This approach will not work if the covariate is continuous, becomes combinatorially hard with the number of levels if x is categorical, and is taken care of by default in the PhenoPath model by modelling a common process z for all samples.
To demonstrate some of the downsides to such a split-analysis approach, we took the mouse dendritic cells dataset [1] and performed pseudotime inference separately for each stimulant status (LPS and PAM) using the 5,000 HVGs. Our first observation was that the R² to capture time using Monocle was 0.64 and 0.55 for LPS and PAM respectively, compared to 0.70 for PhenoPath.
It is worth considering the difficulties in interpreting interactions when departing from linear models. Consider the example of the gene Tnf from the mouse dendritic cells dataset [1], which is downregulated under LPS stimulation but upregulated under PAM. If we split the dataset and perform DE analysis, Tnf will be significant under both, from which we could naively conclude there is no difference in regulation between LPS and PAM. If using the default B-spline basis for Monocle 2, we could then turn to the coefficients to see if there is any difference between them. However, the B-spline basis is nonparametric in nature, making coefficients hard to compare between two different datasets in an intuitive manner. We therefore restrict this analysis to linear changes in expression over pseudotime, fitting models of the form y = β_stimulant z and comparing β̂_LPS − β̂_PAM to test for interactions.
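The no-intercept linear fit used here has a closed-form slope; a minimal sketch with hypothetical data (note that flipping the sign of z flips the sign of the fitted slope, which is the orientation issue discussed below):

```python
def ols_slope(z, y):
    """Least-squares slope for the no-intercept model y = beta * z."""
    num = sum(zi * yi for zi, yi in zip(z, y))
    den = sum(zi * zi for zi in z)
    return num / den

z = [-1.0, -0.5, 0.0, 0.5, 1.0]
beta_lps = ols_slope(z, [2.0 * zi for zi in z])     # slope 2.0
beta_pam = ols_slope(z, [-1.0 * zi for zi in z])    # slope -1.0
interaction = beta_lps - beta_pam                   # 3.0
```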
We performed this analysis on the mouse dendritic cells dataset [1] as described above and compared the inferred β̂_LPS − β̂_PAM to the PhenoPath β interaction parameters (Supplementary Fig. 15a), finding virtually no correlation. After further analyses we discovered that the pseudotime for LPS cells was orientated "backwards" with respect to true time (which is fine, since all pseudotimes are equivalent up to a parity transformation), and upon "re-orientating" it and performing the analysis again we found the estimates in good agreement (Supplementary Fig. 15a).
While correcting the initially wrong orientation was easy given that we had ground-truth capture times, such a step is almost impossible without them. Given that the orientation is essentially random, there is a 50% chance that any one trajectory will be orientated the wrong way relative to another, and this problem compounds as the number of trajectories to integrate grows. Furthermore, we cannot appeal to heuristics such as choosing the orientation that gives the most coefficients in common, as it is precisely the differences that we are searching for in the first place. Finally, we note that while we can examine β̂_LPS − β̂_PAM to rank interactions, this does not provide a means of testing for significant interactions without non-trivial work to transform the confidence intervals. PhenoPath circumvents all of these issues by using an integrated model.

Effect of zero-inflation (dropout) on PhenoPath
We wished to quantify the effect of dropout on PhenoPath when PhenoPath was used with single-cell data. Using the same simulation scheme as before, we simulated (under the "high" noise regime) datasets where a fraction p ∈ {0.05, 0.1, 0.2, 0.5, 0.8, 0.9} of the non-zero counts are set to zero, with 40 replicates in each condition (see section 1.4). We then re-inferred the pseudotimes using PhenoPath.
The results can be seen in Supplementary Fig. 16. The ability of PhenoPath to infer pseudotime is largely unaffected by single-cell dropout until more than 90% of non-zero counts are set to zero. We believe this is due to the non-random nature of single-cell dropout: lowly expressed genes are more likely to be zeroed than highly expressed genes, so inference will in general be more robust to, say, a count of log(TPM + 1) = 1 being set to zero than to a count of log(TPM + 1) = 10 being set to zero.

BRCA Survival Analysis
We fitted a stratified (by ER status) Cox proportional hazards model to the overall survival data for 720 TCGA BRCA patients with survival and expression data, using patient age at onset and PhenoPath pseudotime as covariates. This survival analysis indicated that the model coefficient associated with the pseudotime contribution was significant (p = 0.0032). An analysis of deviance between nested models, with and without pseudotime as a covariate, indicated that the more complex model including pseudotime produces a better fit to the survival data (p = 0.004124). Proportional hazards tests and diagnostics based on weighted residuals were performed to confirm that the proportional hazards assumption was not violated. Supplementary Fig. 17 suggests that for both ER+ and ER- patients, increasing pseudotime progression leads to reduced overall survival.
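In symbols, the stratified model assumes a separate baseline hazard h_{0,s} for each ER stratum s while the coefficients are shared across strata (a sketch of the standard stratified Cox form; the coefficient names are illustrative):

```latex
h_s(t \mid \mathrm{age}, z) \;=\; h_{0,s}(t)\,
  \exp\!\left(\beta_{\mathrm{age}}\,\mathrm{age} + \beta_{z}\, z\right),
\qquad s \in \{\mathrm{ER}{+},\, \mathrm{ER}{-}\}.
```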

Identifying crossover points in BRCA
In PhenoPath we model gene expression evolving along the trajectory separately for each phenotype (or covariate) considered. Unless the gradient of change along the trajectory is exactly equal for both phenotypes (i.e. β = 0 exactly), the gene expression profiles will cross at some point in the trajectory.
Inference of this point would allow us to identify sections of the trajectory not affected by the covariate and, consequently, sections of the trajectory that are. This is important because if the crossover point occurs towards the beginning of the trajectory, gene expression is similar at the beginning but diverges as we move along the trajectory. Similarly, if the crossover point occurs towards the end of the trajectory, the expression profiles for the two phenotypes are different at the beginning of the trajectory but converge as the trajectory progresses. An interpretation of this is that the effect of the trajectory on expression slowly dominates over the effect of the phenotypes.
It is important to note that the latent trajectory values loosely follow a N(0, 1) distribution. This means the 'middle' of the trajectory corresponds to values around zero; values of -1 or less can be thought of as the 'beginning', while values greater than 1 may be thought of as the 'end'. Crucially, we can derive an analytical expression for the crossover point z* from the PhenoPath parameters.
The condition for the crossover point is that the predicted expression under each phenotype is identical. In the context of BRCA, writing the predicted expression for gene g as μ_g + α_g x + z(λ_g + β_g x) and equating the predictions for the two ER statuses x₊ and x₋ leads to the condition

α_g (x₊ − x₋) + z* β_g (x₊ − x₋) = 0,

which is in turn solved by z* = −α_g / β_g. We computed the crossover points z* for all significant genes in the BRCA dataset. We find that the vast majority of the crossover times z* occur towards the end of the trajectory, with a median value of around 0.4. In other words, at the beginning of the trajectory most genes are differentially expressed based on ER status, while as the trajectory progresses the trajectory effect comes to dominate and the gene expression profiles converge.