Abstract
We present dyngen, a multimodal simulation engine for studying dynamic cellular processes at singlecell resolution. dyngen is more flexible than current singlecell simulation engines, and allows better method development and benchmarking, thereby stimulating development and testing of computational methods. We demonstrate its potential for spearheading computational methods on three applications: aligning cell developmental trajectories, cellspecific regulatory network inference and estimation of RNA velocity.
Introduction
Singlecell simulation engines are becoming increasingly important for testing and benchmarking computational methods, a pressing need in the widely expanding field of singlecell biology. Complementary to real biological data, synthetic data provides a valuable alternative where the actual ground truth is completely known and thus can be compared to, in order to make quantitative evaluations of computational methods that aim to reconstruct this ground truth^{1}. In addition, simulation engines are more flexible when it comes to stresstesting computational methods, for example by varying the parameters of the simulation, such as the amount of noise, samples, and cells measured, allowing benchmarking of methods over a wide range of possible scenarios. In this way, they can even guide the design of real biological experiments, finding out the best conditions to be used as input for subsequent computational pipelines.
Another, more experimental use of simulation engines is their important role in spearheading the development of computational methods, possibly even before real data is available. In this way, simulation engines can be used to assess the value of novel experimental protocols or treatments. Simulation engines are also increasingly important when it comes to finding alternatives to animal models, for example for drug testing and precision medicine. In such scenarios, cellular simulations can act as digital twins, offering unlimited experimentation in silico^{2}.
Simulating realistic data requires that the underlying biology is recapitulated as best as possible, and in the case of transcriptomics data this typically involves modelling the underlying gene regulatory networks. Simulators of “bulk” microarray or RNAsequencing profiles simulate biological processes (e.g. transcription, translation) by translating a database of known regulatory interactions into a set of ordinary differential equations (ODE)^{3,4,5,6}. These methods have been instrumental in performing benchmarking studies^{7,8,9}. However, the advent of singlecell omics introduced several new types of analyses (e.g. trajectory inference, RNA velocity, cellspecific network inference) which exploit the higher resolution of singlecell versus bulk omics^{10}. In addition, the data characteristics of singlecell omics are vastly different from bulk omics, typically having much lower library sizes and a higher dropout rate, but also a high number of profiles^{11}. The low library sizes, in particular, are problematic as ODEs are illsuited for performing lowmolecule simulations^{12}. This necessitates the development of singlecell simulators.
To this end, singlecell omics simulators emulate the technical procedures from singlecell omics protocols. Simulators such as Splatter^{1}, powsimR^{13}, PROSSTT^{14}, and SymSim^{15}) have already been widely used to compare singlecell methods^{16,17,18,19} and perform independent benchmarks^{20,21,22}. However, by focusing more on simulating the singlecell omics protocol (e.g. RNA capture, amplification, sequencing) and less on the underlying biology (e.g. transcription, splicing, translation), their applicability and reusability is limited towards the specific application for which they were designed (e.g. benchmarking clustering or differential expression methods), and extending these tools to include additional modalities or experimental conditions is challenging.
We introduce dyngen, a method for simulating cellular dynamics at a singlecell, singletranscript resolution (Fig. 1). This problem is tackled in three fully configurable main steps. First, biological processes are mimicked by translating a gene regulatory network into a set of reactions (regulation, transcription, splicing, translation). Second, individual cells are simulated using Gillespie’s stochastic simulation algorithm (SSA)^{12}, which is designed to work well in lowmolecule simulations. Finally, real reference datasets are used to emulate singlecell omics profiling protocols.
Throughout a simulation, dyngen tracks many layers of information, including the abundance of any molecule in the cell, the progression of the cell along a dynamic process, and the activation strength of individual regulatory interactions. In addition, dyngen can simulate a large variety of dynamic processes (e.g. cyclic, branching, disconnected) as well as a broad range of experimental conditions (e.g. batch effects and timeseries, perturbation and singlecell knockdown experiments). For these reasons, dyngen can cater to a wide range of benchmarking applications, including trajectory inference, trajectory alignment, and trajectory differential expression (Supplementary Table 1).
Results
We demonstrate dyngen’s broad applicability by evaluating three types of computational approaches for which no simulation engines exist yet: cellspecific network inference, trajectory alignment and RNA velocity (Fig. 2). We emphasise that our main aim here is to illustrate the potential of dyngen for these evaluations, rather than performing largescale benchmarking, which would require assessing many more quantitative and qualitative aspects of each method^{23}.
Usecase “trajectory alignment”. Trajectory alignment methods align trajectories from different samples and allow studying the differences between the different trajectories. For example, by comparing the transcriptomic profiles of cells from a diseased patient to a healthy control, it might be possible to detect transcriptomics differences (differential expression) of particular cells along a developmental process, or to detect an early stop of the trajectory of the diseased patient. Currently, trajectory alignment is limited to aligning linear trajectories, though other topologies of a trajectory could be aligned as well. Dynamic Time Warping (DTW)^{24} is a method designed for aligning temporal sequences for speech recognition but has since been used to compare gene expression kinetics from many different biological processes^{25,26,27,28}. cellAlign^{28} uses DTW to perform trajectory alignment, but also includes interpolation and scaling of the singlecell data as a preprocessing step. We evaluate the performance of DTW and cellAlign by simulating 40 datasets, each containing two linear trajectories generated with the same gene regulatory network but with slightly different simulation kinetics. We assess the accuracy of the obtained alignments by comparing the generated alignment path with the worst possible alignment that could be performed (Supplementary Fig. 1D), named the Area Between Worst And Prediction (ABWAP) score. Overall, cellAlign performs significantly better than DTW (Supplementary Fig. 1), which is likely due to the interpolation and scaling steps provided by cellAlign, reducing noise in the data and improving the comparability of the trajectories. Note that, in this comparison, only linear trajectory alignment is performed. While dyngen can generate nonlinear trajectories (e.g. cyclic or branching), both aligning nonlinear trajectories and constructing a quantitative accuracy metric for nonlinear trajectory alignment is not trivial and an avenue for future work.
Usecase “RNA velocity”. RNA velocity methods use the relative ratio between premRNA and mature mRNA reads to predict the rate of increase/decrease of RNA molecule abundance, as this can be used to predict the directionality of singlecell differentiation in trajectories^{29,30}. Already two algorithms are currently available for estimating the RNA velocity vector from spliced and unspliced counts: velocyto^{30} and scvelo^{31}. Yet, to date, no quantitative assessment of their accuracy has been performed, mainly due to the difficulty in obtaining real groundtruth data to do so. In contrast, the groundtruth RNA velocity can be easily extracted from a dyngen simulation, as it is possible to store the rate at which mRNA molecules are being transcribed and degraded at any particular point in time. We executed velocyto and scvelo (with 2 different parameter settings, stochastic and dynamical) on 42 datasets with a variety of backbones (including linear, bifurcating, cyclic, disconnected). We evaluated the predictions using two metrics (Supplementary Fig. 2), one which directly compares the predicted RNA velocity of each gene with the groundtruth RNA velocity (called the “velocity correlation”), and one which compares the direction of the groundtruth trajectory embedded in a dimensionality reduction with the average RNA velocity of cells in that neighbourhood (called the “velocity arrow cosine”). While both velocyto and scvelo obtained high scores for the velocity arrow cosine metric (overall 25th percentile = 0.606), the velocity correlation is rather low (overall 75th percentile = 0.156). This means that predicting the RNA velocity (i.e. transcription rate minus the decay rate) for particular individual genes can be challenging, but the combined information is very informative in determining the directionality of cell progression in the trajectory. In terms of velocity correlation, no method performed significantly better than the other, whereas “scvelo stochastic” performed slightly worse than “scvelo dynamical” and velocyto in terms of velocity arrow cosine score. Note that, given that some genes are more informative in determining the overall directionality of cell progression, performing a feature selection before computing the embedded dimensionality reduction might result in significantly improved velocity arrow cosine scores.
Usecase “Cellspecific network inference” (CSNI). CSNI methods predict not only which transcription factors regulate which target genes, but also aim to identify how active each interaction is in each of the cells, since interactions can be turned off and on depending on the cellular state. While a few pioneering CSNI approaches have already been developed^{32,33,34}, a quantitative assessment of their performance is until now lacking. This is not surprising, as neither real nor in silico datasets of cellspecific or even celltypespecific interactions exist that are large enough so that it can be used as a groundtruth for evaluating CSNI methods. Extracting the groundtruth dynamic network in dyngen is straightforward though, given that we can calculate how target gene expression would change without the regulator being present. We used this groundtruth to compare the performance of three CSNI methods (Supplementary Fig. 3): LIONESS^{33}, SSN^{34}, and SCENIC^{32}. For each dataset, we computed the mean Area Under the Receiver Operating Characteristiccurve (AUROC) and Area Under the PrecisionRecall curve (AUPR) scores of the individual cells. Comparing the mean AUROC and AUPR showed that pySCENIC significantly outperforms both LIONESS and SSN, and in turn that LIONESS significantly outperforms SSN. The poor performance of SSN is expected, as its methodology for predicting a cellspecific is simply computing the difference in Pearson correlation values applied to the whole dataset and the whole dataset minus one sample. This strategy performs poorly in large datasets where cell correlations are high, as the removal of one cell will not yield large differences in correlation values and will result in mostly noise. Overall, pySCENIC almost always performs better than LIONESS, except for a few datasets where LIONESS does manage to obtain a higher AUROC score. However, by using a different internal network inference (e.g. GENIE3^{35} or pySCENIC’s GRNBoost2^{36}) could significantly increase the performance obtained by LIONESS.
Discussion
dyngen’s singlecell simulations can be used to evaluate common singlecell omics computational methods such as clustering, batch correction, trajectory inference, and network inference. However, the framework is flexible enough to be adaptable to a broad range of applications, including methods that integrate clustering, network inference, and trajectory inference. In this respect, dyngen may promote the development of tools in the singlecell field similarly as other simulators have done in the past^{5,37}. Additionally, one could anticipate technological developments in singlecell multiomics. In this way, dyngen allows designing and evaluating the performance and robustness of new types of computational analyses before experimental data becomes available, comparing which experimental protocol is the most costeffective in producing qualitative and robust results in downstream analysis. One major assumption of dyngen is that cells are assumed to be wellmixed and independent from each other. Subdividing a cell into multiple 2D or 3D subvolumes or allowing cells to exchange molecules, respectively, could pave the way to better study key cellular processes such as cell division, intercellular communication, and migration^{38}.
Methods
The workflow to generate in silico singlecell data consists of six main steps (Supplementary Fig. 4).
Defining the module network
One of the main processes involved in cellular dynamic processes is gene regulation, where regulatory cascades and feedback loops lead to progressive changes in expression and decision making. The exact way a cell chooses a certain path during its differentiation is still an active research field, although certain models have already emerged and been tested in vivo. One driver of bifurcation is mutual antagonism, where two genes strongly repress each other^{39,40}, forcing one of the two to become inactive^{41}. Such mutual antagonism can be modelled and simulated^{42,43}. Although the twogene model is simple and elegant, the reality is frequently more complex, with multiple genes (grouped into modules) repressing each other^{44}.
To start a dyngen simulation, the user needs to define a module network. The module network describes how sets of genes regulate each other and is what mainly determines which dynamic processes occur within the simulated cells.
A module network consists of modules connected together by regulatory interactions, which can be either upregulating or downregulating. A module may have basal expression, which means genes in this module will be transcribed without the presence of transcription factor molecules. A module marked as “active during the burn phase” means that this module will be allowed to generate expression of its genes during an initial warmup phase. At the end of the dyngen process, cells will not be sampled from the burn phase simulations. Interactions between modules have a strength (which is a positive integer) and an effect (+1 for upregulating, −1 for downregulating).
Several examples of module networks are given in Supplementary Fig. 5. A simple chain of modules (where one module upregulates the next) results in a linear process. By having the last module repress the first module, the process becomes cyclic. Two modules repressing each other is the basis of a bifurcating process, though several chains of modules have to be attached in order to achieve progression before and after the bifurcation process. Finally, a converging process has a bifurcation occurring during the burn phase, after which any differences in module regulation is removed.
Note that these examples represent the bare minimum in terms of the number of modules used. Using longer chains of modules is typically desired. In addition, the fate decisions made in this example of a bifurcation is reversible, meaning cells can be reprogrammed to go down a different differentiation path. If this effect is undesirable, more safeguards need to be put in place to prevent reprogramming from occurring.
Generating the gene regulatory network
The GRN is generated based on the given module network in four main steps (Supplementary Fig. 6).
Step 1, sampling the transcription factors (TF). The TFs are the main drivers of the molecular changes in the simulation. The user provides a backbone and the number of TFs to generate. Each TF is assigned to a module such that each module has at least x parameters (default x = 1). A TF inherits the ‘burn’ and ‘basal expression’ from the module it belongs to.
Step 2, generating the TF interactions. Let each TF be regulated according to the interactions in the backbone. These interactions inherit the effect, strength, and independence parameters from the interactions in the backbone. A TF can only be regulated by other TFs or itself.
Step 3, sampling the target subnetwork. A userdefined number of target genes are added to the GRN. Target genes are regulated by a TF or another target gene, but are always downstream of at least one TF. To sample the interactions between target genes, one of the many FANTOM5^{45} GRNs is sampled. The currently existing TFs are mapped to regulators in the FANTOM5 GRN. The targets are drawn from the FANTOM5 GRN weighted by their page rank value, to create an induced GRN. For each target, at most x regulators are sampled from the induced FANTOM5 GRN (default x = 5). The interactions connecting a target gene and its regulators are added to the GRN.
Step 4, sampling the housekeeping subnetwork. Housekeeping genes are completely separate from any TFs or target genes. A userdefined set of housekeeping genes is also sampled from the FANTOM5 GRN. The interactions of the FANTOM5 GRN are first subsampled such that the maximum indegree of each gene is x (default x = 5). A random gene is sampled and a breadthfirstsearch is performed to sample the desired number of housekeeping genes.
Convert gene regulatory network to a set of reactions
Simulating a cell’s GRN makes use of a stochastic framework which tracks the abundance levels of molecules over time in a discrete quantity. For every gene G, the abundance levels of three molecules are tracked, namely of corresponding premRNAs, mature mRNAs and proteins, which are represented by the terms x_{G}, y_{G}, and z_{G} respectively. The GRN defines how a reaction affects the abundance levels of molecules and how likely it will occur. Gibson and Bruck^{46} provide a good introduction to modelling gene regulation with stochastic frameworks, on which many of the concepts below are based.
For every gene in the GRN a set of reactions are defined, namely transcription, splicing, translation, and degradation. Each reaction consists of a propensity function—a formula f(.) to calculate the probability f(.) × dt of it occurring during a time interval dt—and the effect—how it will affect the current state if triggered.
The effects of each reaction mimic the respective biological processes (Supplementary Table 2, middle). Transcription of gene G results in the creation of a single premRNA molecule x_{G}. Splicing turns one premRNA x_{G} into a mature mRNA x_{G}. Translation uses a mature mRNA y_{G} to produce a protein z_{G}. PremRNA, mRNA, and protein degradation results in the removal of a x_{G}, y_{G}, and z_{G} molecule, respectively.
The propensity of all reactions except transcription are all linear functions (Supplementary Table 2, right) of the abundance level of some molecule multiplied by a pergene constant (Supplementary Table 3). The propensity of transcription of a gene G depends on the abundance levels of its TFs. The pergene and perinteraction constants are based on the median reported productionrates and halflives of molecules measured of 5000 mammalian genes^{47}, except that the transcription rate has been amplified by a factor of 10.
The propensity of the transcription of a gene G is inspired by thermodynamic models of gene regulation^{48}, in which the promoter of G can be bound or unbound by a set of N transcription factors H_{i}. Let f(z_{1}, z_{2}, …, z_{N}) denote the propensity function of G, in function of the abundance levels of the transcription factors. The following subsections explain and define the propensity function when N = 1, N = 2, and finally for an arbitrary N.
Propensity of transcription when N = 1
In the simplest case when N = 1, the promoter can be in one of two states. In state S_{0}, the promoter is not bound by any transcription factors, and in state S_{1} the promoter is bound by H_{1}. Each state S_{j} is linked with a relative activation α_{j}, a number between 0 and 1 representing the activity of the promoter at this particular state. The propensity function is thus equal to the expected value of the activity of the promoter multiplied by the premRNA production rate of G.
For N = 1, P(S_{1}) is equal to the Hill equation, where k_{i} represents the concentration of H_{i} at halfoccupation and n_{i} represents the Hill coefficient. Typically, n_{i} is between [1, 10]
The Hill equation can be simplified by letting \({\nu }_{i}={\left(\frac{{y}_{i}}{{k}_{i}}\right)}^{{n}_{i}}\).
Since P(S_{0}) = 1 − P(S_{1}), the activation function is formulated and simplified as follows.
Propensity of transcription when N = 2
When N = 2, there are four states S_{j}. The relative activations α_{j} can be defined such that H_{1} and H_{2} are independent (additive) or synergistic (multiplicative). In order to define the propensity of transcription f(.), the Hill equation P(S_{j}) is extended for two transcription factors.
Let w_{j} be the numerator of P(S_{j}), defined as the product of all transcription factors bound in that state:
The denominator of P(S_{j}) is then equal to the sum of all w_{j}. The probability of state S_{j} is thus defined as:
Substituting P(S_{j}) and w_{j} into f(.) results in the following equation:
Propensity of transcription for an arbitrary N
For an arbitrary N, there are 2^{N} states S_{j}. The relative activations α_{j} can be defined such that H_{1} and H_{2} are independent (additive) or synergistic (multiplicative). In order to define the propensity of transcription f(.), the Hill equation P(S_{j}) is extended for N transcription factors.
Let w_{j} be the numerator of P(S_{j}), defined as the product of all transcription factors bound in that state:
The denominator of P(S_{j}) is then equal to the sum of all w_{j}. The probability of state S_{j} is thus defined as:
Substituting P(S_{j}) into f(.) yields:
Propensity of transcription for a large N
For large values of N, computing f(.) is practically infeasible as it requires performing 2^{N} summations. In order to greatly simplify f(.), α_{j} could be defined as 0 when one of the regulators inhibits transcription and 1 otherwise.
Substituting Eq. (23) into Eq. (22) and defining R = {1, 2, …, N} and R^{+} = {i∣H_{i}activatesG} yields the simplified propensity function:
Independence, synergism, and basal expression
The definition of α_{j} as in Eq. (23) presents two main limitations. Firstly, since α_{0} = 1, it is impossible to tweak the propensity of transcription when no transcription factors are bound. Secondly, it is not possible to tweak the independence and synergism of multiple regulators.
Let ba ∈ [0, 1] denote the basal expression strength G (i.e. how much will G be expressed when no transcription factors are bound), and sy ∈ [0, 1] denote the synergism of regulators H_{i} of G, the transcription propensity becomes:
Simulate single cells
dyngen uses Gillespie’s SSA^{12} to simulate dynamic processes. An SSA simulation is an iterative process where at each iteration one reaction is triggered.
Each reaction consists of its propensity—a formula to calculate the probability of the reaction occurring during an infinitesimal time interval—and the effect—how it will affect the current state if triggered. Each time a reaction is triggered, the simulation time is incremented by \(\tau =\frac{1}{{\sum }_{j}{{\rm{prop}}}_{j}}{\mathrm{ln}}\,\left(\frac{1}{r}\right)\), with r ∈ U (0, 1) and prop_{j} the propensity value of the jth reaction for the current state of the simulation.
GillespieSSA2 is an optimised library for performing SSA simulations. The propensity functions are compiled to C++ and SSA approximations can be used which allow triggering many reactions simultaneously at each iteration. The framework also allows storing the abundance levels of molecules only after a specific interval has passed since the previous census. By setting the census interval to 0, the whole simulation’s trajectory is retained but many of these time points will contain very similar information. In addition to the abundance levels, also the propensity values and the number of firings of each of the reactions at each of the time steps can be retained.
Simulate experiment
From the SSA simulation we obtain the abundance levels of all the molecules at every state. We need to replicate technical effects introduced by experimental protocols in order to obtain data that is similar to real data. For this, the cells are sampled from the simulations and molecules are sampled for each of the cells. Gene capture rates and library sizes are empirically derived from real datasets to match real technical variation.
Sample cells
In this step, N cells are sampled from the simulations. Two approaches are implemented: sampling from an unsynchronised population of single cells (snapshot) or sampling at multiple time points in a synchronised population (time series).
Snapshot. The backbone consists of several states linked together by transition edges with length L_{i}, to which the different states in the different simulations have been mapped (Supplementary Fig. 7A). From each transition, \({N}_{i}=N/\frac{{L}_{i}}{\sum {L}_{i}}\) cells are sampled uniformly, rounded such that ∑N_{i} = N.
Time series. Assuming that the final time of the simulation is T, the interval [0, T] is divided into k equal intervals of width w separated by k − 1 gaps of width g. N_{i} = N/k cells are sampled uniformly from each interval (Supplementary Fig. 7B), rounded such that ∑N_{i} = N. By default, k = 8 and g = 0.75. For usual dyngen simulations, 10 ≤ T ≤ 20. For larger values of T, k and g should be increased accordingly.
Sample molecules
Molecules are sampled from the simulation to replicate how molecules are experimentally sampled. A real dataset is downloaded from a repository of singlecell RNAseq datasets^{49}. For each in silico cell i, draw its library size ls_{i} from the distribution of transcript counts per cell in the real dataset. The capture rate cr_{j} of each in silico molecule type j is drawn from N(1, 0.05). Finally, for each cell i, draw ls_{i} molecules from the multinomial distribution with probabilities cr_{j} × ab_{i,j} with ab_{i,j} the molecule abundance level of molecule j in cell i.
Comparison between a dyngen and a reference dataset
Comparison between a dyngen dataset and the reference dataset it used in terms of characteristic singlecell omics features showed that dyngen produces datasets with highly similar data characteristics (Supplementary Note 1). Supplementary Note 1 was generated using countsimQC^{50}.
Simulating batch effects
Simulating batch effects can be performed in multiple ways. One such way is to perform the first two steps of the creation of a dyngen model (defining the module network and generating the GRN). For each desired batch, create a separate model for which random kinetics are generated and perform all subsequent dyngen steps (convert to reactions, simulate gold standard, simulate single cells, simulate experiment). Since each separate model has different underlying kinetics, the combined output will resemble having batch effects.
Determining the groundtruth trajectory
To construct the groundtruth trajectory, the user needs to provide the groundtruth state network alongside the initial module network (Supplementary Fig. 8). Each edge in the state network specifies which modules are allowed to change in expression in transitioning from one state to another. For each edge, a simulation is run using the end state of an upstream branch as the initial expression vector, and only allowing the modules as predefined by the attribute to change.
As an example, consider the cyclic trajectory shown in Supplementary Fig. 8. State S0 begins with an expression vector of all zero values. To simulate the transition from S0 to S1, regulation of the genes in modules A, B, and C are turned on. After a predefined period of time, the end state of this transition is considered the expression vector of state S1. To simulate the transition from S1 to S2, regulation of the genes in modules D and E are turned on, while the regulation of genes in module C is turned off. During this simulation, the expression of genes in modules A, B, D, and E is thus allowed to change. The end state of the simulation is considered the expression vector of state S2.
For each of the branches in the state network, an expression matrix and the corresponding progression time along that branch are retained. To map a simulated cell to the groundtruth, the correlation between its expression values and the expression matrix of the groundtruth trajectory is calculated, and the cell is mapped to the position in the groundtruth trajectory that has the highest correlation.
Determining the cellspecific groundtruth regulatory network
Calculating the regulatory effect of a regulator R on a target T (Supplementary Fig. 4F) requires determining the contribution of R in the propensity function of the transcription of T with respect to other regulators. This information is useful, amongst others, for benchmarking cellspecific network inference methods.
The regulatory effect of R on T at a particular state S is defined as the change in the propensity of transcription when R is set to zero, scaled by the inverse of the premRNA production rate of T. More formally:
Determining the regulatory effect for all interactions and cells in the dataset yields the complete cellspecific groundtruth GRN. The regulatory effect lies between [−1, 1], where −1 represents complete inhibition of T by R, 1 represents maximal activation of T by R, and 0 represents inactivity of the regulatory interaction between R and T.
Comparison of cellspecific network inference methods
42 datasets were generated using the 14 different predefined backbones and three different seeds. For every cell in the dataset, the transcriptomics profile and the corresponding cellspecific groundtruth regulatory network was determined.
We selected three cellspecific NI methods: SCENIC^{32}, LIONESS^{33,51}, and SSN^{34}.
LIONESS^{33} runs a NI method multiple times to construct cellspecific GRNs. LIONESS first infers a GRN with all of the samples. A second GRN is inferred with all samples except one particular profile. The cellspecific GRN for that particular profile is defined as the difference between the two GRN matrices. This process is repeated for all profiles, resulting in a cellspecific GRN. By default, LIONESS uses PANDA^{52} to infer GRNs, but since dyngen does not produce motif data and motif data is required by PANDA, PANDA is inapplicable in this context. Instead, we used the lionessR^{53} implementation of LIONESS, which uses by default the Pearson correlation as a NI method. We marked results from this implementation as “LIONESS + Pearson”.
SSN^{34} follows, in essence, the exact same methodology as LIONESS except that it specifically only uses the Pearson correlation. It is worth noting that the LIONESS preprint was released before the publication of SSN. Since no implementation was provided by the authors, we implemented SSN in R using basic R and tidyverse functions^{54} and marked results from this implementation as “SSN*”.
SCENIC^{32} consists of four main steps. First, classical network inference is performed with stochastic gradient boosting machines using arboreto^{36}. Second, the top 10 regulators of every target gene are selected. Interactions are grouped together in ‘modules’; each module contains one regulator and all of its targets. Next, the modules are filtered using motif analysis. Finally, for each module and each cell, an activity score is calculated using AUCell. As a postprocessing of this output, all modules and the corresponding activity scores are combined back into a cellspecific GRN consisting of (cell, regulator, target, score) pairs. For this analysis, the Python implementation of SCENIC was used, namely pySCENIC^{55}. Since dyngen does not generate motif data, step 3 in this analysis is skipped.
The AUROC and AUPR metrics are common metrics for evaluating a predicted GRN with a groundtruth GRN^{56}. To compare a predicted cellspecific GRN with the groundtruth cellspecific GRN, the top 10,000 interactions per cell is retained, and the mean AUROC and AUPR scores are calculated across all cells.
We compared the mean AUROC and AUPR scores obtained by the three CSNI methods across all datasets by performing pairwise nonparametric paired twosided Durbin–Conover tests^{57} using pairwiseComparisons^{58}. Test statistics and p values for the all pairwise combinations are reported in the Source Data file. Reported p values are adjusted for multiple testing using Holm correction^{59}.
Comparison of RNA velocity methods
Three datasets were generated for each of the 14 different predefined backbones, resulting in a collection of 42 datasets. Throughout each of the simulation, the propensity of the transcription and mRNA decay is collected, as the RNA velocity of a gene at any point in the simulation is the difference between the transcription propensity and the mRNA decay propensity.
We applied two RNA velocity methods: velocyto^{30}, as implemented in the velocyto.py package, and scvelo method^{31}, as implemented in the scvelo package. For scvelo, we chose two parameter settings for “mode”, namely “stochastic” and “dynamical”. For both methods, we used the same normalised data as provided by dyngen, with no extra cell or feature filtering, but otherwise matched the parameters to their respective tutorial vignettes as well as possible.
We compared each RNA velocity prediction to the groundtruth using two metrics: the velocity correlation and the velocity arrow cosine. For the velocity correlation, we extracted a ground truth RNA velocity by subtracting for each mRNA molecule the propensity of its production by the propensity of its degradation. If the expression of an mRNA will increase in the future, this value is positive, while it is negative if it is going to decrease. For each gene, we determined its velocity correlation by calculating the Spearman rank correlation between the ground truth velocity with the observed velocity. For the velocity arrow cosine, we determined a set of 100 trajectory waypoints uniformly spread on the trajectory. For each waypoint, we weighted each cell based on a Gaussian kernel on its geodesic distance from the waypoint. These weights were used to calculate a weighted average velocity vector of each waypoint. We then calculated for each waypoint the cosine similarity between this velocity vector and the known direction of the trajectory.
We compared the velocity correlation and velocity arrow cosine scores obtained by velocyto and scvelo across all datasets by performing pairwise nonparametric paired twosided Durbin–Conover tests^{57} using pairwiseComparisons^{58}. Test statistics and p values for the all pairwise combinations are reported in the Source Data file. Reported p values are adjusted for multiple testing using Holm correction^{59}.
Comparison of trajectory alignment
Four custom linear backbones of varying sizes were constructed. For each of these backbones, 10 datasets were generated with 10 different seeds, resulting in a total of 40 datasets. Every dataset is generated in three main steps. First, the GRN is generated based on the given backbone. Next, generating the kinetics, gold standard, and cells is performed twice, resulting in two subdatasets. Finally, the two subdatasets are combined and cells are sampled from the combined dataset. Since the two subdatasets were simulated with different kinetic parameters, the combined dataset will contain two trajectories.
On each combined dataset we applied two trajectory alignment methods, DTW^{24} and cellAlign^{28}. DTW is designed to align temporal sequences by dilating or contracting the sequences to best match each other. cellAlign uses DTW to perform this alignment, but first interpolates and rescales the input data in order to better cope with singlecell omics data.
To evaluate a trajectory alignment method on a combined dataset we computed the geodesic distances of each cell from the start of the trajectory, also called the pseudotime. For each dataset, the pseudotime values are rescaled between 0 and 1 to allow for easier comparison. A trajectory alignment produces a sequence of index pairs [(i_{0}, j_{0}), (i_{1}, j_{1}), …, (i_{N}, j_{N})], where i_{0} and j_{0} are equal to 0 (the first position in both pseudotime series), i_{N} and j_{N} are equal to the respective last positions in the pair of pseudotime series, and [i_{0}, i_{1}, …, i_{N}] and [j_{0}, j_{1}, …, j_{N}] are in ascending order and can contain duplicates values. The ABWAP metric is defined as follows, where pt_{1} and pt_{2} are the unit pseudotime vectors. See Supplementary Fig. 1D for a visual interpretation of this metric.
We compared the ABWAP scores obtained by DTW and cellAlign across all datasets by performing pairwise nonparametric paired twosided Durbin–Conover tests^{57} using pairwiseComparisons^{58}. Test statistics and p values for the all pairwise combinations are reported in the Source Data file. Reported p values are adjusted for multiple testing using Holm correction^{59}.
Comparison of scalability and runtime
Simulating a bifurcating cycle dataset with 10,000 genes and 10,000 required in total 1147 s (Supplementary Note 2). Fixing the number of genes and varying the number of cells showed that the execution time of dyngen scales linearly w.r.t. the number of cells (Supplementary Note 2). Fixing the number of cells and varying the number of genes also showed that the execution time of dyngen scales linearly w.r.t. the number of genes (Supplementary Note 2). These timings were measured using 30 (out of 32) threads using a AMD Ryzen 9 5950X clocked at 3.4GHz.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
Source data for box plots in Fig. 2 and Supplementary Figs. 1–3 are provided with this paper. All code and data required to reproduce the analysis are available on GitHub at https://github.com/dynverse/dyngen_manuscript. The datasets generated for the different use cases are available on Zenodo with record number 4637926 (https://doi.org/10.5281/zenodo.4637926). Source data are provided with this paper.
Code availability
Results in this manuscript were generated with R 4.0.3 and dyngen 1.0.0. dyngen is available as an opensource software package at https://cran.rproject.org/package=dyngenand also on Zenodo with record number 4751443 (https://doi.org/10.5281/zenodo.4751443). The analyses performed in this manuscript are available on GitHub at https://github.com/dynverse/dyngen_manuscript. The version numbers of downstream dependencies of dyngen and dyngen_manuscript used in this study are: anndata 0.7.5.1, assertthat 0.2.1, babelwhale 1.0.1, bit 4.0.4, bit64 4.0.5, carrier 0.1.0, cellAlign 0.1.0, codetools 0.218, colorspace 2.00, compiler 4.0.4, crayon 1.4.1.9000, data.table 1.13.4, DBI 1.1.1, debugme 1.1.0, desc 1.2.0, digest 0.6.27, dplyr 1.0.5, dtw 1.223, dynparam 1.0.1, dynutils 1.0.6, dynwrap 1.2.2, ellipsis 0.3.1, fansi 0.4.2, farver 2.1.0, future 1.20.1, future.apply 1.7.0, generics 0.1.0, ggforce 0.3.2, ggplot2 3.3.3, ggraph 2.0.4, ggrepel 0.9.0, GillespieSSA2 0.2.7, globals 0.14.0, glue 1.4.2, graphlayouts 0.7.1, grid 4.0.4, gridExtra 2.3, gtable 0.3.0, gtools 3.8.2, hdf5r 1.3.3, hms 1.0.0, igraph 1.2.6, irlba 2.3.3, jsonlite 1.7.2, lattice 0.2041, lifecycle 1.0.0, lisi 1.0, listenv 0.8.0, lmds 0.1.0, magrittr 2.0.1, MASS 7.353, Matrix 1.32, matrixStats 0.57.0, munsell 0.5.0, parallel 4.0.4, parallelly 1.21.0, patchwork 1.1.1, pbapply 1.43, pheatmap 1.0.12, pillar 1.5.1, pkgconfig 2.0.3, plyr 1.8.6, polyclip 1.100, pracma 2.2.9, processx 3.4.5, proxy 0.424, proxyC 0.1.5, ps 1.6.0, purrr 0.3.4, R6 2.5.0, RANN 2.6.1, rappdirs 0.3.3, RColorBrewer 1.12, Rcpp 1.0.6, RcppParallel 5.0.2, RcppXPtrUtils 0.1.1, readr 1.4.0, remotes 2.2.0, reshape2 1.4.4, reticulate 1.189007, rlang 0.4.10, rprojroot 2.0.2, scales 1.1.1, sctransform 0.3.2, scvelo 0.1.0.9000, stringi 1.5.3, stringr 1.4.0, tibble 3.0.5, tidygraph 1.2.0, tidyr 1.1.2, tidyselect 1.1.0, tools 4.0.4, tweenr 1.0.1, utf8 1.1.4, vctrs 0.3.6, viridis 0.5.1, viridisLite 0.3.0, yaml 2.2.1.
References
Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of singlecell RNA sequencing data. Genome Biol. 18, 174 https://doi.org/10.1186/s1305901713050 (2017).
Björnsson, B. et al. Digital twins to personalize medicine. Genome Med. 12, 4 https://doi.org/10.1186/s1307301907013 (2019).
Roy, S., WernerWashburne, M. & Lane, T. A system for generating transcription regulatory networks with combinatorial control of transcription. Bioinformatics 24, 1318–1320 https://doi.org/10.1093/bioinformatics/btn126 (2008).
Hache, H., Wierling, C., Lehrach, H. & Herwig, R. GeNGe: Systematic generation of gene regulatory networks. Bioinformatics 25, 1205–1207 https://doi.org/10.1093/bioinformatics/btp115 (2009).
Schaffter, T., Marbach, D. & Floreano, D. GeneNetWeaver: In silico benchmark generation and performance profiling of network inference methods. Bioinformatics 27, 2263–2270 https://doi.org/10.1093/bioinformatics/btr373 (2011).
Van den Bulcke, T. et al. SynTReN: A generator of synthetic gene expression data for design and analysis of structure learning algorithms. BMC Bioinform. 7, 43 https://doi.org/10.1186/14712105743 (2006).
Prill, R. J. et al. Towards a Rigorous Assessment of Systems Biology Models: The DREAM3 Challenges. PLoS ONE 5, e9202 https://doi.org/10.1371/journal.pone.0009202 (2010).
Marbach, D. et al. Revealing strengths and weaknesses of methods for gene network inference. Proc. Natl Acad. Sci. 107, 6286–6291 https://doi.org/10.1073/pnas.0913357107 (2010).
Marbach, D. et al. Wisdom of Crowds for Robust Gene Network Inference. Nat. Methods 9, 796–804 https://doi.org/10.1038/nmeth.2016 (2012).
Luecken, M. D. & Theis, F. J. Current best practices in singlecell RNAseq analysis: a tutorial. Mol. Syst. Biol. 15, e8746 https://doi.org/10.15252/msb.20188746 (2019).
Vallejos, C. A., Risso, D., Scialdone, A., Dudoit, S. & Marioni, J. C. Normalizing singlecell RNA sequencing data: challenges and opportunities. Nat. Methods 14, 565–571 https://doi.org/10.1038/nmeth.4292 (2017).
Gillespie, D. T. Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81, 2340–2361 https://doi.org/10.1021/j100540a008 (1977).
Vieth, B., Ziegenhain, C., Parekh, S., Enard, W. & Hellmann, I. powsimR: Power analysis for bulk and single cell RNAseq experiments. Bioinformatics 33, 3486–3488 https://doi.org/10.1093/bioinformatics/btx435 (2017).
Papadopoulos, N., Gonzalo, P. R. & Söding, J. PROSSTT: Probabilistic simulation of singlecell RNAseq data for complex differentiation processes. Bioinformatics 35, 3517–3519 https://doi.org/10.1093/bioinformatics/btz078 (2019).
Zhang, X., Xu, C. & Yosef, N. Simulating multiple faceted variability in single cell RNA sequencing. Nat. Commun. 10, 1–16 https://doi.org/10.1038/s4146701910500w (2019).
Street, K. et al. Slingshot: Cell Lineage and Pseudotime Inference for SingleCell Transcriptomics. BMC Genomics 19, 477 https://doi.org/10.1186/s1286401847720 (2018).
Parra, R. G. et al. Reconstructing complex lineage trees from scRNAseq data using MERLoT. Nucleic Acids Res. 47, 8961–8974 https://doi.org/10.1093/nar/gkz706 (2019).
LummertzdaRocha, E. et al. Reconstruction of complex singlecell trajectories using CellRouter. Nat. Commun. 9, 892 https://doi.org/10.1038/s4146701803214y (2018).
Lin, Y. et al. scClassify: Sample size estimation and multiscale classification of cells using single and multiple reference. Mol. Syst. Biol. 16, e9389 https://doi.org/10.15252/msb.20199389 (2020).
Duò, A., Robinson, M. D. & Soneson, C. A systematic performance evaluation of clustering methods for singlecell RNAseq data. F1000Res. 7, 1141 https://doi.org/10.12688/f1000research.15666.2 (2018).
Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of singlecell trajectory inference methods. Nat. Biotechnol. 37, 547–554 https://doi.org/10.1038/s4158701900719 (2019).
Soneson, C. & Robinson, M. D. Bias, robustness and scalability in singlecell differential expression analysis. Na. Methods 15, 255–261 https://doi.org/10.1038/nmeth.4612 (2018).
Weber, L. M. et al. Essential guidelines for computational method benchmarking. Genome Biol. 20, 125 https://doi.org/10.1186/s1305901917388 (2019).
Giorgino, T. Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package. Journal of Statistical Software, https://doi.org/10.18637/jss.v031.i07 (2009).
Cacchiarelli, D. et al. Aligning singlecell developmental and reprogramming trajectories identifies molecular determinants of myogenic reprogramming outcome. Cell Syst. 7, 258–268.e3 https://doi.org/10.1016/j.cels.2018.07.006 (2018).
Kanton, S. et al. Organoid singlecell genomic atlas uncovers humanspecific features of brain development. Nature 574, 418–422 https://doi.org/10.1038/s4158601916549 (2019).
McFalineFigueroa, J. L. et al. A pooled singlecell genetic screen identifies regulatory checkpoints in the continuum of the epithelialtomesenchymal transition. Nat. Genet. 51, 1389–1398 https://doi.org/10.1038/s4158801904895 (2019).
Alpert, A., Moore, L. S., Dubovik, T. & ShenOrr, S. S. Alignment of singlecell trajectories to compare cellular expression dynamics. Nat. Methods 15, 267–270 https://doi.org/10.1038/nmeth.4628 (2018).
Zeisel, A. et al. Coupled premRNA and mRNA dynamics unveil operational strategies underlying transcriptional responses to stimuli. Mol. Syst. Biol. 7, 529 https://doi.org/10.1038/msb.2011.62 (2011).
Manno, G. L. et al. RNA Velocity of Single Cells. Nature 560, 494–498 https://doi.org/10.1038/s4158601804146 (2018).
Bergen, V., Lange, M., Peidli, S., Wolf, F. A. & Theis, F. J. Generalizing RNA velocity to transient cell states through dynamical modeling. Nat. Biotechnol. 38 1408–1414 https://doi.org/10.1038/s4158702005913 (2020).
Aibar, S. et al. SCENIC: singlecell regulatory network inference and clustering. Nat. Methods 14, 1083–1086 https://doi.org/10.1038/nmeth.4463 (2017).
Kuijjer, M. L., Tung, M. G., Yuan, G., Quackenbush, J. & Glass, K. Estimating samplespecific regulatory networks. iScience 14, 226–240 https://doi.org/10.1016/j.isci.2019.03.021 (2019).
Liu, X., Wang, Y., Ji, H., Aihara, K. & Chen, L. Personalized characterization of diseases using samplespecific networks. Nucleic Acids Res. 44, e164–e164 https://doi.org/10.1093/nar/gkw772 (2016).
HuynhThu, V. et al. Inferring regulatory networks from expression data using treebased methods. PLoS ONE 5, e12776 https://doi.org/10.1371/journal.pone.0012776 (2010).
Moerman, T. et al. GRNBoost2 and Arboreto: efficient and scalable inference of gene regulatory networks. Bioinformatics 35, 2159–2161 https://doi.org/10.1093/bioinformatics/bty916 (2019).
Ewing, A. D. et al. Combining tumor genome simulation with crowdsourcing to benchmark somatic singlenucleotidevariant detection. Nat. Methods 12, 623–630 https://doi.org/10.1038/nmeth.3407 (2015).
Smith, S. & Grima, R. Spatial stochastic intracellular kinetics: a review of modelling approaches. Bull. Math. Biol. 81, 2960–3009 https://doi.org/10.1007/s1153801804431 (2019).
Rekhtman, N., Radparvar, F., Evans, T. & Skoultchi, A. I. Direct interaction of hematopoietic transcription factors PU.1 and GATA1: functional antagonism in erythroid cells. Genes Dev. 13, 1398–1411 https://doi.org/10.1101/gad.13.11.1398 (1999).
Xu, H. et al. Regulation of bifurcating B cell trajectories by mutual antagonism between transcription factors IRF4 and IRF8. Nat. Immunol. 16, 1274–1281 (2015).
Graf, T. & Enver, T. Forcing cells to change lineages. Nature 462, 587 https://doi.org/10.1038/nature08533 (2009).
Wang, J., Zhang, K., Xu, L. & Wang, E. Quantifying the Waddington landscape and biological paths for development and differentiation. Proc. Natl Acad. Sci. 108, 8257–8262 https://doi.org/10.1073/pnas.1017017108 (2011).
Ferrell, J. E. Bistability, Bifurcations, and Waddington’s Epigenetic Landscape. Current Biology 22, R458–R466 https://doi.org/10.1016/j.cub.2012.03.045 (2012).
Yosef, N. et al. Dynamic regulatory network controlling {TH17} cell differentiation. Nature 496, 461–468 (2013).
Lizio, M. et al. Gateways to the FANTOM5 promoter level mammalian expression atlas. Genome Biology 16, 22 https://doi.org/10.1186/s1305901405606 (2015).
Gibson, M. A. & Bruck, J. A probabilistic model of a prokaryotic gene and its regulation. Computational Methods in Molecular Biology: From Genotype to Phenotype. (MIT press, 2000).
Schwanhäusser, B. et al. Global quantification of mammalian gene expression control. Nature 473, 337–342 https://doi.org/10.1038/nature10098 (2011).
Schilstra, M. J. & Nehaniv, C. L. BioLogic: Gene Expression and the Laws of Combinatorial Logic. Artif. Life 14, 121–133 https://doi.org/10.1162/artl.2008.14.1.121 (2008).
Cannoodt, R., Saelens, W., Todorov, H. & Saeys, Y. Singlecell omics datasets containing a trajectory. Zenodo, https://doi.org/10.5281/zenodo.1211532 (2018).
Soneson, C. & Robinson, M. D. Towards unified quality verification of synthetic count data with countsimQC. Bioinformatics 34, 691–692 https://doi.org/10.1093/bioinformatics/btx631 (2018).
Kuijjer, M. L., Tung, M., Yuan, G., Quackenbush, J. & Glass, K. Estimating samplespecific regulatory networks. iScience. 14, 226–240 https://doi.org/10.1016/j.isci.2019.03.021 (2019).
Glass, K., Huttenhower, C., Quackenbush, J. & Yuan, G.C. Passing messages between biological networks to refine predicted interactions. PLOS ONE 8, e64832 https://doi.org/10.1371/journal.pone.0064832 (2013).
Kuijjer, M. L., Hsieh, P.H., Quackenbush, J. & Glass, K. lionessR: single sample network inference in R. BMC Cancer 19, 1003 https://doi.org/10.1186/s1288501962357 (2019).
Wickham, H. et al. Welcome to the Tidyverse. https://doi.org/10.21105/joss.01686 (2019).
Van de Sande, B. et al. A scalable SCENIC workflow for singlecell gene regulatory network analysis. Nat. Protoc. 15, 2247–2276 https://doi.org/10.1038/s4159602003362 (2020).
Marbach, D., Schaffter, T., Mattiussi, C. & Floreano, D. Generating realistic in silico gene networks for performance assessment of reverse engineering methods. J. Comput. Biol. 16, 229–239 https://doi.org/10.1089/cmb.2008.09TT (2009).
Conover, W. J. & Iman, R. L. On multiplecomparisons procedures. Technical report, Technical report, Los Alamos Scientific Laboratory (1979).
Patil I. pairwiseComparisons: Multiple Pairwise Comparison Tests. https://CRAN.Rproject.org/package=pairwiseComparisons (2019).
Holm S. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6, 65–70 (1979).
Acknowledgements
This project has been made possible in part by grant number 2020218899 from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation. This research received funding from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” program.
Author information
Authors and Affiliations
Contributions
W.S. and R.C. designed the study. R.C., W.S., and L.D. performed the experiments and analysed the data. R.C. and W.S. implemented the dyngen software package. R.C., W.S., L.D., and Y.S. wrote the manuscript. Y.S. supervised the project.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Source data
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Cannoodt, R., Saelens, W., Deconinck, L. et al. Spearheading future omics analyses using dyngen, a multimodal simulator of single cells. Nat Commun 12, 3942 (2021). https://doi.org/10.1038/s41467021241522
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41467021241522
This article is cited by

Alignment of singlecell trajectory trees with CAPITAL
Nature Communications (2022)

CellRank for directed singlecell fate mapping
Nature Methods (2022)

A deep manifoldregularized learning model for improving phenotype prediction from multimodal data
Nature Computational Science (2022)

Differential abundance testing on singlecell data using knearest neighbor graphs
Nature Biotechnology (2022)

Potential and limitations of digital twins to achieve the Sustainable Development Goals
Nature Sustainability (2022)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.