Single-cell trajectories reconstruction, exploration and mapping of omics data with STREAM

Single-cell transcriptomic assays have enabled the de novo reconstruction of lineage differentiation trajectories, along with the characterization of cellular heterogeneity and state transitions. Several methods have been developed for reconstructing developmental trajectories from single-cell transcriptomic data, but efforts on analyzing single-cell epigenomic data and on trajectory visualization remain limited. Here we present STREAM, an interactive pipeline capable of disentangling and visualizing complex branching trajectories from both single-cell transcriptomic and epigenomic data. We have tested STREAM on several synthetic and real datasets generated with different single-cell technologies. We further demonstrate its utility for understanding myoblast differentiation and disentangling known heterogeneity in hematopoiesis for different organisms. STREAM is an open-source software package.

Along the lymphoid differentiation trajectory (S2,S1,S0), STREAM identified HSCs-specific genes like Mpl 12 ,Tgm2 13 (Supplementary Figure 2), whose expressions are repressed towards lymphoid differentiation, and LMPPs-specific genes like lghv1-81, Ccl3 (Supplementary Figure 2), whose expressions are activated during the differentiation as discussed in the original study 1 . Taken together, these analyses validate the accuracy of trajectory reconstruction of STREAM in recapitulating key bifurcation events and regulators of early blood development differentiation at single-cell resolution.

Supplementary Note 2: STREAM trajectory inference in high dimensional spaces
As discussed in the main text, we were able to recover four different lineages in zebrafish hematopoietic inDrop data. We hypothesized that we cannot separate B and T cells since the number of dimensions used to infer the trajectories may be insufficient to fully recapitulate all the lineages. Since Monocle2 is one of the state-of-the-art methods designed to capture complex trajectories, we decided to qualitatively compare the inferred trajectories of STREAM and Monocle2 for this challenging dataset. We observed indeed that with less than 4 components both methods failed to recover the B and T lymphoid branches (Supplementary Figure 4b,c) but accurately identified the neutrophil, macrophage and lymphoid branches. However, Monocle2 failed to recover a reasonable starting branch for stem cells that are instead mixed with neutrophils. We also observed that with Monocle2 the progenitor cells are misplaced in the developmental tree since they are assigned after the first branching event that gives raise instead to three separate erythroid branches. For STREAM, we observed that simply increasing the number of dimensions may introduce artifacts (trivial branches). Based on the revised seeding strategy (Methods and Supplementary Figure 4a) and using 20 dimensions as in the original study by Tang et al. 14 , we obtained more detailed trajectories and lineage separation (Supplementary Figure  4b). First, we were able to clearly separate lymphoid cells into B cells and T/NK cells. Second, we uncovered a rare and well separated branch that contains only the cell type labeled as "Macrophages/Myeloid" by Tang et al (comprising less than 1.36% of the marrow). The previous original study could not assign these unique cell types as arising in either the myeloid or lymphoid cell lineages. By contrast, we observed that the branch capturing these cells was closer to the branch capturing the lymphoid cells. These results were verified by independent analyses using Monocle2 and UMAP (Supplementary Figure 4c-d), each of which confirmed shared lineage with lymphoid cells. As reported previously 14 , we also observe that these cells express well-known macrophage marker genes including mpeg1.1, ccr9b, tlr7, ccl39.3, and p2rx3b. Notably, our marker gene detection analysis uncovered genes that are exquisitely specific for this branch such as irf8 and ctsbb (Supplementary Figure 5). We also observed genes in the main Macrophages clade that are not expressed in this rare subpopulation such as mfap4, grn1 or marco (Supplementary Figure 5). As such, we believe that these cells correspond to a separate rare cell type/state. Third, we were able to uncover novel sub-branches within both the macrophages and neutrophil groups. Diversity in these lineages is well-known in mouse and human. For example, N1 and N2 neutrophils exert a wide diversity of cellular responses in cancer following polarization into each cell state by TGF-b and type 1 interferons, respectively. In addition, several cell states for neutrophils have been previously described such as immature, resting, primed, and active 15 . It is also well-appreciated that macrophages have a diversity of cellular states that depends on the age of animal, infection/disease type, and tissue in which macrophages become active 16,17,18 , and several studies have identified in zebrafish macrophage subsets with important functional differences 19,20 . Notably, several genes that are uniquely expressed in these sub-branches of the neutrophil and macrophage lineages are shown in Supplementary Figure 5. Taken together we believe that these new sub-branches correspond to separate cell types/states and that further experimental analyses are necessary to fully characterize the identity of these populations.
Also, in this case, we compared our results in the 20-dimensional space with Monocle2. Both methods separated lymphoid cells into B cells and T cells. However, Monocle2 does not map the progenitor cells to an intermediate branch toward more defined lineages, instead these cells are placed on a separate and terminal branch that does not progress further. In addition, Monocle2 detects three erythroid branches with 20 dimensions while with 4 dimensions four separate erythroid branches were recovered, making the interpretation of this refined analysis not intuitive. STREAM instead maintains a single branch for the erythroid cells in both spaces. Apart from these differences, both methods further detected multiple sub-branches for the macrophage and neutrophil lineages.
To further validate the generalizability of our approach to learn complex trajectories using STREAM, we also re-analyzed a challenging dataset generated by Paul et al. 3 , where ~2700 myeloid cells were separated into CMP, MEP and GMP based on the surface markers and profiled using MARS-seq. In this study, clustering analysis uncovered 6 main cell types, including erythrocytes, megakaryocytes, DC, basophil/eosinophil progenitors, monocytes, neutrophils. Here, our STREAM analysis identified a simple bifurcation, which corresponds to the differentiation of CMP cells to GMP cells and erythroid cells. More refined analysis (by changing the dimensions from 2 to 10) recovers all the six different committed cell branches defined by the original study (Supplementary Figure 6).

Supplementary Note 3: Comparison of STREAM with existing methods
First, we discuss the core algorithms and limitations of all method considered in comparison with STREAM. Monocle2 21 uses reversed graph embedding (by default DDRtree) to learn an explicit principal graph to describe the single-cell transcriptomic data. During the principal graph learning step, in each iteration, DDRtree moves cells to the nearest vertex, hence distorting the original configuration of cells in the manifold that may result in an uneven distribution with more cells close to vertices and fewer cells in between (Supplementary Figure 7,8). Mpath 22 first clusters cells and designates landmarks. Then, based on the landmarks a weighted neighborhood network is constructed and subsequently trimmed to obtain a state transition network. Mpath requires prior information (e.g. FACS sorting and time points) to designate landmarks and the final transition network is sensitive to the chosen landmarks. Wishbone 23 builds a kNN graph based on the most relevant diffusion components by using gene set enrichment analysis (GSEA), then a random sample of cells termed waypoints are used for ordering cells and branch identification based on the inconsistency between "waypoints". By default, Wishbone can only detect a single bifurcation (i.e. two cell fates). DPT 24 first reduces the dimensionality using a diffusion map approach. Then in the diffusion map space a random-walk-based distance is computed. Branching points are identified by comparing two DPT orderings over cells. SLICER 25 uses locally linear embedding (LLE) to reduce the dimension and then uses a KNN graph to order cells by the shortest path distance from root cell. Branches are detected based on geodesic entropy. All these three methods, i.e., Wishbone, DPT and SLICER cannot infer trajectories without specifying a start cell. In addition, they cannot explicitly show cellular trajectories with a specific topological structure. Instead they simply visualize cells using the first two components of the dimensionality reduction method adopted. scTDA 26 uses the Mapper algorithm 27 to build a topological representation. Briefly, a low dimensional space (obtained for example by multidimensional scaling (MDS)) is first divided into overlapping bins and cells within each bin are clustered in the original space. The topological structure is obtained by connecting clusters that share at least one cell. However, spurious edges may appear since an edge could be formed as long as two clusters have a non-empty intersection. In addition, the obtained topology may also depend on the binning strategy used. TSCAN 28 applies principal component analysis to a gene-cluster level expression matrix. Then top principal components (PCs) are selected to cluster cells and a minimum spanning tree (MST) is constructed based on the clustering solution. Finally, cells are projected and ordered on the tree structure. By default, TSCAN does not report branching events, in fact reports only the linear path with the largest number of clusters. SCUBA 29 first constructs a binary tree by iteratively clustering and refines the tree based on a penalized likelihood function. Then it models gene dynamics using bifurcation theory. This method can infer multiple bifurcation events however cannot model branching events with more than two cells fates. GPfates 30 uses OMGP (Overlapping Mixture of Gaussian Processes) to model the temporal dynamics. However, it requires a prespecified number of trends/trajectories. In addition, both SCUBA and GPfates requires temporal information as input to model the dynamics. SCUBA uses principal curve in tSNE space to infer cell order. GPfates uses GPLVM (Gaussian Process Latent Variable Model) to infer pseudotime. PHATE 31 is a visualization method that preserves well single-cell trajectory structure based on data diffusion but does not provide branch assignment and pseudotime information.
Next, to qualitatively compare different trajectory inference methods, we have considered several features summarized in Supplementary Table 1 and explained below: Structure refers to the hypothetical structures different methods are able to infer without modifying their original codes.
Deterministic refers to the ability of a method to consistently produce the same output if applied multiple times on the same input. For both SCUBA and Wishbone, the indeterminism is caused by the tSNE step since SCUBA learns pseudotime in the tSNE space and Wishbone visualizes the final trajectories in the tSNE plot. In scTDA, the topological analysis RootedGraph() returns different results on different runs. In GPfates, both dimensionality reduction (Bayesian GPLVM) and trajectory inference steps (Overlapping Mixture of Gaussian Processes (OMGP)) often do not converge to the same solution.
Ease of installation refers to the ability to install a method via a public package management system (Bioconda, pypi, CRAN, Bioconductor, etc.) or to use it through an accessible website (++) or through a local custom setup script that may need extra packages to be installed ('+').
Extra input indicates that extra information is required in order to perform trajectory inference.
Explicit trajectory visualization: refers to the ability of a method to provide explicitly a tree (or graph) skeleton based on a set of curves to describe the cellular developmental trajectories.
Single-cell-level trajectory visualization refers to the ability of a method to explicitly align single cells to the trajectories. Some methods provide instead only an abstract tree (or graph) where each node represent either the cluster size or the expression variance, however cells cannot be displayed individually.
Density level trajectory visualization refers to the ability of a method to display the cell density along a given trajectory.
Automatic marker gene detection indicates whether the tool supports automatic marker gene discovery for the different branches or paths.
Mapping feature refers to the ability to fix a reference structure and to map new cells to it without recomputing trajectories.
Tested on epigenomic data means that the method has been previously applied to epigenomic data.
End-to-end pipeline for scATAC-seq means that the method provides all the functions necessary to perform trajectory inference starting from raw count data. Next we describe in detail the comparison based on the synthetic dataset by Rizvi et al 26 , presented in the main text.
PHATE, a dimensionality reduction method, qualitatively preserves cellular trajectories but does not provide branch assignment information for cells. STREAM, scTDA, Monocle2, and Mpath can accurately reconstruct two bifurcation events. SLICER detects two bifurcation events, but the branching node is not accurately located. Wishbone instead can only detect one simple bifurcation even though the obtained 2D manifold clearly shows two bifurcations. DPT detects too many branches and their positions are difficult to assess from the proposed visualization. GPFates requires to prespecify the number of trajectories (referred to as trends in the original study and manually set for this dataset to 3). Although the generated curves initially follow the correct branches, they incorrectly converge at the end. SCUBA fails to detect bifurcation events in this dataset. We also noticed that scTDA, SCUBA, and Mpath do not provide single-cellresolution visualization. Wishbone, DPT and SLICER do not provide explicit trajectory visualization, so they cannot visualize both time points and recovered branch assignments in the same plot. Monocle2 provides an explicit trajectory visualization at single-cell resolution, however it doesn't provide information about cell density and the population composition.
To make our comparison analyses for both synthetic and real datasets reproducible, we also provide Jupyter notebooks documenting them (Supplementary Data 2). In addition, we summarize below some important details on the data and procedures used to run each method and if it was necessary to run them with non-standard parameters.
For all datasets, we kept all the cells during the comparison analyses. For the synthetic and qPCR datasets, we used all the genes (500 genes and 175 genes respectively). For the scRNA-seq dataset, to fairly assess the trajectory inference procedure and to mitigate the effect of different gene selection methods, we used the same set of variable genes (835 genes) selected using the function select_variable_genes() in STREAM with default parameters. By default, we log2transformed the gene expression matrices unless a method explicitly requires the raw count data. As mentioned above, we tried to follow the method documentation and used the default parameters as much as possible. GPFates requires time information to perform trajectory inference. The default pseudotime inference method (Bayesian GPLVM) that should be used initially to recover the time information does not work well when the selected genes are highly independent. As suggested by the author we used DPT to infer pseudotime before inferring trajectories with GPFates. For indeterministic methods, we randomly picked one result from several runs.

Supplementary Note 4: STREAM interactive website
In order to make STREAM user-friendly and accessible to non-bioinformatician, we have created an interactive website: http://stream.pinellolab.org. The website implements the main features of the command line version and provides interactive and exploratory panels to zoom and visualize single-cells on any given branch.
The website offers two functions: 1) to run STREAM on single-cell transcriptomic or epigenomic data provided by the users and 2) the first interactive database of precomputed trajectories with results for seven published datasets 1, 14,32,33,34,35,36 . The users can visualize and explore cells' developmental trajectories, subpopulations and their gene expression patterns at single-cell level. (Supplementary Figure 11).
The website can also run on a local machine using the provided Docker image we have created.
To get the Docker image, simply execute the following command: docker pull pinellolab/stream_web After this step, to run the website on the local machine execute the following command: docker run -p 10001:10001 pinellolab/stream_web After the execution of this command the user will have a local instance of the website that can be opened with any browser (Chrome, Firefox, Safari, etc.) and accessible at the URL: http://localhost:10001 As in the hosted version, using the local version of the STREAM webpage, users can compute and explore trajectories on the page 'Compute'. On this page the users can also optionally create and save a summary report in a .zip file. The obtained report can be shared with other users and/or visualized interactively and without recomputing trajectories on the page 'Visualize'.

Supplementary Note 5: STREAM bioconda package
In order to provide more control to advanced users, we have created also a Python package that can be used to perform fully customizable step-by-step analysis. This package can be easily installed using the Bioconda package channel 37 . The STREAM bioconda package can be installed following these three simple steps:

1) If Anaconda (or miniconda) is already installed with Python 3 skip to 2) otherwise download and install Python3
Anaconda from here: https://www.anaconda.com/download/ 2) Add the Bioconda channel with the following commands: conda config --add channels defaults conda config --add channels bioconda conda config --add channels conda-forge 3) Create an environment named 'stream_env', install stream, jupyter and activate it with the following commands: For single cell RNA-seq analysis: conda create -n stream_env python=3. 6  We provide all the scripts used for the analyses in this paper as Jupyter Notebooks (Supplementary Data 1). These notebooks can be used as tutorials since they nicely show how to use and combine the functions listed in this section and also illustrate how to set the different parameters to perform different analyses with STREAM.

Supplementary Note 6: STREAM command line interface
STREAM can be easily used also with a simple command line interface. It is possible to install the command line version of STREAM with Docker.

Installation with Docker
With Docker no installation is required, the only dependence is Docker itself. Docker can be downloaded freely from here: https://store.docker.com/search?offering=community&type=edition To get a local copy of STREAM execute the following command: docker pull pinellolab/stream

STREAM usage and example dataset
The main and required input file is a tab-separated gene expression matrix (raw counts or scaled) in. tsv format. Each row represents a unique gene and each column is one cell.
The following table shows the first 5 rows (genes) and 5 columns (cells) of the provided example dataset

Cell label color:
Customized colors to use for the different cell labels. The first column specifies cell labels and the second column specifies the color in the format of hex. No header is necessary:

Gene list:
It contains genes that users may be interested in visualizing in subway map and stream plot in addition to the genes detected by STREAM. Genes are listed in one column. No header is necessary: . . .

Feature genes:
It contains genes that the user can specify and that are used as features to infer trajectories. instead of using the automatic feature selection of STREAM. No header is necessary: To run STREAM, after the installation at the command-line interface execute:

$ docker run -v ${PWD}:/data -w /data pinellolab/stream --help [options]
Users can specify the following options: To visualize genes of interest, user can provide a gene list file by adding -g, for example: gene_list.tsv.gz. Meanwhile, by adding the flag -p, STREAM will use the precomputed file obtained from the first running (In this way, STREAM will import precomupted pkl file so the analysis will skip structure learning part and only execute the step of visualizing genes): Using Bioconda: $ stream -m data_Nestorowa.tsv.gz -l cell_label.tsv.gz -c cell_label_color.tsv.gz -g gene_list.tsv.gz -p Using Docker: $ docker run -v ${PWD}:/data -w /data pinellolab/stream -m data_Nestorowa.tsv.gz -l cell_label.tsv.gz -c cell_label_color.tsv.gz -g gene_list.tsv.gz -p Users can also provide a set of gene names separated by comma or specify the root by adding -r: Using Bioconda: $ stream -m data_Nestorowa.tsv.gz -l cell_label.tsv.gz -c cell_label_color.tsv.gz -g Gata1,Mpo -r S1 -p Using Docker: $ docker run -v ${PWD}:/data -w /data pinellolab/stream -m data_Nestorowa.tsv.gz -l cell_label.tsv.gz -c cell_label_color.tsv.gz -g Gata1,Mpo -r S1 -p To explore potential marker genes, it is possible to add the flags --DE, --TG, or --LG to detect DE (differentially expressed) genes, transition gens, and leaf genes respectively: Using Bioconda: Example of the mapping feature: To explore the feature mapping, users need to provide two datasets, one is used for inferring trajectories. The other is the dataset that is going to be mapped to the inferred trajectories. Here we take data_Olsson.tsv.gz, data_perturbation.tsv (Olsson, A. et al.,2016) as an example. We assume that all the datasets are in the current folder.

Example with scATAC-seq data:
To perform scATAC-seq trajectory inference analysis, three files are necessary, a .tsv file of counts in compressed sparse format, a sample file in .tsv format and a region file in .bed format: 1. Count file: a tab-delimited compressed matrix in sparse format (column-oriented). It contains three columns. The first column specifies the rows indices (the regions) for non-zero entry. The second column specifies the columns indices (the sample) for non-zero entry. The last column contains the number of reads in a given region for a particular cell. No header is necessary: To perform scATAC-seq trajectory inference analysis, three files are necessary, a .tsv file of counts in compressed sparse format, a sample file in .tsv format and a region file in .bed format. (Buenrostro, J.D. et al., 2018). We assume that they are in the current folder.
Using these three files, users can run stream_atac with the following command to preprocess sc-atac-seq data and get a z_score matrix file named 'zscore.tsv.gz' (This step may take a couple of hours with a modest machine): Using Bioconda: $ stream_atac -c count_file.tsv.gz -s sample_file.tsv.gz -r region_file.bed.gz Then, take z-score file as input to infer trajectories using stream: Using Bioconda: $ stream --atac -m zscore.tsv.gz -l cell_label.tsv.gz -c cell_label_color.tsv.gz --lle_components 4 Using Docker: $ docker run -v ${PWD}:/data -w /data pinellolab/stream --atac -m zscore.tsv.gzl cell_label.tsv.gz -c cell_label_color.tsv.gz --lle_components 4 Output description STREAM write all the results by default in the folder stream_result, unless a different directory is specified by the user with the flag -o. This folder mainly contains the following files and directories: • std_vs_means.pdf: selected most variable genes. • dimension_reduction.pdf: projected cells in the MLLE 3D space.
• seed_elastic_principal_graph_skeleton.pdf: the initial structure skeleton with all the nodes and edges. • seed_elastic_principal_graph.pdf: the initial structure with cells. • ElPiGraph_analysis.pdf: the log of ElPiGraph strucuture learning. • elastic_principal_graph_skeleton.pdf: the elastic principal graph skeleton.
• elastic_principal_graph.pdf: the elastic principal graph with cells. • optimizing_elastic_principal_graph_skeleton.pdf: the elastic principal graph skeleton after optimizing branching. • optimizing_elastic_principal_graph.pdf: the elastic principal graph with cells after optimizing branching.
• extending_elastic_principal_graph_skeleton.pdf: the elastic principal graph with cells after extending leaf nodes. • extending_elastic_principal_graph.pdf: the elastic principal graph skeleton after extending leaf nodes.
• flat_tree.pdf: flat tree plot. • cell_info.tsv: cell information file containing branch assignment id and pseudotime. • stream_result.pkl: stores anndata object from the analysis. It can be imported later to reproduce the whole analysis. • sub-folder 'transition_genes' contains several files, one for each branch id, for example for (S0,S1): o transition_genes_S0_S1.pdf: Detected transition genes plot for branch S0_S1. Orange bars are genes whose expression values increase from state S0 to S1 and green bars are genes whose expression values decrease from S0 to S1 o transition_genes_S0_S1.tsv:  Data', and 'Help'. In the 'Visualize'' page, users can visualize single cell trajectories from several published datasets or import the STREAM report zip file from their own STREAM analysis. In the 'Compute' page users can infer and visualize trajectories using STREAM using their own scRNA-seq or scATAC-seq datasets. In the 'Help' page, users can easily learn how to use STREAM website with the assistance of a detailed guide.