Dissecting endothelial to haematopoietic stem cell transition by single-cell transcriptomic and functional analyses

Haematopoietic stem cells (HSCs) in adults are believed to be born from hemogenic endothelial cells (HECs) in mid-gestational mouse embryos. Due to rare and transient nature, the HSC-competent ECs have never been stringently identified and accurately captured, let alone their genuine vasculature precursors. Here, we firstly used high-precision single-cell transcriptomics to unbiasedly examine relevant EC populations at continuous developmental stages and transcriptomically identified putative HSC-primed HECs. Combining computational prediction and in vivo functional validation, we precisely captured HSC-competent HECs by newly constructed Neurl3-EGFP reporter mouse model, and realized enrichment further by surface marker combination. Surprisingly, endothelial-haematopoietic bi-potential was rarely but reliably witnessed in culture of single HECs. Noteworthy, primitive vascular ECs experienced two-step fate choices to become HSC-primed HECs, resolving several previously observed contradictions. Taken together, comprehensive understanding of endothelial evolutions and molecular programs underlying HSC-primed HEC specification in vivo will facilitate future investigations directing HSC production in vitro.


INTRODUCTION 44
The adult haematopoietic system, consisted mainly of haematopoietic stem cells (HSCs) and 45 their multi-lineage progenies, is believed to be derived from hemogenic endothelial cells 46 (HECs) in mid-gestational embryos 1,2 . It is generally accepted that while still embedded in the 47 endothelial layer and presenting endothelial characteristics, HECs begin to express key 48 hemogenic transcription factor Runx1 and have hemogenic potential 3,4 . Different from 49 haematopoietic progenitors, HECs lack the expression of haematopoietic surface markers, 50 such as CD41 and CD45, which mark the population capable of generating haematopoietic 51 progenies when directly tested in colony-forming unit assays 3,5 . Haematopoietic stem and 52 progenitor cells (HSPCs) are visualized to emerge from aortic endothelial cells (ECs) via a 53 transient and dynamic process called endothelial-to-haematopoietic transition to form 54 intra-aortic haematopoietic clusters (IAHCs) 6-10 . Being located within IAHCs or to the deeper 55 sub-endothelial layers, pre-HSCs serve as the important cellular intermediates between 56 HECs and HSCs, featured by their inducible repopulating capacity and priming with 57 haematopoietic surface markers [11][12][13][14][15] . The specification of HSC-primed HECs is the initial and 58 one of the most pivotal steps for vascular ECs to choose a HSC fate. However, as the precise 59 Trajectory analysis by Monocle 2 suggested that along the arterial maturation path from 175 earlyAEC towards lateAEC, HEC was segregated out from earlyAEC at E9.5-E10.0 (Fig. 1g). 176 The gradual up-regulation of hemogenic genes, including Runx1 and Spi1, was accompanied 177 by the gradual down-regulation of both endothelial and arterial genes along the HEC 178

Enrichment of the HSC-competent HECs by newly established Neurl3-EGFP reporter 286
In an effort to search for single markers to distinguish HSC-primed HECs from non-HECs or 287 those CD45 -CD43haematopoietic cells sharing an endothelial immunophenotype, we 288 computationally screened for the genes significantly overrepresented in HEC cluster as 289 compared to each of the other four clusters, including one haematopoietic cluster (HC) and 290 three vascular EC clusters (vEC, earlyAEC and lateAEC) (Supplementary information, Fig.  291 S1f). Totally eleven genes were screened out, which were then designated as signature 292 genes of HSC-primed HEC, including three TFs (Mycn, Hlf and Gfi1) but no cell surface 293 markers ( Fig. 4a; Supplementary information, Table S5). Most of them manifested similarly 294 high expression in T1 pre-HSCs, with six of them, namely Neurl3, Dnmt3b,Mycn,Hlf,Gfi1,295 and Gck, belonged to pre-HSC signature genes 11 (Fig. 4a). 296

297
To further validate the bioinformatics findings and precisely determine the localization of 298 these HSC-primed HECs, we specially chose Neurl3 to establish a fluorescence reporter 299 mouse line with the mind of possessing enough sensitivity, as the median expression of 300 Neurl3 was the highest among these signature genes ( Fig. 4a; Supplementary information, 301 Table S5). By CRISPR/Cas9-mediated gene knockin strategy, the EGFP was inserted into 302 the translational initiation codon of mouse Neurl3 gene to ensure that EGFP would be 303 expressed in exactly the same way as Neurl3 (Fig. 4b). We firstly evaluated the Neurl3-EGFP 304 expression by flow cytometric analysis (Fig. 4c). At E10.0 AGM region, about half of the Given the lacking of suitable antibodies to directly determine the anatomical distribution of 350 PK44 cells, which have been proven as HSC-competent HECs (Fig. 2c-e), we compared the 351 immunophenotype of PK44 and IAHC cells, known as CD31 + Kit high 7 , regarding their 352 relationship with Neurl3-EGFP expression whose localization was clearly defined (Fig. 5a). 353 Both of them were mainly Neurl3-EGFP + , with most CD31 + Kit high cells being 354 CD41/CD43/CD45-positive haematopoietic cells as previously reported 7 (Fig. 5b) As less than half of Neurl3-EGFP + ECs were PK44 cells (Fig. 5b), we next explored the in 364 vitro functional relationship of PK44 and non-PK44 fractions within CD44 + Neurl3-EGFP + ECs 365 by index-sorting. From E9.5 to E10.5, all three kinds of potential, including endothelial-only, 366 haematopoietic-only, and endothelial-haematopoietic bi-potential, could be detected in 367 CD44 + Neurl3-EGFP + ECs, with different frequencies Supplementary information,368 Fig. S3f). In E10.5, the potential was remarkably biased to endothelial as compared to E9.5 369 and E10.0 (Fig. 5d), which should be due to the prompt loss of Neurl3-EGFP-labeled HECs 370 and the possible labeling of some lateAECs by Neurl3-EGFP (Fig. 4a, h). Of note, all three kinds of potential were obviously higher in PK44 than non-PK44 fraction, with 372 endothelial-haematopoietic bi-potential exclusively detected in PK44 cells (Fig. 5d). 373 Therefore, PK44 represented the enriched functional sub-populations within Neurl3-EGFP + 374 HECs. All three kinds of potential did not show an evidently biased distribution regarding 375 CD44 or Neurl3-EGFP expression level by index sorting analysis (Fig. 5e). Interestingly, cells 376 with the haematopoietic rather than endothelial potential intended to have smaller side scatter 377 density on FACS (Fig. 5e). 378 379

Stepwise fate choices of HSC-primed HECs from primitive vascular ECs 380
In an effort to decipher the stepwise specification of the HSC-primed HECs, we added the 381 immunophenotypic EC samples, from the stage of initial aortic structure formation at E8.0 33 382 to E9.0, to achieve seamless sampling with continuous developmental stages 383 (Supplementary information, Fig. S4a). All the transcriptomically identified ECs were 384 re-clustered into six clusters, with four of them basically consistent with those previously 385 defined, namely vEC, earlyAEC, lateAEC, and HEC. The newly added samples were mainly 386 distributed into three clusters, vEC, primitive EC (pEC) featured by Etv2 expression and 387 potential of the HSC-competent HECs reached the peak at the time point about 0.5 days 482 before the first HSC emergence, and rapidly decreased thereafter ( Fig. 5d; Supplementary  483 information, Fig. S1f and S3f). Interestingly, the endothelial-haematopoietic bi-potential was 484 still maintained until T1 pre-HSC stage at E11.0 (Fig. 3i), when cells have begun to express 485 haematopoietic surface markers (Fig. 3b) and turned the shape into round 15 . The data 486 suggest that the haematopoietic fate might not have been fixed in T1 pre-HSC, which needs 487 further investigations. 488

489
We also precisely decoded the developmental path of HSC-primed HECs from the initially 490 specified vascular ECs, the view of which has been generally neglected previously. We found 491 that the genes and pathways involved in arterial development and Notch signaling were firstly 492 increased and then decreased once upon HEC specification (Fig. 6e, g). Supportively, 493 several seemingly contradictory findings have been reported regarding the role of Notch 494 signaling in HEC specification. For example, activation of arterial program or Notch signaling 495 is known to be required for HEC specification in mouse embryos or generation of HECs with 496 lymphoid potential from human pluripotent stem cells [44][45][46] . On the other hand, repression of 497 arterial genes in EC after arterial fate acquisition leads to augmented haematopoietic output 498 47 . Noteworthy, we revealed two bifurcates during HSC-primed HEC specification along the 499 path from primitive vascular EC, suggesting two-step fate choice occurred for hemogenic fate 500 settling (Fig. 6a). Serving as the two presumed final fates of earlyAEC, HEC and lateAEC 501 displayed a series of differences (Fig. 1i), which better explains the presumably 502 misinterpreted notion in previous report that arterial ECs and HSCs originate from distinct 503 precursors 18 . Our findings further emphasize that arterial specification and Notch signaling 504 should be precisely and stepwise controlled for HSC generation. Although both showing 505 obvious similarity regarding the arterial feature and anatomical distribution, the difference 506 between earlyAEC and lateAEC should also be paid attention to as the former but not the 507 latter is the direct origin of the HSC-primed HECs. 508

509
It is generally accepted that haematopoietic cells in the IAHCs are proliferative 48,49 , within 510 which pre-HSCs are mainly involved 14,49 . Supportively, enriched functional T1 pre-HSCs 511 manifested a relatively proliferative status 11 . On the other hand, slow cycling is witnessed at 512 the base of IAHCs 49 , and it is suggested that exit from cell cycle is necessary for HEC 513 development and endothelial-to-haematopoietic transition 44,50 . Nevertheless, based on the 514 precise recognition of the HSC-primed HECs here, we showed that proliferation was 515 gradually decreased upon arterial specification and maturation, whereas re-activated once 516 the arterial ECs chose a hemogenic fate featured by the simultaneous Runx1 expression (Fig.  517 6e-g). The functional requirement of cell cycle control for the specification of the HSC-primed 518 HECs needs to be investigated, which would depend on the initiating cell populations. 519 520 We also revealed several similarities regarding the molecular events underlying the 521 except that the concentration of staining solution was 5 mol/L and the time of staining was 3 545 minutes before washed. Primary embryonic single-cell suspension was acquired by type I 546 collagenase digestion. 547 548

592
To investigate the HSC potential of the CD41 -CD43 -CD45 -CD31 + CD44 + Neurl3-EGFP + 593 population in E10.0 caudal half, male Neurl3 EGFP/+ reporter mice (CD45.2/2 background) were transplantation strategy were same as mentioned above except that the recipients were 596 female 8-12 weeks CD45.1/2 mice and the carrier cells were obtained from CD45.1/1 mice. 597 598 Immunofluorescence 599 Embryos were isolated, fixed with 4% paraformaldehyde for 30 minutes to 2 hours at 4°C, 600 embedded in paraffin, and sectioned at 5-6 μ m with Leica RM2235. Sections were 601 deparaffinized with ethanol of gradient concentration, then blocked in blocking solution 602 (Zhongshan golden bridge) for 30 minutes at room temperature, followed by incubation with 603 primary antibodies overnight at 4°C. After 3 washes (3 minutes each) in PBS, sections were 604 incubated with corresponding secondary antibodies (Zhongshan golden bridge) for 30 605 minutes at room temperature. After 3 washes in PBS, sections were stained with 606 DendronFluor TSA (Histova, NEON 4-color IHC Kit for FFPE, NEFP450, 1:100, 20-60-sec). 607 The primary and secondary antibodies were thoroughly eluted by heating the slides in citrate 608 buffer (pH 6.0) for 10 minutes at 95°C using microwave. In a serial fashion, each antigen was 609 labeled by distinct fluorophores. After all the antibodies were detected sequentially, the slices 610 were finally stained with DAPI. Images were collected by confocal microscope (Nikon Ti-E 611 for 10-30 minutes, and hydrogen peroxide was added to 0.03%. The enzymatic reaction was 627 allowed to proceed until the desired color intensity was reached, and the samples were rinsed 628 3 times in PBST. Finally, the samples were dehydrated in 100% methanol and soaked in 629 sample-specific barcode whereas N standing for unique molecular identifiers, UMI, see Table  638 S7) and TSO primer for template switching 56 57 58 . After reverse transcription and second-strand cDNA synthesis, the cDNAs were amplified by 17 cycles of PCR using ISPCR 640 primer and 3' Anchor primer (see Table S7). Up to 56 samples were pooled and purified using 641 Agencourt AMPure XP beads (Beckman). 4 cycles of PCR were performed to introduce index 642 sequence (see Table S7). After this step, 400 ng cDNAs were fragmented to around 300 bp 643 by covaris S2. The cDNA was incubated with Dynabeads MyOne TM Streptavidin C1 beads 644 (Thermo Fisher) for 1 hour at room temperature. Libraries were generated using KAPA Hyper 645 Prep Kit (Kapa Biosystems). After adaptor ligation, the libraries were amplified by 7 cycles of 646 PCR using QP2 primer and short universal primer (see Table S7). The libraries were 647 sequenced on Illumina HiSeq 4000 platform in 150bp pair-ended manner (sequenced by 648 Novogene). 649 650

Quantification of gene expression for scRNA-seq data 651
We used unique molecular identifier (UMI)-based scRNA-seq method to measure the gene 652 expression profiles within individual cells. Raw reads were firstly split by specific barcode 653 attached in Read 2 for individual cells and UMI information was aligned to the corresponding 654 Read 1. Read 1 was trimmed to remove the template switch oligo (TSO) sequence and polyA 655 tail sequence. Subsequently, quality control was conducted to discard reads with adapter 656 contaminants or low-quality bases (N > 10%). Next, the mm10 mouse transcriptome (UCSC) 657 was used to align the clean reads using TopHat (version 2.0.12) 59 . Uniquely mapped reads 658 were obtained using HTSeq package 60 and grouped by the cell-specific barcodes. 659 Transcripts of each gene were deduplicated based on the UMI information, while 660 mitochondrial genes were not included for quantification. Finally, for each gene in each 661 individual cell, the number of the distinct UMIs derived from that gene was regarded as its 662 copy number of transcripts. control criteria and normalization method described above were applied to these additional standards and were used for downstream analyses (see Table S1). 684 685

Dimensional reduction and clustering 686
We used Seurat R package 61 (version 2.3.4) for further analyses and exploration of our 687 single cell RNA sequencing data, such as identification of highly variable genes (HVGs) and 688 differentially expressed genes (DEGs), dimension reduction using PCA or t-SNE, 689 unsupervised clustering and so on. A standard analysis process is briefly described below. 690 First, only genes expressed in at least 3 single cells were retained so as to exclude genes 691 that were hardly expressed. Then, FindVariableGenes function was used to select HVGs on 692 log2 (TPM/10+1) transformed expression values. Genes with average expression more than 693 1 and less than 8 and dispersion greater than 1 were identified as HVGs. To mitigate the 694 effect of cell cycle, HVGs not included in the direct cell cycle GO term (GO:0007049) ( Table  695 S7) were used as inputs for PCA dimension reduction. Elbow method was employed to select 696 the top relevant PCs for subsequent t-SNE dimension reduction and graph-based clustering 697

. 698 699
For the initial dataset from E9.5-E11.0 body and DA locations, we select top 15 PCs for 700 clustering using FindClusters with default settings, to obtain 6 major clusters. Negative 701 control cells with a non-EC immunophenotype and cells grouped with these negative control 702 cells were reclassified specifically into the Neg cluster. The remaining cells were assigned as 703 vEC, earlyAEC, lateAEC, HEC and HC clusters based on the clustering results. Next, cells in the filtered initial dataset was used for analyses of subdatasets, including subdividing of HEC 706 cluster, subdividing of eaAEC cluster and in-depth analyses of earlyAEC, lateAEC and HEC 707 clusters. The filtered initial dataset was also included in three combined datasets of 708 combining PK44 cell population, PK44 and T1 pre-HSC cell populations, and PK44 and 709 Neurl3-EGFP cell populations, respectively. Dimension reduction and clustering analyses for 710 subdatasets and combined datasets abovementioned also followed the same procedure as 711 described above. See Table S1 for detailed cell information. For cell cycle analysis, cell cycle-related genes consisting of a previously defined core set of 747 43 G1/S genes and 54 G2/M genes were used 58,65 (see Table S7 for detailed gene lists). We 748 used a way similar to Tirosh, et al. 66 to classify the cycling phases of the cells. We calculated 749 the average expression of each gene set as corresponding scores, and manually assigned 750 cells to approximate cell cycle phases based on the scores. Namely, cells with G1/S score < 751 2 and G2/M score < 2 were assigned as 'quiescent', otherwise 'proliferative'. Among 752 proliferative cells, those with G2/M score > G1/S score were assigned as 'G2/M', and those 753 with G1/S score > G2/M score were assigned as 'G1' when G2/M score < 2, or as 'S' when 754

Constructing single cell trajectories 757
Monocle 2 67 (version 2.6.4) and Mpath 68 (version 1.0) were adopted to infer the development 758 trajectory of selected cell populations. Monocle 2 can construct single-cell trajectories and 759 place each cell at its proper position in the trajectory, even a "branched" trajectory 760 corresponding to cellular "decisions". We followed the official vignette with recommended 761 parameters. Briefly, UMI count data of given cell populations was used as input and genes 762 with more than 1.5 times of fitted dispersion evaluated using dispersionTable function were 763 identified as HVGs. To reduce the influence of cell cycle effect, HVGs not included in the 764 direct cell cycle GO term (GO:0007049) were retained as ordering genes for the subsequent 765

Patterns of DEGs among multiple clusters 774
In the case of identification of gene patterns in more than two clusters, analysis of variance 775 followed by Tukey's HSD test for pairwise comparison was adopted to identify DEGs (genes 776 with adjusted P value < 0.05 and fold change > 2 or < 0.5). For identification of patterns in included in these patterns were combined as input for construction of "signed hybrid" 785 weighted gene co-expression network analysis using WGCNA 69 . Next, we used 0.01 as 786 adjacency threshold for including edges in the output to export network, which was then 787 imported into Cytoscape 70 for visualization. We also calculated Pearson correlation coefficient 788 between each transcription factor and the pattern it belongs to. 789 790 For identification of patterns in pEC, pAEC, earlyAEC and HEC clusters, all 2,851 DEGs 791 among them were retained. We used ConsensusClusterPlus function with k-means algorithm 792 on top 500 DEGs to achieve five stable clusters. Then all DEGs were reassigned into one of 793 the five patterns according to which pattern had maximum average Pearson correlation 794 coefficient with a given DEG. Note that we used a downsampled dataset in the visualization 795 related to the five patterns in order to show more detailed changes along the development 796 trajectory. Sixty cells were randomly sampled from pEC and pAEC clusters, respectively. 797 798

Identification of HEC signature genes 799
Firstly, we compared HEC to every cluster to get the overrepresented genes, which were 800 upregulated across each of the other 4 clusters (vEC, earlyAEC, lateAEC and HC) within 801 filtered initial dataset. To make sure the accuracy of HEC overrepresented genes, we used 802 both wilcox and roc method to perform the DEG analysis. Only the DEGs identified by both 803 methods were regarded as HEC overrepresented genes. Finally, 25 cluster HEC 804 overrepresented genes were retained. In order to not only identify the endothelium with 805 hemogenic potential, but also discriminate those HSC-primed hemogenic ECs from yolk 806 sac-derived early haematopoietic populations such as erythro-myeloid progenitors, genes 807 highly expressed in erythro-myeloid progenitors (Gsta4, Spi1, Alox5ap and Myb) as reported 808 71 72 were excluded. In addition, genes not highly expressed (log 2 (TMP/10+1) < 2) in HEC or 809 highly expressed (log 2 (TPM/10+1) > 2) in every clusters were also excluded. Finally, eleven 810 HEC overrepresented genes were retained as HEC signature genes. 811 812 SCENIC analysis 813 SCENIC 73 could reconstruct gene regulatory networks from single-cell RNA-seq data based 814 on co-expression and DNA motif analysis. Here, we used SCENIC R package (version 40 1.1.1-9) to identify refined regulons, each of which represented a regulatory network that 816 connects a core TF with its target genes. We followed the "Running SCENIC" vignette in the 817 R package with default settings. We identified 507 unique regulons, among which 75 818 regulons significantly overlapped with the 2851 significantly changed genes were retained. 819 Fisher's exact test was employed for estimate the statistical significance of their overlaps. 820 The 75 core TFs were considered as putative driving force to orchestrate the dynamic 821 molecular program during HEC specification, given the simultaneous co-expression of the 822 core TF and its predicted targets in a given regulon. 823

Gene set variation analysis 825
Through gene set variation analysis, gene-level expression profiles could be transformed into 826 pathway-level enrichment score profiles using GSVA R package 74 coupled with KEGG 827 pathways 75 . We used ssgsea method 76 to estimate gene-set enrichment scores of each cell. 828 Two-sample Wilcoxon test was employed to find differentially enriched pathways between 829 involved clusters. Adjusted P value < 0.05 was considered statistically significant.  Table S7 for the detailed gene lists. 835 All statistical analyses were conducted in R version 3.4.3. Two-sample Wilcoxon Rank Sum 838 test was employed for comparisons of gene numbers, transcript counts, or gene expression 839 levels between two clusters of cells. We referred to statistically significant as P < 0.05 (if not 840 specified). Network enrichment analyses and gene ontology biological process enrichment 841 analyses were performed using Metascape 79 (http://metascape.org) and clusterProfiler 80 , 842 respectively. 843 844

Data and Code Availability 845
The scRNA-seq data has been deposited in the NCBI Gene Expression Omnibus, the 846 accession number for the data is pending. Code is available on reasonable request. 847            Hlf