Finding cell-specific expression patterns in the early Ciona embryo with single-cell RNA-seq

Single-cell RNA-seq has been established as a reliable and accessible technique enabling new types of analyses, such as identifying cell types and studying spatial and temporal gene expression variation and change at single-cell resolution. Recently, single-cell RNA-seq has been applied to developing embryos, which offers great potential for finding and characterising genes controlling the course of development along with their expression patterns. In this study, we applied single-cell RNA-seq to the 16-cell stage of the Ciona embryo, a marine chordate and performed a computational search for cell-specific gene expression patterns. We recovered many known expression patterns from our single-cell RNA-seq data and despite extensive previous screens, we succeeded in finding new cell-specific patterns, which we validated by in situ and single-cell qPCR.

www.nature.com/scientificreports www.nature.com/scientificreports/ Assessment of technical variability and reproducibility. Our results show limited technical variation within each batch: the expression levels in different cell types from the same embryo are well correlated for embryos 2, 3 and 4. They are, in fact, more similar to each other than the same cell types are across different individuals ( Fig. 2a-b;→, Supplementary Fig. 1). Although we cannot separate out the sources of cross-embryo variation, this result is consistent with a previous report showing that maternal mRNA levels vary significantly between unfertilized eggs from different individuals 28 . It is also worth noting that very little of the variation between embryos is from the sequencing run. This is consistent with previous results showing high correlation between expression measurements from tens of millions of reads per cell and those from lower coverage of a million or fewer reads 29,30 .
This embryo batch effect is further demonstrated by a Principal Components Analysis (Fig. 2b), which shows a similar result with the cell types of embryos 2, 3 and 4 being close to each on the first two components (which explain 56% of the variance) and the cell types of embryo 1 being more spread out. Embryo 1 belonged to a healthy developmental batch (Supplementary Table 2) and we suggest that its outlier status is due to technical variability introduced during library preparation, specifically cDNA synthesis and amplification.
The close clustering of cells from the same embryo, as well as their high correlation, suggests that our technical variability is low, leading to reproducibility within each batch (or embryo). A confirmation of the reproducibility of our results is the tight distribution of genes detected across samples within embryos ( Supplementary Fig. 1). Genes were considered detected when the measured count was greater than zero. These results show that slightly more genes were detected on HiSeq than MiSeq, but that the median difference for each embryo is less than 10%. This can be compared with a previous result showing a reduction of genes detected of around 39% when lowering sequence coverage to less than a million reads per cell 29 . As before, embryo 1 showed more variability across samples than the other embryos.
Further, our data can be compared to previously published data for the 16-cell stage that was generated using gene expression microarrays from many pooled single cells 28 . We found good agreement with our scRNA-seq expression patterns for the key genes shown in their paper (Fig. 2c).

Normalisation of counts.
After producing a dataset of counts, we normalised our data. Since we were interested in gene expression differences between different cell types, we did not normalize across genes (such as by GC content or transcript length), but only for sequencing depth by dividing by the total number of reads per sample. More complex normalisation steps could be considered 31 , but this simple approach suffices for our data. It gives a natural measure of expression for each gene, namely the proportion it contributes to the total, which we assume is independent of the total number of reads.
We then made use of a suitable transformation of proportions, the arcsine of the square root (referred to here as the ϕ transformation), namely: where k ij is the count for the i th gene and j th sample and N j is the total number of counts for the j th sample. The difference between ϕ i of two samples can be interpreted as an effect size index for proportions, namely Cohen's h 32 . In practice, the arcsine function in the ϕ transformation is largely redundant because most genes are expressed at low proportions. At these low values, the arcsine transformation is close to identity, meaning that a square root transformation of proportions performs equivalently in many cases.
Finding putative gene expression patterns. We then took a simple and effective approach to find gene expression patterns. Instead of grouping cells into a single set of clusters, we clustered the gene expression pattern of each gene separately (Fig. 3). For each gene, we grouped the different cell types into two classes of ON and OFF expression. We took advantage of our known cell types and calculated the Euclidean distance between vectors of replicate measurements for each cell and performed single-linkage hierarchical clustering of these (bottom-up, agglomerative clustering of the cells). The resulting two top-level clusters determined the ON/OFF pattern of each cell type for each gene (Fig. 3).
We then ranked these results, by taking an approach that does not require parametric estimation of variation or dispersion. We calculated our cluster reliability score as the difference between the first quartile of the ON cluster and the third quartile of the OFF cluster, which we call the Transquartile Range (TQR). The TQR is larger when the difference in cluster means is larger, but it penalizes higher variation for a given difference in means. Ranking cell-specific gene expression patterns. We applied this approach and produced a list of ranked genes and their expression patterns. We examined our results for 77 genes (Supplementary Table 1) with known in situ patterns 28,33-35 . Our top 35 results matched the known in situ patterns ( Supplementary Fig. 2a), except for KH.L152.12, which is validated below as a new pattern by in situ and qPCR (see below). However, the lower ranked results did not correspond well ( Supplementary Fig. 2b).
There were a few reasons for this. For example, clustering did not produce a correct pattern for Lefty ( Supplementary Fig. 2b). The cell types, A5.2 and B5.1, are normally considered part of the Lefty pattern, but they have intermediate expression in our scRNA-seq results and, in this instance, the clustering algorithm places them in the OFF cluster. For a few other genes, e.g. DPOZ (KH.C12.589) and Dlx.b, the clustering is correct, but the TQR is low. For a few other genes, no reads were mapped, e.g. Sox7/17/18, Fringe 2 and KH.C13.22, but in most cases where our scRNA-seq data does not agree with published in situ patterns, our expression measurements were low or relatively uniform across the eight cells and thus the algorithm functions correctly in attributing lower score to these results. Many of these genes are expressed maternally in the Ciona embryo and are de-adenylated during the maternal-to-zygotic transition 36,37 ; thus they might not be easily detected by our RNA-seq protocol, which amplifies from the poly(A) tail of mRNA. In contrast, in situ hybridisation can detect localised maternal signal in the cytoplasm, which could explain the discrepancy. It is also possible that some in situ results are false positives.
Validating cell-specific gene expression patterns. Using these known in situ results, we assessed that the top 40 ranked results from all genes were likely to be reliable and focused on these for further validation (Fig. 4). We found 12 distinct patterns in the top 40, which included all known patterns as well as three potentially new patterns (highlighted in orange in Fig. 4). Ten patterns are currently known at this stage in Ciona 28,33,[38][39][40][41][42] , and although we matched only nine of these directly (Fig. 1b), the tenth pattern has a single known case, Tfap2-r.b (AP-2-like2), which we did pick up, but without expression in A5.2. This agrees with previous observations that www.nature.com/scientificreports www.nature.com/scientificreports/ expression is not consistent in this cell across embryos 28,33 . Further, it is in agreement with the average over many embryos as measured by microarray 28 .
We validated one of the potentially new patterns, namely the pattern for KH.L152.12, by in situ in biological duplicates and single-cell qPCR in triplicate (Fig. 5e). This observation does not necessarily negate previously published ones for KH.L152.12, but rather highlights the value of our algorithm in identifying new patterns from scRNA-seq data comprising just four embryos. We could not validate the second pattern for KH.C1.933 that has relatively ubiquitous expression, although with apparently reduced expression in b5. 4.
In addition to recovering all known patterns and validating one novel pattern, we found at least 28 genes with known in situ patterns. Of the 12 patterns we found, the pattern with expression in the B5.2 cell only was the most represented. This is also the most frequent pattern in known in situ patterns, i.e. postplasmic/PEM RNAs 35 . The majority of our results for B5.2 are confirmed by previous in situ datasets, but we identified new B5.2-specific genes, such as KH.C13.98 and KH.C12.212, confirming their expression by in situ and single-cell qPCR (Fig. 5a). www.nature.com/scientificreports www.nature.com/scientificreports/ We also validated other classes of uncharacterized genes, namely KH.S1497.1, which expresses specifically in the animal hemisphere, and KH.C11.529 on the anterior side ( Fig. 5c-d). These results are particularly striking, since it was expected that no further zygotic gene expression patterns would be found at this stage in Ciona 23 .
We also looked more widely in the top 60 ( Fig. 4, Supplementary Fig. 3) and validated additional genes, KH.C8.450, KH.C9.289 and KH.C4.260, by single-cell qPCR and in situ hybridisation ( Fig. 5a-b). The first is another example of B5.2 expression, whereas the last two genes are expressed in all cells except B5.2, a pattern known previously from Hes.a 43 . While developing our approach, we also applied the TQR to the microarray data of this stage 28 and looked for reliable B5.2 cell-specific expression using both datasets. As a result, we also found and validated KH.L60.2 and KH.C14.501 (ranked 88 and 2403 in our data; see Supplementary Table 7).
In conclusion, we have recovered many known patterns, as well as patterns and genes that had not been detected previously despite extensive in situ screens. These results open up opportunities for further research into developmental patterning in Ciona. In addition, we have demonstrated that single-cell RNA-seq is a viable alternative to extensive in situ screens, offering a promising approach for finding genes with cell-specific expression in less well-studied organisms. Table 2). Early ascidian embryos are thought to be bilaterally symmetrical, but to avoid any potential bilateral variation in our small sample, we collected eight cells from the right side of each embryo. The cells were collected individually in batches of eight cells from the same embryo on the same day, with sequencing libraries prepared in parallel, barcoded and then sequenced together. This means that biological variation between embryos and technical variation between batches cannot be distinguished. The advantage of this design is that it minimizes technical variation between cell types of the same embryo and controls for confounding technical and biological variation between embryos. Averaging across the cell types of different batches reduces this unwanted variation, maintaining cell-specific variation. Our results show that cells from the same embryo are more similar to each other than the same cell types are across individuals, with a similar number of genes detected per cell type ( Fig. 2a-b, Supplementary Fig. 1).

KH.C8.450
Animal Vegetal  www.nature.com/scientificreports www.nature.com/scientificreports/ Preparation of Ciona embryos. Ciona intestinalis type A, recently designated Ciona robusta 44,45 , adults were obtained from Maizuru Fisheries Research Station (Kyoto University) and Misaki Marine Biological station (The University of Tokyo) under the National Bio-Resource Project for Ciona. They were maintained in an aquarium in our laboratory at Okinawa Institute of Science and Technology Graduate University under constant light (Calcitrans, Nisshin Marinetech Co., Ltd.) for three days apart from a few hours of darkness a day with feeding to induce spawning of the old eggs. After this, the Ciona were maintained under constant light to induce oocyte maturation. Eggs and sperm were obtained surgically from the gonoducts. Embryos were dechorionated after insemination using a solution of 0.07% actinase and 1.3% sodium thioglycolate. Eggs were reared to reach the 16-cell stage in Millipore-filtered seawater (MFSW) at about 18 °C. Embryos from each insemination batch were kept to check the ratio that developed into morphologically normal tailbud. We only used embryos from batches where more than 70% developed normally to tailbud (10 hours post fertilization at 18 °C) (see Supplementary  Table 2 for embryo batch information). NEB Next ® ChIP-Seq Library Prep Master Mix Set for Illumina ® (E6240, NEB) was applied to sheared cDNA for preparation of the library for the Illumina platform. NEBNext ® Multiplex Oligos for Illumina (E7335, E7500, NEB Next Multiplex Oligos for Illumina, NEB) were combined to introduce an index and adaptor to the double-stranded DNA. After extraction of the 300 bp fraction of adaptor-ligated DNA by E-Gel Size Select 2% Agarose (G661002, Invitrogen), DNA was amplified with individual index primers using PCR with 19 cycles.

Naming of cells. In
The amplified DNA fragment composition was purified with Agencourt AMPure XP twice (A63881, Beckman) and again checked by Qubit (> 60 ng of cDNA in total yield) and by Bioanalyzer to ensure that the fragment size was sharply distributed around 300 bp (on average, about 320 bp with a standard deviation of 40). The concentration of fragments with appropriate index adapters was quantified by KAPA Library Quantification Kits (KAPA Library Quantification Kits, Illumina GA/Universal, KK4825, Genetics) to ensure that the final libraries had adapters for both ends and their concentration was at least 20 pM.  Figs. 1, 3 and 4).
The resulting reads were aligned using Bowtie 48 version 2.2.6 to the Ciona KH genome assembly 49,50 , downloaded from Ghost (http://ghost.zool.kyoto-u.ac.jp/download_kh.html). Reads were mapped using local alignment (-local), with other settings at their default. We did not trim or filter reads, but instead made use of local alignment to find the optimal match. This had the additional benefit that we did not need to split up reads to handle transcripts spanning more than one intron, as is done, for example, in TopHat 51 . Gene counts were calculated from the resulting alignment files using htseq-count 52 with the non-stranded option and mode "intersection-nonempty" against the KH gene models (version 2013) downloaded from Ghost.
We assessed our samples for mapping quality. We excluded one embryo from subsequent analysis since it had oligo-dT primer sequence in more than 50% of its read pairs; the remaining four embryos had less than 1% of read pairs affected. All remaining samples mapped well to the genome (Supplementary Table 3) and a uniform number of genes were detected (about 60%), although embryo 1 had noticeably fewer detected genes for some of its cells.