Genome-wide identification and developmental expression profiling of long noncoding RNAs during Drosophila metamorphosis

An increasing number of long noncoding RNAs (lncRNAs) have been discovered with the recent advances in RNA-sequencing technologies. lncRNAs play key roles across diverse biological processes, and are involved in developmental regulation. However, knowledge about how the genome-wide expression of lncRNAs is developmentally regulated is still limited. We here performed a whole-genome identification of lncRNAs followed by a global expression profiling of these lncRNAs during development in Drosophila melanogaster. We combined bioinformatic prediction of lncRNAs with stringent filtering of protein-coding transcripts and experimental validation to define a high-confidence set of Drosophila lncRNAs. We identified 1,077 lncRNAs in the given transcriptomes that contain 43,967 transcripts; among these, 646 lncRNAs are novel. In vivo expression profiling of these lncRNAs in 27 developmental processes revealed that the expression of lncRNAs is highly temporally restricted relative to that of protein-coding genes. Remarkably, 21% and 42% lncRNAs were significantly upregulated at late embryonic and larval stage, the critical time for developmental transition. The results highlight the developmental specificity of lncRNA expression, and reflect the regulatory significance of a large subclass of lncRNAs for the onset of metamorphosis. The systematic annotation and expression analysis of lncRNAs during Drosophila development form the foundation for future functional exploration.

The rapid accumulation of data from tilling arrays and high-throughput sequencing indicates that most of a eukaryotic genome is actively transcribed. Many of these transcribed RNAs are untranslatable into proteins and are called noncoding RNAs (ncRNAs). More than half of the mammalian transcriptome is comprised of ncRNAs 1 . These transcripts consist of small ncRNAs (including microRNAs and piRNAs) and long ncRNAs (lncRNA, > 200 nucleotides). There is increasing in vitro and in vivo evidence that lncRNAs play key roles across diverse biological processes [2][3][4][5] , although functions of lncRNAs are largely unknown 6,7 . For example, lncRNAs are involved in dosage compensation, genomic imprinting, and regulation of epigenetic marks and gene expression [8][9][10] . Many lncR-NAs are associated with human disease 3,10-13 . Therefore, the discovery and annotation of lncRNAs have attracted increasing attentions in recent years 7 .
Intriguing studies on individual lncRNAs have revealed distinct functions for different lncRNA families. One of the major functions mediated by lncRNAs is related to developmental regulation for pluripotency, differentiation, specialization and homeostasis 3,14,15 . Several findings that support the biological significance of lncR-NAs demonstrate that many lncRNAs exhibit tissue-or cell type-specific expression [16][17][18][19] . Some lncRNAs exhibit expression at specific developmental time point 20 , and thus can be only discovered during narrow developmental time windows. These findings suggest that the complex task involved in developmental regulation requires the fine timing, space, and rate of expression of specific lncRNAs 2 . However, knowledge about how the genome-wide expression of lncRNAs is developmentally regulated is still limited. A systematic profiling of the temporal or spatial expression of lncRNAs during development undoubtedly will facilitate such understanding.
D. melanogaster is a cheaper and more amenable genetic model organism for the functional study of lncRNAs compared with mammal models. Drosophila also has many benefits for the evolutionary and experimental investigation of lncRNA loci 21 . Many Drosophila lncRNAs have been predicted [21][22][23][24][25] . Several lncRNAs have been functionally characterized [26][27][28] , and found to be associated with X inactivation 29 , behaviors 30 , and neuronal disease 31 . However, a systematic identification of lncRNAs from exhaustive transcriptome datasets and global expression profiling of lncRNAs during development processes is still lacking, thus impeding the functional and mechanistic exploration of lncRNAs in D. melanogaster.
The annotations of many Drosophila transcriptomes from modENCODE (www.modencode.org) and Flybase (www.flybase.org) have included expression data at all developmental stages 32,33 . These databases provide valuable sources for the systematic catalog of lncRNA genes as well as the developmental profiling of their expression. In this study, we applied integrated methods to discover D. melanogaster lncRNAs and verified the predicted lncR-NAs with bioinformatic and experimental approaches. The expression profiles of lncRNAs in 27 developmental processes in D. melanogaster were analyzed. The results showed high developmental specificity of lncRNA expression and extensive involvement of a large subclass of lncRNAs during Drosophila metamorphosis. The systematic annotation and expression profiling of Drosophila lncRNAs highlight the fundamental importance of lncRNA enrichment for developmental regulation and lay the foundation for future functional and evolutionary studies.

Results
Drosophila lncRNA identification and validation. We identified putative lncRNAs from D. melanogaster transcriptomes deposited in modENCODE. The database contains 43,967 transcripts from both poly(A) and total RNAs that are expressed at 27 distinct developmental stages from embryo to adult 32 . We first filtered the transcriptomes of transcripts that are shorter than 200 bp in total length or predicted to harbor maximum open reading frames (ORFs) encoding longer than 100 amino acids (Fig. 1a). Protein coding potential of transcripts was evaluated by TransDecoder (http://transdecoder.github.io). We then used the software CPAT to predict lncRNA, which is expected to yield more than 96% sensitivity and specificity 34 . As a result, we identified 1,077 putative lncRNAs (> 200 base pair, or bp) in D. melanogaster. We last evaluated the false positive rate of prediction based on protein annotation in Flybase (http://flybase.org/). Blast against the Drosophila protein-coding transcript sequences (CDS) showed that 9% of the putative lncRNAs, i.e., 101 transcripts, were annotated as CDS in Flybase, even though the coding proteins of most (62%) of the CDS have not been reported. Therefore, only the remaining 976 lncRNAs were accepted and used for the following examination (Supplementary data 1). The predicted lncRNAs are distinct from CDS in coding probability ( Supplementary Fig. S1). Among these lncRNAs, 330 transcripts have been annotated as lncRNAs in modENCODE, Flybase, or previous publications 21, 23 consisting of 34% of all lncRNAs predicted. The 646 Drosophila lncRNAs are therefore putatively novel.
The predicted lncRNAs had an average length of 621 bp. Most (87%) of the lncRNAs were less than 1,000 bp long ( Fig. 1b and Supplementary data 1). The averaged size of CDS is 11,254 bp. The size of lncRNAs is much shorter than that of CDS. The averaged size of the 330 lncRNAs is 3,490 bp, and that of the 646 novel lncRNAs is 1,099 bp (Supplementary data 1). The 330 lncRNAs are longer than the 646 lncRNAs mainly because a large proportion (47%) of the novel lncRNAs range between 200-500 bp; In contrast, 39% of the annotated lncRNAs range between 500-1,000 bp, and 19% are longer than 3,000 bp ( Supplementary Fig. S2a). The CDS have 6.  Supplementary Fig. S2b). The results indicate that we have identified many previously annotated lncRNAs as well as a subset of novel lncRNAs that are distinct in gene structure in D. melanogaster.
To verify that these lncRNAs were transcribed RNA but not artificial products or amplified from genomic DNA or contaminants, 20 lncRNAs, including 12 intergenic lncRNAs, were randomly selected from the novel lncRNAs and experimentally validated by RT-PCR. Among the 20 lncRNAs tested, 19 yielded PCR products of the right size and correct sequence from their cDNA templates (Fig. 1c). The 12 intergenic lncRNAs were randomly selected from the 19 lncRNAs and further tested by RNA dot hybridization. Eleven lncRNAs exhibited clear RNA blot with antisense probe but not with sense probe, i.e., the negative control (Fig. 1d). The genomic structures of the 12 lncRNAs were annotated based on Flybase ( Supplementary Fig. S3). Detailed analysis on one intergenic lncRNA, i.e., chr2R_135227_135517, revealed its prominent expression in many developmental stages, even though the expression level was low compared with that of the neighboring coding genes (Fig. 1e).
Developmental specificity of lncRNA expression in D. melanogaster. We examined the possible functions of the 976 lncRNAs in developmental regulation by analyzing their developmental profiling and specificity of expression at different stages. We investigated the pattern of gene expression changes with embryonic development at 0 h to 24 h and then compared the expression profiles between lncRNAs and protein-coding loci (Fig. 2a). Some lncRNAs were highly expressed in one of the 12 time points of embryonic development. For example, 133 lncRNAs consisting of 20.7% of the total expressed lncRNAs were specifically highly expressed at the 22 h to 24 h embryonic stage, the critical time point close to the metamorphosis from embryo to larva 35 . This percentage of lncRNAs was significantly higher than that (5.2%) of protein genes highly expressed at this specific stage (Fisher's exact test, P < 3.6 × 10 −15 ). The majority (69%) of expressed lncRNAs with 454 transcripts demonstrated a more developmental stage-specific expression, relative to 39% of the protein-coding RNAs (Fisher's exact test, P < 3.4 × 10 −14 ) (Fig. 2a). The exclusion of 941 housekeeping gene transcripts (Supplementary data 2) from CDS didn't alter the proportion (i.e., 38.6%) of CDS exhibiting developmental-specific high expression ( Supplementary Fig. S4). To further test whether or not lncRNAs have a more temporally restricted expression than coding genes, we introduced a measure of expression level divergence over developmental time, i.e., a Shannon entropy-based specificity score per locus (JS) 20 . The JS value ranged from 0 to 1, and the expression pattern with JS = 1 indicated perfect specificity. To rule out that the different specificities resulted from the low expression level of lncRNAs,  we compared the specificity values of lncRNAs at different expression levels. The developmental specificities of lncRNA expression at different levels were compared in two ranges of maximal JS specificity score. The JS difference between lowly and moderately expressed lncRNAs was significant at JS < 0.9 (Mann-Whitney U Test, P < 2.3 × 10 −12 ). The difference between moderately and highly expressed lncRNAs was also significant (P < 9.6 × 10 −15 ). The proportion of lncRNAs with extreme specificity, with JS in the range of 0.9 to 1.0, was also compared. Approximately 26.5% lowly expressed lncRNAs, 3.8% moderately expressed lncRNAs, and 1.1% highly expressed lncRNAs exhibited extreme temporal-restricted expression. A significant difference in the proportion was detected between lowly and moderately expressed lncRNAs (Fisher's exact test, P < 6.5 × 10 −12 ) but not between moderately and highly expressed lncRNAs (P = 0.071) (Fig. 2b). The results suggest that the lowly expressed lncRNAs significantly contribute to the extreme JS value. Therefore, to compare the expression specificities of lncRNAs and coding RNAs, we excluded the lowly expressed lncRNA, and only compared the coding RNAs and 331 highly expressed lncRNAs in embryos. A significant difference in developmental specificity was observed between the coding genes and highly expressed lncRNAs (Mann-Whitney U Test, P = 0.039) (Fig. 2c).
The developmental specificity of lncRNA expression was also examined at the larval stage (Fig. 3). The majority (79%) of the lncRNAs had high stage-specific expression. By contrast, half (50%) coding genes were specifically highly expressed (P < 3.9 × 10 −14 ). The percentage of highly expressed genes (15,585 out of 31,212) remained unchanged after the housekeeping genes were excluded for the analysis (Supplementary Fig. S4). Approximately 42% lncRNAs were specifically highly expressed at the last larval stage, the critical time for the transformation from larval to prepupal stage (Fig. 3a). The maximal JS specificity of the coding genes was obtained at 0.36, whereas that of the lncRNAs was obtained at 0.5. Up to 11% of the lncRNAs reached an extreme specificity, which was significantly higher than that (4%) of the protein-coding genes (P < 4.1 × 10 −11 ) (Fig. 3b). The developmental specificity of the lncRNA expression observed at the pupal stage was not as prominent as that at the embryonic and larval stages (Supplementary Fig. S5). No significant difference in the maximum JS specificity was detected between the lncRNAs and coding RNAs (Supplementary Fig. S5). Discussion An increasing number of ncRNAs have been discovered with the deeper-sequencing of transcriptomes and the accumulation of transcript data. Thousands of ncRNA loci (both small and long RNAs) have been suggested to exist in the D. melanogaster genome 21,22,25,36 . In the present study, we combined high-accuracy prediction of lncR-NAs with a stringent filtering of putative protein-coding transcripts to define a high-confidence set of lncRNAs. As lncRNAs typically present a cell type-specific expression as well as a lower expression than protein-coding genes, we searched lncRNAs in a rich source of transcriptomes that are originated from 27 developmental stages. We identified 1,077 lncRNAs from 43,967 transcripts, among which 646 lncRNAs have not been previously reported. These lncRNAs were missed from previous annotation mainly because different methods for ncRNA annotation had been used. For example, we used CPAT program for ncRNA prediction, which relies on a logistic regression model built with sequence features 34 . Many novel lncRNAs are short and located in intergenic region. However, previous methods identified ncRNA mainly based on the annotation of gene structure/gene model in flybase 32 . Experimental validations further confirm the reliability of the identified lncRNAs. These lncRNAs provide a high-quality resource of lncRNAs for future functional and evolutionary studies. Nevertheless, more lncRNAs in D. melanogaster are expected to be discovered with deeper-sequencing in different tissues and in individuals exposed to different environmental conditions.
Analysis of the developmental in vivo expression profile of Drosophila lncRNAs highlighted the high developmental specificity of lncRNA expression. The expression profiling of the 976 lncRNAs showed that lncRNAs were expressed in a much narrower time window than protein-coding genes. Thus, the former was more highly temporally restricted than the latter. The developmental specificity of expression was demonstrated at the embryonic, larval, and pupal stages examined. Recent results prove that lncRNAs have important functions in cell fate decision, differentiation, migration and signaling 6,18,20,37,38 . Our findings suggest that lncRNAs could be involved in regulation of the timing of developmental transition.
An interesting finding revealed in the study is that a large subclass of lncRNAs are prominently upregulated at the developmental stage of metamorphosis. Twenty-one percent of lncRNAs were highly upregulated at 22-24 h of embryos (Fig. 2) and 42% of lncRNAs were greatly upregulated at the puff stage 7-9 (i.e., L6) of larvae (Fig. 3). The massive upregulation of lncRNAs coincides with the onset of metamorphosis from embryos to larvae and from larvae to prepupae. A similar phenomenon was reported in the developmental expression of lncRNAs in zebrafish 20 . The results indicate that lncRNA enrichment in development could be important for the transformation and organogenesis in animals, even though further experiments are needed to demonstrate it.
Numerous studies have provided evidence that lncRNAs are crucial players in cell differentiation and development 6 . For example, lncRNA Fendrr presents lateral mesoderm-specific expression which is critical for proper wall development 18 . LncRNAs also exhibit conserved functions in animal embryonic development 2,15,39 . Some lncRNAs are involved in processes directing embryonic stem cell pluripotency and differentiation 15 . Interestingly, expression of many lncRNAs is regulated in response to hormone 40 . The coordinated high expression of numerous lncRNAs in late embryos and larvae are highly in synchrony with the high titer pulses of the steroid hormone ecdysone in Drosophila 40,41 . Temporal boundaries in the Drosophila life cycle is defined by ecdysone, whose titrate pulses trigger the major postembryonic developmental transitions, including molting and metamorphosis 35,41 . Ecdysone exerts its effects through a regulatory cascade of ecdysone -ecdysone receptors (e.g., EcR and USP) -target genes. Some lncRNAs act as coactivator 42 or repressor 43 of hormone receptors and respond to hormone titration 5 . The upregulation of lncRNAs could be responding to a temporal signal that occurs in association with or in parallel with the ecdysone -EcR signaling pathway. Therefore, lncRNAs may mediate the timing of embryo-to-pupae cell fate decisions by interacting with ecdysone and temporal protein regulators of differentiation.

D. melanogaster transcriptomes and transcript expression.
The transcriptome data of D. melanogaster were obtained from modENCODE (http://www.modencode.org). The transcriptomes have an annotation of 43,967 transcripts that are pooled from RNA expressed at 27 distinct developmental stages from embryo to adult. These stages include 12 embryonic stages divided at 2 h intervals for 24 h, 6 larval stages, 6 pupal stages, and 3 male and 3 female adult stages. The transcriptomes of each stage were generated from poly(A) RNAs, but embryonic transcriptomes were generated from both poly(A) RNAs and total RNAs 32 . A variety of methods and sequencing data, such as complementary tiling microarray and RNA-Seq, long read sequencing, and genome mapping, were used for the prediction of novel splice junctions and the mapping and annotation of transcripts and genes. RNAs from three biological replicates of each sample were prepared for tiling arrays 32 . Here 'transcript' refers to a RNA molecule that transcribes from a genomic loci whereas 'gene' refers to one or more transcripts that share exons. The expression data from poly(A) RNAs were used in this study.
Drosophila lncRNA identification. Drosophila lncRNAs were predicted from the 43,967 transcripts. Open reading frames (ORFs) in the transcripts were evaluated using the program TransDecoder (http://transdecoder. github.io). Transcripts possessing a maximum ORF encoding > 100 amino acids were filtered. A state-of-the-art and popular software CPAT 34 was used to predict ncRNAs in the transcripts. CPAT is an alignment-free method and uses a logistic regression model built with four sequence features including ORF size, ORF coverage, Fickett TESTCODE statistic and hexamer usage bias. CPAT has been optimized for prediction of Drosophila lncRNA with coding probability cutoff at 0.39 (i.e., coding probability < 0.39 indicates ncRNA, also see: http://rna-cpat. sourceforge.net/) 34 . Prediction with CPAT yields a high sensitivity (0.96) and high specificity (0.97) in theory 34 . The predicted transcripts were finally blasted against CDS annotated in Flybase (built 5.54) to evaluate false positive rate of the prediction. The ncRNAs that were filtered of all CDS and longer than 200 bp genes were labeled as putative Drosophila lncRNAs for further experimental validation and expression analysis. Validation of Drosophila lncRNAs by RT-PCR and dot-blot hybridization. Twenty intergenic lncR-NAs that were randomly selected from the predicted lncRNAs were validated by RT-PCR and RNA dot hybridization (Supplementary Table S1). Total RNA was isolated from whole body tissues of 5-d-old female adults of D. melanogaster (strain OreR) and was treated with Dnase I. Reverse transcription was then performed using Oligo-d(T) following the Omniscript RT kit instruction (QIAGEN, Cat. no. 205110). We first used RT-PCR to validate the candidate lncRNAs. The amplification conditions were as follows: 94 °C for 1 min and 40 cycles of 94 °C for 30 s, 58 °C for 30 s, and 72 °C for 30 s. Thirty cycles of amplification were used for the amplification. The RT-PCR products were purified and were sent for direct sequencing to verify their sequences. Positive and negative controls were also subjected to RT-PCR. The positive control contained genomic DNA as the template for PCR to test the validity of the designed primers and the PCR system used. The negative control had no reverse transcriptase added for reverse transcription. Expression of 12 lncRNAs were further validated by RNA hybridization. These lncRNAs except one, i.e. chr2L_21798044_21799854, are single exoned. High specific RNA probes were first synthesized using Biotin RNA labeling mix (Roche, Cat. no. 11685597910) and T7 RNA polymerase (Promega, Cat. no. p2075). Total 10 μg RNA were hybridized with the probes on BrightStar-Plus nylon membrane (Ambion, Cat. no. AM10102). After blocking and wash with Chemiluminescent Nucleic Acid detection Module (Thermal Scientific, Cat. no. 89880), the biotin-labeled RNA was detected using CCD imaging station system (Carestream Health, USA). Primers for hybridization probes were the same as those used for the RT-PCR. Details on the selected ncRNAs and PCR primers can be found in Supplementary Table S1. Note that all these primers were designed to amplify exonic region.
Developmental specificity of lncRNA expression. FPKM expression levels were normalized to obtain the relative expression levels over the course of development and then to visualize developmental expression profiles via heatmaps. The lncRNAs and protein-coding loci were separately clustered using Cluster 3.0 software using k-means with a city-block distance matrix 44 . The transcripts were clustered into 10 groups at the embryonic stage, 8 groups at the larval stage, and 9 groups at the pupal stage to achieve distinct clusters. The 976 lncRNAs identified in this study and the 33,074 protein-coding transcripts were analyzed for the heatmap and developmental specificity of expression. The protein-coding transcripts were further filtered to exclude housekeeping genes for the analysis of developmental specificity. By comparing with human housekeeping genes 45 , a total of 941 orthologous transcripts from 172 genes were taken as Drosophila housekeeping genes (Supplemental data 2).
The developmental specificity of the expression of a transcript was evaluated following the measure used by Trapnell et al. 46 and Cabili et al. 16 . The entropy-based specificity measure quantifies the similarity between the expression pattern of an ncRNA or coding transcript and another predefined transcript expressed at only one stage. The developmental specificity is defined as the maximal specificity score of the transcript expression pattern across all developmental stages, i.e., maximal JS specificity score. In the present study, we examined the specificity measures of transcripts expressed at the embryonic, larval, and pupal stages.

Statistical analyses.
Fisher's exact test was performed to compare the difference in the number of genes. All lncRNAs were sorted based on their averaged expression level in different periods of a developmental stage, and categorized according to their ranking of expression. The difference in JS values between gene sets was analyzed by Mann-Whitney U Test.