Comparative Analysis of the Transcriptome across Distant Species

The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a 'universal model' based on a single set of organism-independent parameters.

The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features 1-6 . Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-tolarvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a 'universal model' based on a single set of organism-independent parameters.
Our comparison used the ENCODE-modENCODE RNA resource (Extended Data Fig. 1). This resource comprises: deeply sequenced RNAsequencing (RNA-seq) data from many distinct samples from all three organisms; comprehensive annotation of transcribed elements; and uniformly processed, standardized analysis files, focusing on non-coding transcription and expression patterns. Where practical, these data sets match comparable samples across organisms and to other types of functional genomics data. In total, the resource contains 575 different experiments containing .67 billion sequence reads. It encompasses many different RNA types, including poly(A)1, poly(A)-, ribosomal-RNAdepleted, short and long RNA.
The annotation in the resource represents a capstone for the decadelong efforts in human, worm and fly. The new annotation sets have numbers, sizes and families of protein-coding genes similar to previous *These authors contributed equally to this work. 1These authors jointly supervised this work.
compilations; however, the number of pseudogenes and annotated noncoding RNAs differ (Extended Data Fig. 2, Extended Data Table 1 and Supplementary Fig. 1). Also, the number of splicing events is greatly increased, resulting in a concomitant increase in protein complexity. We find the proportion of the different types of alternative splicing (for example, exon skipping or intron retention) is generally similar across the three organisms; however, skipped exons predominate in human while retained introns are most common in worm and fly 7 (Extended Data Fig. 3 A fraction of the transcription comes from genomic regions not associated with standard annotations, representing 'non-canonical transcription' (Supplementary Table 2) 8 . Using a minimum-run-maximum-gap algorithm to process reads mapping outside of protein-coding transcripts, pseudogenes and annotated non-coding RNAs, we identified read clusters; that is, transcriptionally active regions (TARs). Across all three genomes we found roughly one-third of the bases gives rise to TARs or non-canonical transcription (Extended Data Table 1). To determine the extent that this transcription represents an expansion of the current established classes of non-coding RNAs, we identified the TARs most similar to known annotated non-coding RNAs using a supervised classifier 9 (Supplementary Fig. 2 and Supplementary Table 2). We validated the classifier's predictions using RT-PCR (PCR with reverse transcription), demonstrating high accuracy. Overall, these predictions encompass only a small fraction of all TARs, suggesting that most TARs have features distinct from annotated non-coding RNAs and that the majority of non-coding RNAs of established classes have already been identified. To shed further light on the possible roles of TARs we intersected them with enhancers and HOT (high-occupancy target) regions 8,10-13 , finding statistically significant overlaps (Extended Data Fig. 4 and Supplementary Table 2).
Given the uniformly processed nature of the data and annotations, we were able to make comparisons across organisms. First, we built coexpression modules, extending earlier analysis 14 (Fig. 1a). To detect modules consistently across the three species, we combined across-species orthology and within-species co-expression relationships. In the resulting multilayer network we searched for dense subgraphs (modules), using simulated annealing 15,16 . We found some modules dominated by a single species, whereas others contain genes from two or three. As expected, the modules with genes from multiple species are enriched in orthologues.
Moreover, a phylogenetic analysis shows that the genes in such modules are more conserved across 56 diverse animal species (Extended Data Fig. 5 and Supplementary Fig. 3). To focus on the cross-species conserved functions, we restricted the clustering to orthologues, arriving at 16 conserved modules, which are enriched in a variety of functions, ranging from morphogenesis to chromatin remodelling ( Fig. 1a and Supplementary Table 3). Finally, we annotated many TARs based on correlating their expression profiles with these modules (Extended Data Fig. 4).
Next, we used expression profiles of orthologous genes to align the developmental stages in worm and fly ( Fig. 1b and Extended Data Fig. 6). For every developmental stage, we identified stage-associated genes; that worm and fly gene-gene co-association matrix; darker colouring reflects the increased likelihood that a pair of genes are assigned to the same module. A dark block along the diagonal represents a group of genes within a species. If this is associated with an off-diagonal block then it is a cross-species module (for example, a three-species conserved module is shown with a circle and a worm-fly module, with a star). However, if a diagonal block has no off-diagonal associations, then it forms a species-specific module (for example, green pentagon). Right, the Gene Ontology functional enrichment of genes within the 16 conserved modules is shown. GF, growth factor; nuc., nuclear; proc., processing. b, Primary and secondary alignments of worm-and-fly developmental stages based on all worm-fly orthologues. Inset shows worm-fly stage alignment using only hourglass orthologues is more significant and exhibits a gap

RESEARCH LETTER
is, genes highly expressed at that particular stage but not across all stages. We then counted the number of orthologous pairs among these stageassociated genes for each possible worm-and-fly stage correspondence, aligning stages by the significance of the overlap. Notably, worm stages map to two sets of fly stages. First, they match in a co-linear fashion to the fly (that is, embryos-to-embryos, larvae-to-larvae). However, worm late embryonic stages also match fly pupal stages, suggesting a shared expression program between embryogenesis and metamorphosis. The approximately 50 stage-associated genes involved in this dual alignment are enriched in functions such as ion transport and cation-channel activity (Supplementary Table 3).
To gain further insight into the stage alignment, we examined our 16 conserved modules in terms of the 'hourglass hypothesis', which posits that all animals go through a particular stage in embryonic development (the tight point of the hourglass or 'phylotypic' stage) during which the expression divergence across species for orthologous genes is smallest 4,5,17 . For genes in 12 of the 16 modules, we observed canonical hourglass behaviour; that is, inter-organism expression divergence across closely related fly species during development is minimal 5 (Supplementary Fig. 3). Moreover, we find a subset of TARs also exhibit this hourglass behaviour ( Supplementary Fig. 2). Beyond looking at inter-species divergence, we also investigated the intra-species divergence within just Drosophila melanogaster and Caenorhabditis elegans. Notably, we observed that divergence of gene expression between modules is minimized during the worm and fly phylotypic stages (Fig. 1c). This suggests, for an individual species, the expression patterns of different modules are most tightly coordinated (low divergence) during the phylotypic stage, but each module has its own expression signature before and after this. In fact it is possible to see this coordination directly as a local maximum in between-module correlations for the worm (Extended Data Fig. 5). Finally, using genes from just the 12 'hourglass modules', we found that the alignment between worm and fly stages becomes stronger (Fig. 1b and Supplementary Fig. 3); in particular it shows a gap where no changes are observed, perfectly matching the phylotypic stage.
The uniformly processed and matched nature of the transcriptome data also facilitates integration with upstream factor-binding and chromatinmodification signals. We investigated the degree to which these upstream signals can quantitatively predict gene expression and how consistent this prediction is across organisms. Similar to previous reports 11,18,19 , we found consistent correlations, around the transcription start site (TSS), in each of the three species between various histone-modification signals and the expression level of the downstream gene: H3K4me1, H3K4me2, H3K4me3 and H3K27ac are positively correlated, whereas H3K27me3 is negatively correlated (Fig. 2, Extended Data Fig. 7 and Supplementary  Fig. 4). Then for each organism, we integrated these individual correlations into a multivariate, statistical model, obtaining high accuracy in predicting expression for protein-coding genes and non-coding RNAs. The promoter-associated marks, H3K4me2 and H3K4me3, consistently have the highest contribution to the model.
A similar statistical analysis with transcription factors showed the correlation between gene expression and transcription-factor binding to be the greatest at the TSS, positively for activators and negatively for repressors (Extended Data Fig. 7). Integrated transcription-factor models in each organism also achieved high accuracy for protein-coding genes and non-coding RNAs, with as few as five transcription factors necessary for accurate predictions (Extended Data Fig. 8). This perhaps reflects an intricate, correlated structure to regulation. The relative importance of the upstream regions is more peaked for the transcription-factor models than for the histone ones, likely reflecting the fact that histone modifications are spread over broader regions, including the gene body, whereas most transcription factors bind near the promoter.
Finally, we constructed a 'universal model', containing a single set of organism-independent parameters ( Fig. 2 and Supplementary Fig. 4). This achieved accuracy comparable to the organism-specific models. In the universal model, the consistently important promoter-associated marks such as H3K4me2 and H3K4me3 are weighted most highly. In contrast, the enhancer mark H3K4me1 is down-weighted, perhaps reflecting that signals for most human enhancers are not near the TSS. Using the same set of organism-independent parameters derived from training on protein-coding genes, the universal model can also accurately predict non-coding RNA expression.
Overall, our comparison of the transcriptomes of three phylogenetically distant metazoans highlights fundamental features of transcription conserved across animal phyla. First, there are ancient co-expression modules across organisms, many of which are enriched for developmentally important hourglass genes. These conserved modules have highly coordinated intra-organism expression during the phylotypic stage, but display diversified expression before and after. The expression clustering also aligns developmental stages between worm and fly, revealing shared expression programs between embryogenesis and metamorphosis. Finally, we were able to build a single model that could predict transcription in all three organisms from upstream histone marks using a single set of parameters for both protein-coding genes and non-coding RNAs. Overall, our results underscore the importance of comparing divergent model organisms to human to highlight conserved biological principles (and disentangle them from lineage-specific adaptations).   a, The overlap of enhancers and distal HOT regions with supervised noncoding RNA predictions and TARs in human, worm and fly. The overlap of enhancers and distal HOT regions with respect to both supervised non-coding RNA predictions as well as TARs are significantly enriched compared to a randomized expectation. b, The left side highlights non-coding RNA and TARs that are highly correlated with corresponding HOX orthologues in human (HOXB4), worm (lin-39) and fly (Dfd). The expression of mir-10 correlates strongly with Dfd in fly (r 5 0.66, P , 6 3 10 24 in fly), as does mir-10a in human, which correlates strongly with HOXB4 (r 5 0.88, P , 2 3 10 29 ). A TAR (chr III: 8871234-2613) strongly correlates with lin-39 (r 5 0.91, P , 4 3 10 213 ) in worm. The right side shows TARs in human (chr 19: 7698570-7701990), worm (chr II: 11469045-440), and fly (chr 2L: 2969620-772) that are negatively correlated with the expression of three orthologous genes: SGCB (r 5 20.91, P , 3 3 10 216 ), sgcb-1 (r 5 20.86, P , 2 3 10 27 ) and Scgb (r 5 20.82, P , 4 3 10 28 ), respectively. (More details on all parts of this figure are in Supplementary Information,