Amalgamated cross-species transcriptomes indicate organ-specific preadaptation for functional shifts in gene expression

The origins of complex multicellular physiology lie in the evolution of gene expression. Genes express differentially among organs as organisms evolve, but it is not well understood whether expression in certain organs increases the probability that they will be repurposed for expression in other organs. To examine this question, we amalgamated 1,903 RNA-seq datasets from 182 research projects, including 6 organs in 21 vertebrate species. We used various automated quality controls to eliminate project-specific biases, and defined expression regime evolution using tree-based Ornstein-Uhlenbeck models, allowing us to reconstruct evolutionary pathways of gene-family-wise expression on a genome-wide scale. Fluxes in organ-specific gene expression were non-random, suggesting that some ancestral expression patterns strongly preadapted genes for expression in certain organs, while others did not. For example, brain, ovary, and testis tend to change expression from one to another, while kidney and liver form a separate loop of gene expression evolution, illustrating a strong modularity of gene exchange among vertebrate organs. More subtle but significant differences depended on the type of gene duplication. Notably, RNA-based gene duplications tended to generate asymmetric exchange fluxes between organs. This detailed view of the evolutionary dynamics of gene expression supports a major role for preadaptive pathways.

shift genome location tend to have higher expression shift frequencies than those that do 131 not (Fig. S9), in line with previous observations from the human genome (Lan and 132 Pritchard, 2016). Expression levels varied most in ovary and testes, which had 133 significantly higher average stationary variances than the other four organs (Fig. 3C). 134 This is consistent with the observation of strong sex-related divergence of gene 135 expression in testes (Brawand et al., 2011), but shows strong sex-related divergence in 136 the reproductive organs of both sexes. 137 Expression shifts may coincide with changes in protein sequence evolution, and an 138 analysis of nonsynonymous/synonymous substitution rate ratios revealed a significant 139 association in all three branching categories, with a larger effect size in duplication than 140 in speciation (Fig. S10). An expression shift is associated with a 24% increase in 141 median nonsynonymous/synonymous substitution rate ratio in DNA-based duplications 142 (0.218 to 0.270) and 764% increase in RNA-based duplications (0.0456 to 0.394), in 143 contrast to only 4% increase in speciation (0.0941 to 0.0977). These differences are 144 much smaller than the differences between duplication and speciation, and small 145 compared to the variance among orthogroups (4.38), suggesting that the effect is not 146 large enough to be reliably detected in individual gene family analyses. 147 To further characterize how expression properties change from ancestral to derived 148 regimes, we examined changes in expression specificity, measured by τ (Yanai et al., 149 2005). Genes with highly specific expression have τ values close to one, whereas 150 uniformly expressed genes yield τ values close to zero. Compared with speciation, 151 expression patterns following RNA-based duplications are more specific, and 152 in expression properties were more drastic in RNA-based duplications, followed by 164 DNA-based duplications and speciation events, probably reflecting the different modes 165 of inheritance of regulatory environments. RNA-based duplication accompanies the 166 elimination of regulatory elements outside transcribed regions, whereas DNA-based 167 duplication can retain certain regulatory regions depending on the range of duplicated 168 genomic segments (Babushok et al., 2007). 169 To reveal the evolutionary dynamics of gene expression patterns, we quantified the 170 gain and loss of genes characterized by the organ in which they are most highly 171 expressed (primary-expressed). Across vertebrates, switching from one 172 primary-expressed organ to another was detected in 14,403, 4,364, and 783 regime 173 shifts after speciation, DNA-based duplication, and RNA-based duplication, 174 respectively (Fig. 4A;Fig. S11). Of note, the gain/loss ratios are heterogeneous among 175 organs, suggesting that vertebrate organs serve as both sources and sinks in expression 176 evolution, but that their relative contributions are organ-specific. 177 The preadaptation hypothesis predicts that expression shifts depend on ancestral 178 expression conditions prior to the shifts. To test this, we examined whether shifts in 179 primary-expressed organs are significantly different from random expectation 180 (controlled for the total number of shifts from and to each organ). There are clear 181 patterns of evolutionary transitions, many of which are largely shared between 182 speciation and duplication events (Fig. 4B). For example, brain, ovary, and testis 183 showed strong connections, indicating a solid exchange module. Kidney and liver also 184 donate genes to one another, forming a separate module from brain-ovary-testis. 185 al., 2017). Larger orthogroups and longer genes tended to fit more complex substitution 294 matrices and larger numbers of categories of rate heterogeneity  Table  295 S4). Ultrafast bootstrapping with 1,000 replicates was performed to evaluate the 296 credibility of tree topology (Minh et al., 2013) with a further optimization of each 297 bootstrapping tree (-bnni option) (Hoang et al., 2018). 298 Reconciliation-assisted gene tree rooting. Candidate rooting positions were 299 inferred with different methods. Using the dated species tree, all rooting branches with 300 the minimum duplication-loss score were identified using the rooting mode of 301 NOTUNG 2.9 with the default parameters (duplication score = 1.5, loss score = 1.0) 302 (Chen et al., 2000). The midpoint of the longest path (Farris, 1972) and the position 303 with the minimal ancestor deviations (MAD) (Tria et al., 2017) were also considered as 304 candidates. The final rooting position was reported based on overlaps among those 305 rooting positions (Figs. S14C-S15; Table S4). 306 Reconciliation-assisted divergence time estimation. To prepare dated gene trees, 307 we first matched species tree nodes with corresponding gene tree nodes using the 308 reconciliation mode of NOTUNG 2.9 (Chen et al., 2000) and created time constraints of 309 speciation nodes (Fig. S15). Duplication nodes were constrained with the upper and 310 lower age limits derived from corresponding speciation nodes. If the root node is a 311 duplication node and is not covered by the range of the species tree, the upper age limit 312 was set to 1,105 million years ago, which corresponds to the split of animals and fungi 313 (Hedges et al., 2006). Divergence time was then estimated by a penalized likelihood 314 method (Sanderson, 2002) implemented in an R package 'ape' ('chronos' function) 315 (Popescu et al., 2012) with time constraints on speciation, duplication, and root nodes. 316 On 4,472 out of 15,282 gene trees, reasonable initial parameters were not found after 317 analysis (especially those with a large number of genes), we skimmed gene trees by 326 collapsing clades with small changes in expression level (Fig. S16). Specifically, we 327 first calculated all-vs-all Pearson's correlation coefficients of gene expression level 328 among all genes that belong the clade. The clade was replaced with a single tip if the 329 minimum correlation coefficient was greater than 0.5. Phylogenetic means of the clade 330 were calculated by assuming Brownian motion and were used as expression level at the 331 new tip. Although the use of PhylogeneticEM and the tree skimming reduced the 332 number of detected regime shifts, the positions of detected shifts were largely 333 overlapped with those inferred with l1ou without tree skimming (Fig. S17). In addition, 334 this strategy allowed us to analyze the largest gene tree with 6,553 genes. Therefore, we 335 used PhylogeneticEM and/or the tree skimming if the gene tree cannot be analyzed by 336 l1ou with non-skimmed topology (Table S4; 8/15,282 trees). The upper limit of regime 337 shifts was set as max $min( /2, 100), √ 0 1. In most gene trees (15,278/15,282), the 338 number of detected regime shifts was smaller than the upper limit (Fig. S18). 339 Analysis of expression patterns. We characterized expression patterns of extant 340 and ancestral genes by calculating different metrics from fitted values (µ) in the OU 341 models. Organ specificity was measured by τ (Yanai et al., 2005), which outperformed 342 other methods in a benchmark for tissue specificity (Kryuchkova-Mostacci and 343 Robinson-Rechavi, 2016). Expression complementarity between sister lineages was 344 measured by TEC (Huerta-Cepas et al., 2011). Because µ was estimated from 345 log-transformed expression levels, these expression metrics were calculated with 346 unlog-transformed µ values. Primary-expressed organs of gene expression patterns were 347 defined as the organ in which the highest expression levels were observed among the 348 six organs we analyzed. 349 with a posterior probability greater than 0.5 was classified as a retrotransposition event. 390 Although our classification cannot detect RNA-based duplication from originally 391 intronless genes, we expect such situations would be rare because most vertebrate genes 392 contain at least one intron (e.g., 20,160/21,242 human genes in our dataset). 393 Inter-chromosomal translocation was detected by considering chromosomal locations 394 with the highest posterior probability as the ancestral states. Because of the difficulty in 395 determining rooting positions of deep phylogenies, gene tree nodes older than the root 396 node of the species tree were removed from the analysis. 397 Gene tree annotations and visualization. Phylogenetic trees were visualized 398 using a python package 'ETE 3' (Huerta-Cepas et al., 2016) and an R package 'ggtree' 399 (Yu et al., 2017). A part of animal silhouettes in Fig. 3A  . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

References: 414
Altenhoff, A.M., Studer, R.A., Robinson-Rechavi, M., Dessimoz, C., and Couto, F. 415 (2012). Resolving the ortholog conjecture: orthologs tend to be weakly, but 416 significantly, more similar in function than paralogs. PLoS Comput. Biol. CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/409888 doi: bioRxiv preprint first posted online Sep. 5, 2018; organs and species. Genome Biol. 17, 151. 447 Brunner, E., and Munzel, U. (2000). The nonparametric Behrens-Fisher problem: 448 asymptotic theory and a small-sample approximation. Biometrical J. 42, 17-25. 449 Budd, G.E. (2007). On the origin and evolution of major morphological characters. Biol. . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/409888 doi: bioRxiv preprint first posted online Sep. 5, 2018; variables were detected against 52 RNA-seq data from 8 projects (D). Analysis of those 654 variables by linear regression highlights the BioProject feature as the strongest source of 655 removed biases. For full description of predictors, see . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/409888 doi: bioRxiv preprint first posted online Sep. 5, 2018;  . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. Tables S1-S5 (separate file) 703 Figs. S1-S18 704 Supplementary Data (separate file) 705 . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/409888 doi: bioRxiv preprint first posted online Sep. 5, 2018; CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

Fig. S3. A correlation analysis for the detection and removal of anomalous 715
RNA-seq samples. Expression levels of all genes were compared between the sample 716 and the organ averages. A sample was removed if any between-organ comparisons 717 yielded a correlation coefficient higher than the within-organ comparison. 718 . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

Canis lupus
Pearson's correlation coefficient

Xenopus tropicalis
Pearson's correlation coefficient CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/409888 doi: bioRxiv preprint first posted online Sep. 5, 2018; Fig. S7. Predictor analysis of detected surrogate variables. The predictive power was 734 analyzed by linear regression using different properties of RNA-seq experiments: organ (percentage of reads that are mapped to non-nuclear-mRNA features and are removed 740 from the analysis; e.g., 5%), minimum read length (e.g., 25 nt), average read length (e.g., 741 70 nt), maximum read length (e.g., 75 nt), and mapping rate (e.g., 80%). The predictors 742 are summarized in Table S1. 743 . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

Fig. S9. The relationships between expression shifts and chromosomal location. 749
The heatmap shows the frequency of expression shifts observed among the branches 750 with or without a change in the chromosomal category (non-diagonal or diagonal, 751 respectively). Chromosomal locations were categorized into autosomes (A), X 752 chromosome (X) and Y chromosome (Y), and the ancestral locations were inferred by 753 stochastic character mapping. 754 . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.   Maximum-likelihood estimation . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/409888 doi: bioRxiv preprint first posted online Sep. 5, 2018; (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/409888 doi: bioRxiv preprint first posted online Sep. 5, 2018; . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/409888 doi: bioRxiv preprint first posted online Sep. 5, 2018;  . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.  . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/409888 doi: bioRxiv preprint first posted online Sep. 5, 2018; Speciation nodes in the dated species tree were mapped onto the non-dated gene tree by 799 phylogeny reconciliation. By using those nodes as calibration points, the other node 800 ages were estimated using the penalized likelihood method. 801 . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/409888 doi: bioRxiv preprint first posted online Sep. 5, 2018; CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/409888 doi: bioRxiv preprint first posted online Sep. 5, 2018; . CC-BY-NC-ND 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.