Evolutionary characteristics of intergenic transcribed regions indicate widespread noisy transcription in the Poaceae

Extensive transcriptional activity occurring in unannotated, intergenic regions of genomes has raised the question whether intergenic transcription represents the activity of novel genes or noisy expression. To address this, we evaluated cross-species and post-duplication sequence and expression conservation of intergenic transcribed regions (ITRs) in four Poaceae species. Most ITR sequences are species-specific. Those found across species tend to be more divergent in expression and have more recent duplicates compared to annotated genes. To assess if ITRs are functional (under selection), machine learning models were established in Oryza sativa (rice) that could distinguish between benchmark functional (phenotype genes) and nonfunctional (pseudogenes) sequences with high accuracy based on 44 evolutionary and biochemical features. Based on the prediction models, 584 rice ITRs (8%) are classified as likely functional that tend to have conserved expression and ancient retained duplicates. However, most ITRs do not exhibit sequence or expression conservation across species or following duplication, consistent with computational predictions that suggest 61% ITRs are not under selection. We outline key evolutionary characteristics that are tightly associated with likely-functional ITRs and provide a framework to identify novel genes to improve genome annotation and move toward connecting genotype to phenotype in crop and model systems.

ABSTRACT 1 Extensive transcriptional activity occurring in unannotated, intergenic regions of genomes has 2 raised the question whether intergenic transcription represents the activity of novel genes or 3 noisy expression. To address this, we evaluated cross-species and post-duplication sequence 4 and expression conservation of intergenic transcribed regions (ITRs) in four Poaceae species. 5 Most ITR sequences are species-specific. Those found across species tend to be more 6 divergent in expression and have more recent duplicates compared to annotated genes. To 7 assess if ITRs are functional (under selection), machine learning models were established in 8 Oryza sativa (rice) that could distinguish between benchmark functional (phenotype genes) and 9 nonfunctional (pseudogenes) sequences with high accuracy based on 44 evolutionary and 10 biochemical features. Based on the prediction models, 584 rice ITRs (8%) are classified as 11 likely functional that tend to have conserved expression and ancient retained duplicates. 12 However, most ITRs do not exhibit sequence or expression conservation across species or 13 following duplication, consistent with computational predictions that suggest 61% ITRs are not 14 under selection. We outline key evolutionary characteristics that are tightly associated with 15 likely-functional ITRs and provide a framework to identify novel genes to improve genome 16 annotation and move toward connecting genotype to phenotype in crop and model systems. Bakel et al. 2010), intergenic transcripts may also represent the activity of novel genes. These 7 intergenic transcripts may play roles as competitive endogenous RNAs (Tan et al. 2015), cis-8 acting regulatory transcripts (Guil and Esteller 2012), and/or small protein-coding regions that 9 can be missed by gene finding programs (Hanada et al. 2013). In addition to these possible 10 functions, intergenic transcribed regions (ITRs) may represent the products of noisy 11 transcription, resulting from imperfect regulation of the gene expression machinery (Struhl 2007; 12 Moghe et al. 2013) Although some of these ITRs may ultimately provide the raw materials from 13 which novel genes may evolve de novo (Carvunis et al. 2012), they are not expected to be 14 under selection. Thus, identifying the sequence features that distinguish between functional 15 intergenic transcripts and those that are the products of noisy transcription represents a critical 16 task in genome biology. 17 Here, we adopt the "selected effect" definition of function, which stipulates that a 18 measurable activity, e.g. transcription, is considered functional if such activity is under natural 19 selection (Amundson and Lauder 1994; Graur et al. 2013; Doolittle et al. 2014). This definition 20 stands in contrast to the "causal role" definition that considers any reproducible biochemical 21 activity, such as intergenic transcription, to be functional (Amundson and Lauder 1994;22 ENCODE Project Consortium 2012). However, pseudogenes that are remnants of once 23 functional genes can be expressed (Zou et al. 2009; Pei et al. 2012), indicating that sequences 24 that are not under selection could be considered functional based on the causal role definition 25 due to biochemical activities. Given these considerations, the causal role definition of function is 26 inadequate for distinguishing functional ITRs from noisy ones. To establish selected effect 27 functionality in a sequence, significant conservation in sequences or activities provides strong 28 evidence. However, lack of significant conservation does not necessarily indicate a lack of 29 selective pressure, which may be due to selection that is significant but too weak to be detected 30 based on sequence conservation or because only a small stretch of a sequence is under 31 selection (Pang et al. 2006;Ponting 2017). Instead of relying on a single line of evidence such 32 as sequence conservation, an approach that integrates genetic, evolutionary, and biochemical 33 evidence has been suggested (Kellis et al. 2014). Based on this framework, predictive models 34 were established that were highly effective at identifying sequences with significant fitness cost integration of evolutionary and biochemical signatures could provide valuable insight in 38 distinguishing functional and noisy ITRs. 39 In this study, we investigate the extent of sequence and expression conservation as well 40 as potential functionality of ITRs using data from four Poaceae (grass) species: Oryza sativa 41 (rice), Brachypodium distachyon, Sorghum bicolor (sorghum), and Zea mays (maize). 1 Phylogenetically, these four species fall into two clades where one clade consists of maize and 2 sorghum diverged ~15 million years ago (MYA) (Skendzic et al. 2007; Liu et al. 2014), and the 3 other contains rice and B. distachyon diverged ~47 MYA (Massa et al. 2011). All four species 4 share ancient whole genome duplications (WGDs) (Paterson et al. 2004; Tang et al. 2010) while 5 the maize lineage experienced a more recent WGD ~12 MYA (Swigoňová et al. 2004). Thus, 6 studies of ITRs in these species are well-suited to assess not only sequence and expression 7 conservation of ITRs but also ITR duplicate retention post duplication. In addition, using rice as 8 an example, we generated function prediction models by integrating rice mutant phenotype, 9 sequence conservation, transcriptome, histone modification, DNA methylation, and nucleosome 10 occupancy data to predict functional ITRs genome-wide. 11 12

Identification and classification of Poaceae transcribed regions 14
To investigate the properties and potential functionality of polyadenylated intergenic 15 transcripts, we focused on four Poaceae species (Fig. 1A). In each of the four species, we 16 identified transcribed regions and classified them as parts of exons, introns, or pseudogenes 17 (see Methods). Transcriptome datasets were from developmentally-matched leaf, seed, and 18 reproductive tissues (Davidson et al. 2011;Davidson et al. 2012; Table S1). Transcribed 19 regions that did not overlap with gene or pseudogene annotation were considered as intergenic. 20 Intergenic transcribed regions (ITRs , Table S2) accounted for only 4-7% of mapped reads and 21 6-12% of transcribed regions (defined by continuous read mapping) in each species (Fig. 1B). 22 In contrast, 92-96% of mapped reads overlapped with annotated exons and/or introns (Fig. 1B). 23 Relationships between and transcriptome content in the four Poaceae species. (A) Phylogenetic relationships between rice (Os), B. distachyon (Bd), sorghum (Sb), and maize (Zm). Whole genome duplication (WGD) events are marked with yellow circles. MYA: millions of years ago. (B) Number of reads mapping to (left panel) and transcribed regions overlapping (right panel) exons (Ex; dark blue), introns (In; cyan), pseudogenes (Ps; red), and intergenic regions (Ig; orange). (C) Percent of nucleotides overlapped with transcribed regions that are annotated as exons, introns, pseudogenes, and intergenic regions. Abbreviations are the same as in (A). Ig' represents the proportion of intergenic space covered by transcribed regions that also overlap genic or pseudogenic regions. (D) Percent of transcribed regions that were classified as highly repetitive (lime), likely protein-coding (green), or neither of these two (Other; white).
We also found that only 0.4-2.6% of intergenic space was covered by transcribed regions in 1 each species (Fig. 1C). Thus, intergenic transcription in Poaceae is rare relative to genic 2 expression, consistent with previous findings in A. thaliana (Lloyd et al. 2018). Only a small 3 fraction of plant intergenic space is expressed compared to the ~60% intergenic expression 4 uncovered in the human genome (ENCODE Project Consortium 2012). 5 Despite the relative scarcity of intergenic transcripts, we identified 7,000 to 16,000 ITRs 6 in these four species (Fig. 1B, right panel, orange). We asked whether ITRs resemble known 7 annotated sequences and found that 16-23% of ITRs in the four species have significant 8 translated sequence similarity to plant proteins and/or contain a known protein domain ( Fig.  9  1D). In addition, only 56-94 ITRs (<1% in each species) contain known RNA gene domains. 10 ITRs resembling annotated genes may be unannotated exons, novel genes, or pseudogenes 11 that are still expressed. Only 7-23% of ITRs are highly repetitive (Fig. 1D), indicating that 12 expression of abundant transposable element families does not account for the majority of 13 intergenic expression in these species. Overall, 54-77% of ITRs among the four species do not 14 resemble annotated features and are not highly repetitive. 15 16 We next contrasted four expression propertiestranscript fragment length, expression 17 level and breadth, and reproducibility across experimentsbetween ITRs and transcribed 18

24
This is indicative of the presence of hotspots in intergenic space where reproducible 25 transcription originates, either due to the presence of truly functional sequences or noisy 26 transcription resulting from spurious regulatory signals.

27
In mammalian systems, intergenic transcripts have been suggested to be associated 28 with nearby genes (van Bakel et al. 2010), either as unannotated exon extensions or products of 29 run-on transcription. Interestingly, the expression correlations between ITRs and proximal 30 annotated gene neighbors (<500 bp) were significantly higher than expression correlation 31 between ITRs and distal gene neighbors (>500bp) or between proximal gene-gene neighbors 32 ( Fig. 2A). There are two potential explanations for this pattern: 1) ITRs proximal to genes may 33 be unannotated gene extensions or 2) gene regulation may influence expression patterns of 34 nearby unrelated transcripts. Consistent with the second explanation, we found that 35 pseudogenes, which are not gene extensions, proximal to gene neighbors exhibited similar 36 degrees of expression correlation as proximal gene-ITR pairs (Fig. 2). Taken together, the 37 expression characteristics of ITRs are consistent with noisy transcripts that are expected to be 38 short in length, and weakly and narrowly expressed (Struhl 2007). We also find no evidence to 39 support the notion that the majority of Poaceae ITRs represent unannotated exons of known 40 genes. In fact, among the four Poaceae species, ITRs were significantly more distant from 41 genes than randomly expected (U tests, all p<2x10 -36 ; Fig. 2B). 42 1

Cross-species sequence and expression conservation of ITRs 2
Sequence conservation due to selective pressure is a hallmark of functional genome 3 regions. We considered a sequence as conserved if it exhibited cross-species sequence 4 similarity significantly greater than that observed for random, unexpressed intergenic sequences 5 (see Methods). Under this framework, we found that only 15-19% of ITRs among the four 6 species were conserved across species (Fig. 3A). Significantly fewer ITRs were conserved 7 across species compared to transcribed exons (79-83% conserved), introns (21-35%), or 8 pseudogenes (32-54%; FET, all p<2x10 -9 ). Thus, ITRs primarily represent species-specific 9 sequences.

10
We next evaluated the expression conservation of ITRs by two measures. First, we 11 identified syntenic orthologs (see Methods) and determined whether regions in other species 12 orthologous to ITRs were also expressed. We found that 18-44% of conserved ITRs and their 13 orthologs were both expressed, significantly lower than the percentages for transcribed exons 14 and their orthologs (91-97%, FET, all p<2x10 -10 ; Fig 3B). Thus, species-specific gain of 15 expression may explain ITR expression. Alternatively, if the ancestral ITR sequence was 16 expressed, maintenance of ITR expression across species is rare. However, significantly fewer 17 unexpressed intergenic sequences had an ortholog that was expressed (0-7%) compared to 1 ITRs (FET, all p<0.003), indicating that, when both an ITR and its ortholog are expressed, this 2 expression conservation likely represents non-random events that are under selection. ITRs 3 with significant similarities to protein coding genes ( Fig. 1D) are more likely to have conserved 4 expression across species, as 32% of ITRs with similarity to protein-coding sequences were 5 associated with an expressed ortholog compared to 11% of ITRs without protein similarity (FET, 6 p<3x10 -6 ). We also identified sequence blocks conserved across all four species (see Methods). 7 Among 639 blocks that were intergenic in all species, we found that 15% were expressed in at 8 least one species (Fig. S2A), compared to >99% of blocks composed of exons in all four 9 species (FET, p<6x10 -11 ; Fig. S2B). When an intergenic conserved sequence block was 10 expressed, the expression was limited to one species in 74% of cases. While expression in two, 11 three, or four species was documented in only 18%, 6%, and 1.3% (n=1) cases, respectively 12 ( Fig. S2A), compared to 74% of exon blocks that were expressed in all four species (Fig. S2B).

13
Overall, we find that the two analyses lead to the same conclusion: the few ITRs with sequence 14 conservation rarely exhibit conserved expression.

15
For the second measure of expression conservation, we evaluated whether ITRs and 16 their expressed orthologs were expressed in similar tissues. For a pair of orthologs, we 17 calculated percentage of tissue commonly expressed (% commonality, see Methods).

Sequence and expression conservation of transcribed regions. (A)
Heatmaps of crossspecies sequence conservation. Blue: conserved (cross-species sequence similarity significantly greater than that observed for random, unexpressed, non-repetitive intergenic sequences). Gray: not conserved. Species abbreviations follow Fig. 1. Ex: exon, In: intron, Ps: pseudogene. (B) Percentages of syntenic orthologs with expression evidence. Syntenic orthologs were identified by conservation across pairs of cross-species syntenic blocks (see Methods). Ex: exon, In: intron (C,D) Percent commonality in tissue expression between pairs of cross-species homologous exons (based on similarity but not synteny), ITR (panel C), introns (panel D), and randomly-generated expression vectors (Rn). Given the influence that expression breadth exerts on expected % commonality, comparisons among orthologous transcribed exons and random expression vectors where matched in expression breadth to that observed in orthologous ITR and transcribed intron pairs (see Methods). Note that this results in distinct distributions for transcribed exons and random expression vectors for ITRs and transcribed introns. commonality=27%) was no different than random (mean=26%; U test, p=0.33; Fig. 3D). These 1 findings suggest that ITRs with cross-species expression conservation, particularly those with 2 higher % tissue commonality, may be enriched for sequences under selection. 3

Sequence and expression conservation of ITR paralogs 4
In the previous section, we examined cross-species sequence and expression 5 conservation. Considering that plant genomes harbor a rich history of large-and small-scale 6 duplication events, we next asked whether ITR duplicates tend to be retained over time, and if 7 so, whether the paralogous ITRs have similar expression patterns. We excluded highly-8 repetitive sequences (Fig. 1D) from the analyses because they were duplicated by definition. 9 We found that 47-50% of rice, B. distachyon, and sorghum ITRs were duplicated, compared to 10 45-48% of transcribed exons and 31-66% of random, unexpressed intergenic sequences ( Fig.  11 S3). In maize, 82% of ITRs, 74% of transcribed exons, and 95% of random, unexpressed 12 intergenic sequences were duplicated (Fig. S3). The high percentage of duplicated sequences 13 from random, unexpressed intergenic regions in maize is due to an overall greater number of 14 paralogs identified among maize exons and repetitive sequences, which were used to define 15 thresholds to call a sequence as repetitive (see Methods). Based on the estimated base 16 substitution rates (K) as a proxy for time ( Fig. 4A), ITR duplicates in all species were generated 17 from more recent duplication events compared to transcribed exons (U tests, all p<6x10 -127 ), 18 instead more closely resembling introns and pseudogenes that experience weaker or no 1 selection. Together with the rarity of ITR duplicates at larger K (>0.06, Fig. 4A) compared to 2 exons in annotated genes, most ITR duplicates are likely not maintained. 3 To more definitively determine the relationship between the timing of duplication events 4 and ITR duplicate retention, we identified ITRs present in duplicated genome blocks derived 5 from whole genome duplication (WGD) events (referred as syntenic blocks, Fig. 4B; Fig. S4). 6 We first looked at duplicate sequences derived from a WGD event 12 MYA in maize 7 (Swigoňová et al. 2004; see Methods). Among the 4,906 non-repetitive maize ITRs in these 8 syntenic blocks, 463 (9%) had syntenic duplicates (Fig. 4C), significantly more than that of 9 random, unexpressed intergenic regions (2%; FET, p<4x10 -10 ) but significantly fewer than 10 protein-coding genes (36%, FET, p < 3x10 -16 ). Thus, as many as 91% of ITR duplicates were 11 deleted, mutated beyond recognition, or translocated within a span of 12 million years. When 12 considering syntenic blocks from the much older ρ and σ WGD events >70 MYA in all four 13 species, only 1.0% (49 of 4,859) of ITRs had syntenic duplicates, compared to 28-31% of 14 protein-coding genes, indicating an overall lack of ancient duplicates. Nonetheless, a greater 15 proportion of ITRs have ρ or σ WGD-derived duplicates compared to random, unexpressed 16 intergenic regions (0.2%; FET, p<0.002; right panel, Fig. 4C), suggesting that a subset of these 17 ITR duplicates have been retained for over 70 MY. 18 We next assessed whether expression was conserved among duplicate ITRs and found 19 that 18-34% ITR duplicates (most similar paralogs) were also expressed depending on the 20 species, compared to 79-84% of paralogs of transcribed exons (FET, all p<6x10 -10 ). In addition, 21 ITR paralogs generated from recent duplication events are more likely to be expressed than 22 those from older events (U tests, all p<3x10 -5 ; Fig. 4D). This suggests that the expression of 23 ITRs may be initially maintained following duplication, but frequently lost over time. By contrast, 24 older duplicates of transcribed exons are more frequently expressed than recent duplicates (U 25 tests, all p<8x10 -109 ; Fig. 4D). Compared to ITR duplicates, fewer duplicates of random 26 unexpressed intergenic sequences were expressed (2-7%; FET, all p<4x10 -10 ). Overall, ITRs 27 were duplicated nearly as often as annotated genes but significantly fewer ITR paralogs were 28 maintained over time or expressed compared to paralogs of annotated genes. In addition, the 29 negative relationship between timing of duplication and percent expressed ITR paralogs 30 suggest transcription of ITR duplicates tend to be lost over time, characteristics that could be 31 indicative of sequences that are not under selection. 32 Predicting the functionality of rice transcribed regions 33 To further evaluate the selected effect functionality of ITRs, we first examined which 34 evolutionary and biochemical characteristics can distinguish whether a transcribed region 35 resembled a phenotype gene or pseudogene. Due to data availability, we focused solely on rice. 36 For benchmark functional sequences, we identified 513 transcribed regions that overlap exons 37 of genes with documented loss-of-function phenotypes (referred to as phenotype exons, Table  38 S3). Benchmark nonfunctional sequences consisted of 262 transcribed regions that overlapped 39 pseudogenes but not annotated genes (transcribed pseudogenes, Table S3). To distinguish 40 between the benchmark functional and nonfunctional sequences, 44 evolutionary and 41 biochemical features in five categories were examined including transcription activity, sequence 42 conservation, DNA methylation, histone modifications, and nucleosome occupancy (  Table S4). 6 Although these naïve classifiers perform better than random guessing (AUC-ROC=0.5), 7 they were far from perfect (AUC-ROC=1). Therefore, we considered all 44 features in 8 combination using a machine learning approach to generate a function prediction model (see 9 Methods). The resulting model could distinguish between phenotype exons and transcribed 10 pseudogenes with high AUC-ROC (0.94; Fig. 5A) and precision/recall (Fig. 5B). The model 11 provides a score between 0 and 1 (referred to as functional likelihood) where higher and lower 12 scores indicate similarity to phenotype exons and transcribed pseudogenes, respectively ( Fig.  13 5C-H, Table S5). Using a stringent functional likelihood threshold (T1=0.60) where only 5% of 14 pseudogenes are misclassified as functional, 75% of phenotype genes are correctly predicted 15 as functional (Fig. 5C). We also defined a second threshold (T2=0.29) where 5% of phenotype 16 exons are misclassified as pseudogenes and the majority of pseudogenes (70%) remain 17 correctly predicted (Fig. 5D). These findings indicate that the functional prediction model can 18 distinguish between functional and nonfunctional sequences. Given most of the benchmark 19 phenotype exons were protein-coding, we applied the functional prediction model to 22 1 transcribed miRNAs. Most transcribed miRNAs (64%) had functional likelihood values between 2 T1 and T2 and cannot be confidently classified as either phenotype exon-like or pseudogene-like 3 (Fig. 5E), indicating that the idiosyncrasies of miRNA sequences are not completely captured by 4 the model. However, we cannot rule out the possibility of false-positive annotations present in 5 the set of transcribed miRNAs. Nevertheless, most miRNA sequences (73%) are not predicted 6 as pseudogene-like (i.e. functional likelihood < T2). 7 We applied the function prediction model to all rice transcribed regions that were not 8 used to establish the models, including transcribed exons, transcribed introns, and ITRs. Among 9 transcribed exons, 67% were predicted as functional based on T1 and 11% were pseudogene-10 like based on T2 (Fig. 5F). Among transcribed introns that are not annotated as alternatively 11 spliced exons, only 3% were predicted as functional based on T1 and the majority (57%) were 12 predicted as pseudogene-like based on T2 (Fig. 5G), indicating that the functional prediction 13 model can effectively distinguish between exon and intron sequences. Next, we applied the 14 prediction models to ITRs and found that 8% were predicted as functional based on T1 and 61% 15 were pseudogene-like based on T2 (Fig. 5H). We found that there was a significant negative 16 correlation between ITR functional likelihoods and their distance to neighboring genes 17 (Spearman's rank correlation, ρ = -0.17, p<3x10 -47 ), with notably higher likelihoods among ITRs 18 <500 bp from a gene (Fig. S5). It remains unclear whether functional predictions among ITRs 19 are influenced by the chromatin state of neighboring genes or these ITRs represent truly 20 functional sequences. Ultimately, the majority of rice ITRs (61%) are predicted as non-functional 21 based on T2, which is consistent with the proportion of potentially-functional ITRs in A. thaliana 22 (Lloyd et al. 2018). By and large, the functional likelihood distribution of ITRs are similar to those 23 of pseudogenes and introns, but distinct from miRNA (peaking at middle functional likelihood) 24 and phenotype exons. These findings suggest that ITRs may primarily result from transcriptional 25 noise, although a small percentage are likely novel genes.

27
Identification of evolutionary characteristics that closely associate with functional 28 sequences 29 With subsets of ITRs predicted as functional and nonfunctional, we next sought out how 30 well sequence and expression conservation characteristics correlated with functional 31 predictions. Among rice ITRs with cross-species homologs in B. distachyon, sorghum, or maize 32 (defined based on sequence similarity, not synteny). Among rice ITRs with cross-species 33 homologs, 36% were predicted as functional based on T1, significantly higher than 3% for ITRs 34 without putative homologs (Fisher's Exact Test, p<2.2x10 -16 , Fig. 6A, left column), although 35 lower than that for exons (80%; FET, p<2.2x10 -16 , Fig.6A, right column). We next tested 36 whether rice ITRs with expressed orthologs (defined based on synteny) in B. distachyon 37 (diverged ~47 MYA) would primarily represent functional sequences and found that 50% of ITRs 38 with expressed orthologs were predicted as functional based on T1, significantly higher than 39 ITRs with non-expressed orthologs (12%; FET, p<0.05; Fig. 6B, left column). Similarly, 81% of 40 transcribed exons with an expressed ortholog were predicted as functional based on T1 (Fig.  41 6B, right column). Since cross-species homology was used as a feature for model building, the 42 correlation is not surprising. But expression of an ortholog was not used as a feature, indicating 1 that our findings provide strong support for ITR functionality as defined by the machine learning 2 model. We next investigated the relationship between functional likelihood of ITRs/exons and 3 duplication status. We found that single-copy ITRs exhibited significantly higher functional 4 likelihood (median = 0.26) compared to ITR duplicates (median = 0.21; U test, p<3x10 -16 ; Fig.  5 6C, left column), but lower than those of singleton transcribed exons (median = 0.78, U test, 6 p<3x10 -16 , Fig. 6C, right column). We also found that ITRs with expressed paralogs tend to 7 have higher functional likelihood (median= 0.25) compared to those with unexpressed paralogs 8 (median = 0.18; U test, p<3x10 -16 ; Fig. 6D). Despite significant correlation with functional 9 likelihood, only 9% and 10% of single-copy ITRs and ITRs with an expressed paralog, 10 respectively, were predicted as functional based on T1.

11
Among duplicated ITRs, younger duplicates tend to have lower functional likelihood 12 scores (p<0.02; Fig. 6E), but the correlation is exceedingly weak (ρ=0.05), similar to those of 13 pseudogenes (ρ=0.11, p=0.15; Fig. 6E ITRs with retained duplicates from the >70 MYA ρ or σ WGD events. Five of these sequences 18 (46%) were predicted as functional based on the stringent T1, while the other six of the 11 ρ/σ 19 ITR duplicates failed to be predicted as non-functional based on T2. Together with cross-species 20 sequence and expression conservation patterns, ancient duplicate retention provides strong 21 support for the utility of our computational models for assessing sequence functionality and 22 reaffirm our assertion that the majority of ITRs have pseudogene-like properties and likely 1 represent noisy transcription. Meanwhile, a relatively small subset of ITRs with high functional 2 likelihood may represent novel genes that warrant further study. 3 4

5
In this study, we investigated the cross-species and post-duplication evolutionary characteristics 6 of intergenic transcribed regions (ITRs) in four grass species. ITR sequences are primarily 7 species-specific and, when ITR orthologs can be identified, the majority are not expressed. 8 Similarly, ITR paralogs, which tend to be from recent duplication events, are usually not 9 expressed. Thus, expression of an orthologous or duplicated ITR either evolved post-10 speciation/duplication or ITR orthologs/paralogs tend to quickly lose ancestral expression 11 states. Either case suggests that expression among ITRs and related sequences may evolve 12 more rapidly compared to the significantly higher degrees of cross-species and duplicate 13 expression conservation seen among exons. We also found only rare examples of ITRs with 14 paralogs retained from whole genome duplication events occurring ~12-70 MYA, suggesting 15 there may be little selective pressure on most ITRs exerted through, for example, 16 neofunctionalization. In addition to evolutionary characteristics, we evaluated the transcriptional 17 features and found that ITRs tend to exhibit weak and narrow expression compared to 18 annotated protein-coding exons. Overall, the evolutionary and expression characteristics 19 explored here are consistent with the notion that intergenic expression is not subjected to strong 20 selection and primarily represents transcriptional noise. 21 We also established robust functional prediction models for rice transcribed sequences 22 using 44 evolutionary and biochemical characteristics of benchmark phenotype genes and 23 expressed pseudogenes. These models distinguished between phenotype genes and 24 pseudogene sequences with high accuracy. Based on model predictions, 583 ITRs in rice (8%) 25 were classified as likely functional based on a conservative estimate of functionality (FPR of 26 5%). A previous study in Arabidopsis thaliana that employed selection within and between 27 species as a criterion for defining likely functionality also found ~5% of the sequences functional 28 (Moghe et al, 2013). In addition, in A. thaliana, a similar data integration framework and 29 conservative functional estimate found that 7% of A. thaliana ITRs in were predicted as 30 functional (Lloyd et al., 2018), suggesting that this proportion may represent a reasonable 31 expectation for the proportion of potentially functional ITR sequences in plants.

32
Results from function prediction models indicated that ITRs with two characteristics: (1) 33 cross-species expression conservation and (2) ancient retained duplicates primarily resemble 34 phenotype genes. More importantly, prediction models also classified 203 (3%) of rice ITRs that 35 lack sequence conservation as functional. This finding indicates that the data integration 36 framework applied here can identify ITRs that may function in species-specific roles or when 37 sequence conservation is not readily detectable (Ponting 2017

5
Gene, pseudogene, and random intergenic annotation 6 The MAKER-P (r1065) genome annotation pipeline was used to reannotate the Oryza pipeline to mask repetitive elements. To aid in gene prediction, EST evidence was provided by 13 transcriptome assemblies generated from selected publicly available data from the National 14 Center for Biotechnology Information Sequence Read Archive (NCBI-SRA) ( contiguous ambiguous nucleotides (Ns) as likely-unmappable, a length that represented the 23 size of reads in RNA-sequencing (RNA-seq) datasets used in our analyses (see below). Likely-24 unmappable regions were excluded when calculating the size of total exon, intron, pseudogene, 25 and intergenic space in Fig. 1B. Random sets of intergenic coordinates with equal length 26 distributions as intergenic transcribed regions (ITRs, see below) were identified in each species.

27
Random intergenic coordinates were sampled so that they did not overlap one another, 28 transcribed regions, or likely-unmappable genome regions (as described above). We further 29 filtered random intergenic coordinates to remove those that contained an ambiguous nucleotide, 30 a step that removed 4-9% of random coordinates in each species. 31

Identification and classification of transcribed regions 32
Multiple developmentally-matched RNA-seq datasets from rice, B. distachyon, sorghum, 33 and maize (n = 11, 11, 10, and 14 respectively) were retrieved from the Sequence Read Archive 34 at the National Center for Biotechnology Information ( in length and with an average Phred score ≥20 were mapped to associated genomes using 38 TopHat v. for sequence-specific biases. Transcribed regions from across datasets that overlapped with 2 one another by at least 1 nucleotide were merged. Merged transcribed regions were classified 3 based on overlap with annotated exon, intron, and pseudogene sequences through a priority 4 system: exon > intron > pseudogene (e.g. a transcribed region that overlapped both an exon 5 and an intron was classified as exon). Transcribed regions that did not overlap with any gene or 6 pseudogene annotation were classified as intergenic. 7 Transcribed regions were further categorized as likely protein coding or repetitive. Likely 8 protein-coding transcribed regions were defined as those that contained a protein domain from 9 Pfam v. 31 were considered likely-repetitive. We also identified a threshold based on the number of 24 duplicate sequences required to call a sequence as highly-repetitive based on the duplicate 25 counts of long terminal repeats (a subset of the interspersed repetitive elements identified by 26 RepeatMasker) and transcribed regions that overlapped exons, through F-measure 27 maximization. Resulting duplicate count thresholds to consider a transcribed region as repetitive 28 were 10, 15, 33, and 202 for rice, B. distachyon, sorghum, and maize, respectively. For all 29 duplication-related analysis, we excluded sequences that were classified as repetitive, as they 30 were considered duplicated by definition. 31

Sequence and expression conservation 32
Cross-species sequence matches were identified using BLAST searches (BLAST v.  Expression conservation across species and following duplication was assessed by 3 determining whether a cross-species or within-species match was overlapped by a transcribed 4 region (as described above). Reciprocal matches between exon, intron, and intergenic 5 transcribed regions for cross-species syntenic gene blocks (see below) were included in percent 6 commonality calculations (Fig. 3B,C), which was calculated as the number of expressed tissues 7 in common between a pair of transcribed regions divided by the total number of expressed 8 tissues. The five tissues that were in common among all four species were considered: embryo, 9 endosperm, seed, leaf, and anther. When replicate datasets were available, transcription 10 evidence in a single dataset was required to consider a sequence expressed in the tissue.

13
Expected percent commonality is affected by the expression breadth of two transcripts. 14 For example, two orthologous transcribed regions that are both expressed in four out of five 15 tissues have a minimum % commonality of 60%, while two orthologous transcribed regions both 16 expressed in a single tissue have 20 possibilities for 0% commonality. As transcribed exons are 17 expressed in more tissues than ITRs or transcribed introns (Fig. S1C) showed the same expression breadth as the ITR pair were identified. For example, in an 22 orthologous ITR pair, if one sequence was expressed in one tissue and the second sequence in 23 was expressed in two tissues, all orthologous transcribed exon pairs where one sequence was 24 expressed in one tissue and the second was expressed in two tissues were identified. From this 25 expression breadth-matched set of orthologous transcribed exon pairs, two pairs were randomly 26 selected and used for % commonality calculations (Fig. 3C). The same framework was used to 27 generate direct transcribed exon comparisons for orthologous transcribed intron pairs (Fig. 3D). 28 Note that because the exon comparison was tailored specifically to the expression breadth 29 distributions observed in pairs of orthologous ITRs or transcribed introns, the distribution of % 30 commonality among transcribed exons is distinct in Fig. 3C and Fig. 3D. Random expectations 31 of percent commonality (Fig. 3C,D, Rn) were calculated by randomly selecting tissues to match 32 the expression breadth of intergenic and intron transcribed region pairs. For each orthologous 33 ITR and transcribed intron pairs, 25 random pairs of tissues were selected that matched the 34 observed expression breadth. Similar to the comparisons generated among orthologous 35 transcribed exon pairs, the random pairs were generated specific for the expression observed in 36 ITR or intron pairs, and so the distribution of random % commonality is distinct for the two 37 sequence types. The probability that a tissue would be selected for a random set was 38 proportional to how often it appeared among transcribed regions with expression conservation. 39 40 We identified cross-species and within-species syntenic blocks by identifying of sets of 41 collinear genes using MCScanX v.2 (Wang et al. 2012). Multiple minimum gene pair and 42 maximum gap parameters were tested when generating collinear blocks (Table S7) and we  1 utilized values of 10 and 10, respectively, for within-species collinear blocks and 10 and 2, 2 respectively, for cross-species blocks. Cross-species blocks were identified between the more-3

Identification of syntenic gene blocks
closely-related species pairs: rice/B. distachyon and maize/sorghum. The rates of synonymous 4 substitutions (Ks) between homologous protein-coding genes anchoring within-species syntenic 5 blocks were calculated using the yn00 package in the Phylogenetic Analysis by Maximum 6 Likelihood software (Yang 2007). Syntenic blocks with a median Ks ≥ 0.7 across all anchor 7 genes in the syntenic block in all four species were considered associated with the ρ / σ whole blocks in maize with a median Ks < 0.7 were associated with the more recent 12 MYO maize 10 WGD (Fig. S4) (Swigoňová et al. 2004). An additional 254 gene-pair syntenic block in rice was 11 identified with a median Ks of 0.13, suggesting recent duplication. However, due to the 12 uncertain nature of the origin and timing of the duplication event (Wang et al. 2011), this block 13 was excluded from further analysis. For exon and intron transcribed regions, syntenic duplicates 14 and orthologs were identified as BLASTN matches (E-value<1x10 -5 ) that overlapped a 15 homologous anchor gene. For ITRs, and random intergenic sequences, syntenic duplicates and 16 orthologs were BLASTN matches that were present within corresponding block regions 17 circumscribed by homologous anchor genes (see Fig. 4B). 18

Rice functional prediction features 19
Transcription activity features 20 Twelve transcription-related features were generated for use in rice function prediction 21 models (Table S4). The first 9 features were FPKM values (referred to as expression level; 22 Level in Table S4) from each of the 9 tissues represented in the RNA-seq datasets described 23 above. For replicate datasets (leaf and endosperm), expression level was taken as the average 24 FPKM value if a region was expressed in both replicate datasets, and the single FPKM value 25 otherwise. If a transcribed region was not expressed in an RNA-seq dataset, expression level 26 was set to 0. Two additional features were represented by the maximum expression level 27 among all RNA-seq datasets and the median FPKM among datasets where a transcribed region 28 was expressed. The final feature was the expression breadth of a sequence, represented by the 29 total number of tissues in which a sequence was expressed. For the breadth calculations, both 30 seed datasets (5 and 10 DAP) and inflorescence datasets (early and emerging) were 31 considered as a single tissue. 32 Sequence conservation features 33 Three sequence conservation-related features were generated. The first feature was the 34 minimum BLASTN E-value to a sequence in B. distachyon, sorghum, or maize. A threshold of 35 1x10 -5 was required to consider a match significant. Sequences without a significant cross-36 species match where given a score of 0. The second sequence-conservation feature was based 37 on blocks of conserved nucleotides present across all four species were identified using the conserved nucleotide blocks (CNBs; n=60,801), we identified those that were exonic 42 (n=13,616), intronic (n=125), or intergenic (n=525) in all four species and were overlapped only 1 by transcribed regions of the associated type (e.g. a conserved nucleotide block that was 2 intergenic in all four species and was only overlapped by ITRs). The second conservation 3 feature was the proportion of a sequence that overlapped a CNB. phastCons scores were 4 calculated for each rice nucleotide within a conserved nucleotide block (Siepel et al. 2005), with 5 conserved and non-conserved states estimated using the --estimate-trees option. The third 6 sequence-conservation feature was the median per-base phastCons score across the 7 proportion of a sequence that overlapped a conserved nucleotide block. If a sequence did not 8 overlap a conserved block the phastCons score was set to 0. 9 Histone mark and nucleosome occupancy features 10 Twelve histone mark-related features among 10 histone marks were calculated based 11 on 18 chromatin immunoprecipitation sequencing (ChIP-seq) datasets from NCBI-SRA (Table  12 S8). Reads were trimmed as described above and mapped to the rice Michigan State University false discovery rate (FDR) ≤0.05 was utilized (see Table S8). Peaks for histone marks with 18 multiple datasets were merged. The first 10 histone mark-related features were represented by 19 the percent overlap of a sequence with histone mark peaks for each histone marks ('coverage ' 20 in Table S4). The other two features were the number of activation-associated histone marks 21 and repression-associated histone marks with a peak that overlapped a sequence. Eight marks 22 were considered activation-associated and two were considered repression-associated ( Table  23 S8). A single nucleosome occupancy feature was also generated from micrococcal nuclease 24 sequencing (MNase-seq) data. MNase-seq data was generated by Wu et al. (2014) and 25 processed by Liu et al. (2015). The nucleosome occupancy feature was calculated as MNase-26 seq average read depth across the length of a sequence.

27
DNA methylation features 28 Sixteen DNA methylation features were calculated from bisulfite-sequencing (BS-seq) 29 datasets from four tissues: embryo (SRR059000), endosperm (SRR059005), leaf (SRR618545), 30 and panicle (SRR1520042). BS-seq reads were trimmed as described above and processed 31 with Bismark v.3 (default parameters) to identify the number of reads that call cytosines as 32 methylated and unmethylated in CG, CHG, and CHH (H = A, C, or T) contexts. The first 12 33 features were represented by the methylation level of a sequence in CG, CHG, and CHH 34 contexts for each of the four tissues. Methylation level of a sequence was calculated as the 35 number of reads that mapped to CG, CHG, or CHH sites that called a cytosine site as 36 methylated divided by the total number of reads mapping to CG, CHG, or CHH sites within the 37 sequence. A minimum of 5 reads across 5 cytosine sites were required to calculate methylation 38 level. Multiple minimum read (range: 1-20) and site (range: 1-10) requirements were tested and 39 found to have little effect on the ability of resulting methylation level features to distinguish 40 between phenotype exons and transcribed pseudogene sequences ( Table S9). The final four 41 DNA methylation features were represented by whether a sequence exhibited a methylation 42 pattern consistent with gene body methylation (GBM) in each of the four tissues. The GBM 1 pattern is presence of CG methylation and absence of CHG and CHH methylation. Presence or 2 absence of CG, CHG, and CHH states within a sequence were determined by binomial tests of 3 the methylation level of a sequence (as described above) compared to the background 4 methylation level across the whole genome for a given cytosine context. Binomial test P-values 5 were corrected for multiple testing using the Benjamini and Hochberg method (FDR ≤ 0.05) 6 (Benjamini and Hochberg 1995). Sequences that were significantly enriched in CG methylation 7 relative to the genome background and not significantly enriched in CHG and CHH methylation 8 were considered to be gene body methylated. 9 Machine learning approach 10 The machine learning approach consisted of four steps: establishing a set of benchmark 11 functional and nonfunctional sequences, identifying characteristics associated with benchmark 12 sequences, integrating these characteristics via statistical learning methods to generate a 13 functional prediction model, and applying function prediction models to all transcribed regions, 14 including ITRs. We utilized the random forest algorithm implemented in the Scikit-learn software Benchmark nonfunctional sequences were represented by transcribed regions that overlapped 20 pseudogene annotation (referred to as transcribed pseudogenes). We further filtered out 21 pseudogenes that overlapped an exon from the MSU v.7 gene annotation. Benchmark non-22 coding sequences were transcribed regions that overlapped a set of high-confidence miRNA 23 gene annotations from miRBase (referred to as transcribed miRNAs) (Kozomara and Griffiths-24 Jones 2014). Prior to model building, missing data points among features were imputed by 25 sequential regression imputation implemented in the mice package in R (m = 1, seed = 500) 26 (van Buuren and Groothuis-Oudshoorn 2011).

27
A machine learning-based prediction to distinguish between phenotype exons and 28 transcribed pseudogenes was generated. We generated 100 balanced datasets that included 29 equal proportions of phenotype exon and transcribed pseudogene sequences that were used 30 for training and 10-fold cross-validation was implemented (i.e. 90% of a dataset was used for 31 training and the held-out 10% used for testing). Parameter sweeps of maximum tree depth (3,5,32 10, and 50) and proportion of random features (10%, 25%, 50%, 75%, square root, and log2) 33 values were performed, with 10 and 25% providing the highest performance, respectively. For 34 each of the 100 balanced datasets, a prediction score for each transcribed region was 35 generated that was equal to the proportion of 500 random forest trees that predicted the region 36 as a phenotype exon; a final prediction score was calculated as the median of the 100 scores 37 from the balanced datasets. Two thresholds to predict transcribed regions as likely-functional 38 were generated based on a false positive rate of 5% (T1; score threshold = 0.60) and a false 39 negative rate of 5% (T2; score threshold = 0.29). The Python script utilized to generate these 40 predictions (ML_classification.py) is available on GitHub: https://github.com/ShiuLab/ML-41 Pipeline. Prediction models were also generated using the support vector machine and logistic 42 regression algorithms implemented in Scikit-learn, but random forest provided the highest 1 performance by AUC-ROC. on a 5% false negative rate. Blue, purple, and red: phenotype exon-like, ambiguous, and 3 transcribed pseudogene-like, respectively. Intergenic sequences were randomly-sampled to determine the background expected 25 reproducibility (Rn). random intergenic (Rn) sequences. Fisher's exact tests were used to test significance between 33 the proportion of sequences that were duplicated (sum of Ex, In, Ps, and ITR duplicate 34 proportions) versus proportion of singletons. Although duplication rates between transcribed 35 exons and ITRs could be significantly different, almost all differences were less than 5%, 36 indicating that the presence of a duplicate is not informative to whether a sequence is likely part 37 of an annotated gene. *: p<0.05, ***: p<0.001, n.s.: not significant, p≥0.05, In: intron, Ps: 38 pseudogene. 39 40 not included in further analysis. Os: rice, Bd: B. distachyon, Sb: sorghum; Zm: maize. 8 9 Fig. S5. Boxplots of the relationship between distance to nearest gene and predicted functional 10 likelihood for ITRs 11 12

SUPPLEMENTAL TABLES
13 Table S1. NCBI-SRA datasets used in transcribed region identification. 14 Table S2. Coordinates of intergenic transcribed regions in four Poaceae species. 15 Table S3. Machine learning instance coordinates and feature values. 16 Table S4. Machine learning feature list and single-feature AUC-ROC performance. 17 Table S5. Functional likelihood scores from machine learning predictions. 18 Table S6. NCBI-SRA datasets used in gene annotation. 19 Table S7. Parameters tested for syntenic block identification. 20 Table S8. NCBI-SRA datasets used in histone mark peak identification. 21 Table S9. Parameters tested for DNA methylation level feature calculation. 22 23