EORNA, a barley gene and transcript abundance database

A high-quality, barley gene reference transcript dataset (BaRTv1.0), was used to quantify gene and transcript abundances from 22 RNA-seq experiments, covering 843 separate samples. Using the abundance data we developed a Barley Expression Database (EORNA*) to underpin a visualisation tool that displays comparative gene and transcript abundance data on demand as transcripts per million (TPM) across all samples and all the genes. EORNA provides gene and transcript models for all of the transcripts contained in BaRTV1.0, and these can be conveniently identified through either BaRT or HORVU gene names, or by direct BLAST of query sequences. Browsing the quantification data reveals cultivar, tissue and condition specific gene expression and shows changes in the proportions of individual transcripts that have arisen via alternative splicing. TPM values can be easily extracted to allow users to determine the statistical significance of observed transcript abundance variation among samples or perform meta analyses on multiple RNA-seq experiments. * Eòrna is the Scottish Gaelic word for Barley.

and Salmon from short-read RNA-seq data 20,21 . Levels of expression from these tools are measured in Transcripts per million (TPM) for a given BaRTv1.0 transcript 22 . Quantification at the transcript level further allows robust and routine analysis of alternative splicing [23][24][25] . Here we used the barley reference transcript dataset, BaRTv1.0, to demonstrate the value and utility of a barley RTD for gene expression studies and AS analysis. We used BaRTv1.0 to quantify transcripts in 22 RNA-seq datasets covering 843 samples from a broad range of genotypes, tissues and different abiotic and biotic stress conditions. BaRTv1.0 was assembled against the cv. Morex genome, but in this analysis we used RNA-seq data from a wide-range of cultivars and lines and found that mapping rates in all cultivars remained high. We found expression and alternative splicing abundances varied between cultivars, tissues/organs and between environmental changes and stresses. The data is presented in a freely available single accessible database that gives visual and numerical access to expression data for barley genes across all the tested barley samples (https://ics.hutton.ac.uk/eorna/index.html. The importance of comparing between sample sets allows researchers to answer how their gene of interest is expressed in other tissues or under what condition. Commercially available Genevestigator ® 26,27 and the freely available Bio-Analytic resource (BAR) 28,29 visualise barley gene transcriptional expression and regulation RNA-seq and microarray data across multiple experimental conditions. Here, individual transcript RNA-seq expression results are displayed in graphical form, simply as TPM values directly from the outputs of Salmon, without considering batch differences that may occur between samples, differences among experimental studies and without statistical significances. To include statistical analysis and thereby define significant differential gene expression (DE) or differential alternative splicing (DAS) would require complete control over experimental design, sample preparation and sequencing analysis. These interactive plots, therefore, simply permit rapid visual assessment of expression levels of selected genes of interest. TPM values are accessible and allow users to perform their own DE and DAS analysis, such as found in the 3D RNA-seq interactive graphical user interface 30 or by comparing multiple RNA-seq datasets by meta-analysis methods [31][32][33][34] . Output expression values such as TPM from RNA-seq experiments are under continuous discussion and development and may be affected by sequencing protocols and experimental conditions 35 . TPM values were calculated using Salmon to allow transcript abundances to be compared between samples. To check that the TPM values were representative as expression values, we determined variability across all the samples using linear regression analyses and found that the output from Salmon showed the lowest variability and therefore provided the best normalisation across all the samples.
We show examples of genes that clearly illustrate the wide utility offered by access to datasets from multiple RNA-seq experiments. The plots identified genes that were uniquely expressed in a cultivar, tissue or condition specific manner. Considering the range of samples displayed, the unique abundances in tissue-or condition-specific samples support the potential value of these genes as expression 'biomarkers' for that tissue or condition. The plots identified cis-and trans-acting induced (or loss of) expression of genes that segregate among near isogenic lines or mutant populations, identified cultivar specific polymorphisms or insertion/deletions and alternatively spliced transcripts including significant switching in splice site selection as a response to a condition were found. Alteration of transcript isoform abundance can alter translational reading frames or transcript stability. Ultimately, BaRT RTD is part of a unique pipeline that facilitates fast robust routine quantification of barley gene transcripts, visualised in EORNA through interactive intuitive transcript abundance plots linked to gene models and metadata, finally leading to robust and consistent estimation of barley gene expression and alternative splicing across multiple samples.

Methods
Selected RNa-seq datasets and data processing. A total of 22 publicly available RNA-seq datasets consisting of 843 samples including replicates were downloaded from NCBI -Sequence Read Archive database (https://www.ncbi.nlm.nih.gov/sra/) to quantify against the barley RTD (BaRTv1.0) (Supplementary Table S1). All datasets were produced using Illumina platforms and were selected with mostly > 90 bp and paired-end reads with a quality of q > = 20. All raw data were processed using Trimmomatic-0.30 36 using default settings to preserve a minimum Phred score of Q20 over 60 bp. One of the samples (NOD1) was over-represented with respect to read numbers due to a repeat run being necessary and was therefore subsampled to 60 million reads. Read quality checks before and after trimming were performed using FastQC (fastqc_v0.11.5) (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

Generation of the EORNa database.
A database and website front-end were constructed to allow easy access to BaRTv1.0 transcripts and expression analyses using the LAMP configuration (Linux, Apache, mySQL, and Perl). Additional annotation was added to the transcripts by homology searching against the predicted peptides from rice (rice pseudo-peptides v 6.0 37 ) and from Arabidopsis thaliana (TAIR pseudo-peptides v 10, The Arabidopsis Information Resource) 38 using BLASTX at an e-value cutoff of less than 1e-50 39 . The website https:// ics.hutton.ac.uk/eorna/index.html allows users to interrogate data through an entry point via three methods: (i) a BLAST search of the reference barley assembly or the predicted transcripts; (ii) a keyword search of the derived rice and Arabidopsis thaliana BLAST annotation, and; (iii) a direct string search using the transcript, gene, or contig identifiers. To distinguish this set of predicted genes and transcripts from previously published 'MLOC_' and HORVU identifiers, genes were prefixed as 'BART1_0-u00000' for the unpadded or 'BART1_0-p00000' for the padded QUASI version, with BART1_0-p00000.000 representing the individual transcript number. The RNA-seq TPM values are shown in interactive stacked bar plots produced with plotly R libraries (https://plotly.com/r/) and the TPM values are also available as a text file for each gene. The exon structures of the transcripts for each gene are shown in graphical form, and links to the transcripts themselves provides access to the transcript sequences in FASTA format. Each transcript has also been compared to the published set of predicted genes (HORVUs) to provide backwards compatibility.
www.nature.com/scientificdata www.nature.com/scientificdata/ GO annotation. Transcript sequences were translated to protein sequences using TransDecoder (https:// github.com/TransDecoder/TransDecoder/wiki). Gene Ontology (GO) annotation was then determined by running all 60,444 genes in BaRTv1.0 through Protein ANNotation with Z-score (PANNZER) 40 . GO annotations were based on predicted proteins with ORF >100 amino acids and orthologues found in the Uniprot database. Output annotations were placed in a lookup table with text descriptions about protein functionality.
The results matrix containing all the TPM values across all 843 samples for all 177,240 BaRTv1.0 transcripts can be downloaded directly along with the metadata file from https://ics.hutton.ac.uk/eorna/download.html. An additional version of the results matrix and metadata file is available in the Zenodo repository (https://doi. org/10.5281/zenodo.4286079) 42   genes and 177,240 transcripts mapped to the cv. Morex pseudomolecules. To access the barley reference transcript dataset a public database and website front-end were constructed to allow researchers to download the reference transcript dataset and interrogate the data via a BLAST search, keyword search or string search using the BaRT or HORVU gene/transcript identifiers (https://ics.hutton.ac.uk/barleyrtd/index.html) 19 . The transcripts are arranged as gene models and viewed through GBrowse 43 . Transcript sequences are given in FASTA format and homologies of the longest transcripts are compared to Arabidopsis, Rice and Brachypodium. Until now, Salmon calculated TPM values for each gene across 16 different tissues/developmental stages in both graphic and tabular formats is presented. Since the initial publication, the BaRTV1.0 database has continued to evolve and we have established Gene Ontology (GO) annotation for 26,794 genes using Protein ANNotation with Z-score (PANNZER) 40 with text descriptions about protein functionality and provided a lookup table for download.

EORNA database -Quantification of multiple RNA-seq samples and expression plots. Establishing
BaRTv1.0 has facilitated fast, precise quantification of RNA transcript abundance from any barley short-read RNA-seq dataset. We used BaRTv1.0 to quantify transcript abundance and diversity observed in a collection of 22 Illumina short-read RNA-seq experiments, 18 of which were obtained from the short-read archive (SRA) and the remainder produced in-house. Each RNA-seq experiment was given a label that contained the letter E (referring to external datasets) followed by a number or the letter I (internal datasets) followed by a number. The datasets contained a total of 843 samples and 3,762 Gbp of expressed sequences. The samples consist of both barley landraces and cultivars, an array of organs and tissues at different developmental stages, and plants/seedlings grown under a range of biotic and abiotic stresses (Supplementary Tables S1 and S2). Most RNA-seq datasets consisted of paired-end reads (90-150 bp in length) and were produced using Illumina HiSeq 2000, 2500, 4000 or HiSeq X instruments. Exceptions were the dataset from Golden Promise anthers and meiocytes, which contained over 2 billion paired end 35-76 bp reads. The raw RNA-seq data from all samples was trimmed and adapters removed using Trimmomatic and quality controlled using FastQC. TPM values were calculated individually for all 843 RNA-seq samples using Salmon (version Salmon-0.8.2) using BaRTv1.0-QUASI, a 'padded' version of BaRTv1.0 which has been shown to improve transcript quantification, as the reference transcript dataset 19 . As BaRTv1.0 was assembled using the cv. Morex reference genome, we first assessed the mapping rates from all samples, including those from other genotypes. The Morex samples showed an average mapping rate of 94.39% (SD 8.18%) while the remaining samples, which consisted of 60 different barley genotypes showed a slightly reduced mapping rate of 92.32% (SD 4.93%) (Supplementary Table 3).
Salmon estimates the relative abundance of different transcript isoforms in the form of transcripts per million (TPM), a commonly used normalisation method computed considering the library size, number of reads and the effective length of the transcript 20,21 . The EORNA data provides an opportunity to examine the effect of the normalisation procedure across many diverse samples. Regression analyses was used to explore the raw read counts and different versions of normalised counts by library size and effective length of the transcript. Good normalisation procedures will remove most of the dependency on these variables such that the output of regression analysis represented by the R-square value (which measures the percentage of variation accounted for) can be used to compare different normalisations. Here, an R-square value closer to zero indicates effective normalisation. For efficient calculation, we first reduced the number of transcripts by selecting those which had non-zero values in at least 80% of the samples. This left 32739 transcripts over the 843 samples and gave 27,598,977 values to study how different normalisation approaches accounted for variation between experiments. Regression analysis was used first to explore the relationship between raw read counts by library size and length of the transcript, which gave an adjusted R-squared value of 1.28% indicating low predictive value within the dataset. Transposing variables to a log-scale increased the R-square to 10.68%, which suggested a far stronger predictive value on this scale and shows that a large amount of variation in the raw counts can be removed by log-transforming. Replacing the log counts with normalised data using Salmon's effective transcript length, which corrects for transcript length bias 20 , reduced the adjusted R-square value to 0.09%. This compared to normalisation by RPKM (Reads Per Kilobase www.nature.com/scientificdata www.nature.com/scientificdata/ per Million which normalises the raw read count by transcript length and sequencing depth) (adjusted R-square of 0.57%) or TPMs calculated by transcript length alone (adjusted R-square of 0.62%). (Online-only Table 1). In summary, the normalised TPM outputs from Salmon using an effective transcript length reduced variability such that most of the dependency on library size and transcript length was removed.
The normalised output TPM values from Salmon were collated and plotted using plotly R libraries (https:// plotly.com/r/) to allow quick subjective and interactive comparisons in transcript abundance levels between the samples. The TPM values for each gene/plot are also given as a text file for download. We chose to plot the graphs as the TPM values without log scaling, to show the additive changes between the samples and replicates.
Expression plot utility. Stacked bar graph plots display the TPM values calculated by Salmon for all 60,444 genes in the database for all 843 samples, representing over 50 million plot points. The x-axis displays the 843 samples versus the y-axis which displays transcript abundance in each sample as TPM values (Fig. 1). Each individual sample bar graph stacks the TPM values contributed by each gene transcript to permit visualisation of the differences in transcript abundances between different samples and helps identify the predominant transcript(s) for that gene. Each plot may be scanned interactively to activate a label that gives information on the RNA-seq experiment, sample run number, tissue and treatment for that sample (from the metadata table, Supplementary  Table 2). Users can zoom in to focus on individual experiment and sample plots. Without processing the data or assigning any statistical significance to the graphs, the results presented allow the researcher to determine whether their gene(s) of interest are expressed in the different experiments and among samples within an experiment. Large changes in TPM abundances were observed between the samples for many genes. For example, BaRT1_u-31819 showed altered gene expression in the root meristematic zone compared to the root elongation and maturation zones in the E1 dataset, which is further supported by expression in the root tissue in the I1 dataset (Fig. 1).
Tissue specific expression. The experimental panel of 22 RNA-seq datasets were from a broad range of cultivars, tissues, organs and biotic and abiotic conditions. The interactive plots enable the user to quickly identify potential candidate genes that show a high degree of tissue specificity. For example, BART1_0-u49225 (with similarity to a UDP-Glycosyltransferase superfamily protein) was specifically and highly expressed to over 1,000 TPM in developing grain 15 days post anthesis (I1) and in developing barley spikes that contain developing grain (E20). Expression was segregating in hulless barley grain in recombinant inbred lines that were used to assess glucan content (E10). (Fig. 2a). BART1_0-u14427 was highly abundant only in tissues subjected to low temperature stress (E2 and I2) (Fig. 2b) and BART1_0-u50915 is one of a number of barley Pathogenesis-related 1 protein genes that was induced to over 10,000 TPM in response to Cochliobolus sativus (E19) and Fusarium graminearum (E20) (Fig. 2c).

Confirmatory expression.
Interactive plots may be used to investigate the expression of genes that have been previously studied in a limited number of tissues/cultivars or using a different expression platform and consequently expands expression analysis across the range of tissues that are currently in EORNA. For example, we previously described the expression of INTERMEDIUM-C (BART1_0-u26546; HORVU4Hr1G007040), a modifier of lateral spikelet fertility in barley and an ortholog of the maize domestication gene TEOSINTE BRANCHED 1. Microarray analysis of 15 tissues showed that transcript abundance was low with greatest expression in the developing inflorescence 44 . The RNA-seq panel here confirmed low abundances for this gene across all the samples (<7.5 TPM), with greatest expression in shoot apices (E7); apical meristems (E13) and developing spikes at the awn primordium stage (E14) (Fig. 3).

Segregation expression.
The RNA-seq datasets consist of several experiments that contain mutant lines targeted to specific genes, recombinant inbred lines (RILs) and near isogenic lines (NILs). The expression of genes found at quantitative trait loci, or through genome-wide association studies show changes in gene expression at these loci between the parents and in the population. The seed longevity experiment (E17) illustrated gene  www.nature.com/scientificdata www.nature.com/scientificdata/ expression changes in RILs and NILs from the landraces L94 (short-lived seeds) and Cebada capa (long-lived seeds). QTL analysis identified three QTLs on 1H (SLQ1.1 to 1.3) and a single QTL on 2H (SLQ2). Gene expression analysis identified differentially expressed genes positioned within the SLQ1 and 2 regions 45 . Using the interactive plots confirmed the barley population expression pattern of these differentially expressed genes. The plots show changes among the parental types retained in the recombinant inbred and near isogenic lines (Fig. 4). For example, BART1_0-u01011(MLOC_61374) is positioned within SLQ1.1 and showed low expression in Cebada capa and the NILs at SLQ1.1 (Fig. 4a) and BART1_0-u15865 (MLOC_73587) showed expression in Cebada capa that was absent in L94 and found expressed in SLQ2 NILs Fig. 4b). The transcript abundances of these genes were shown in the context of the remaining 21 RNA-seq experiments tested.
Gene targeted mutations. Deletion or substitution mutations may impact regulatory gene sequences governing the expression of a target gene or alter the protein coding region of a gene. The outcome of a mutation on observed transcript abundance may vary substantially, resulting in loss, reduced, maintained or increased transcript levels. The interactive plots allow researchers to observe rapidly and intuitively the effect of a mutation on the expression of a target gene and possible trans-acting effects on the expression of other genes. For example, experiment E19 consists of a series of disease resistance tests on cv. Morex and a gamma irradiation induced Morex mutant (14-40) selected for its susceptibility to spot blotch (Bipolaris sorokiniana) 46 . The expression of BART1_0-u18601; HORVU3Hr1G019920 (glycine-rich protein) and BART1_0-u41161; HORVU5Hr1G120850 (similarity to a long-chain-fatty-acid-CoA ligase 1) were knocked out in the mutant, which is clearly observed in the interactive plots (Fig. 5).

Transcript variation between cultivars.
To create the BaRTv1.0 RTD, transcripts from multiple datasets from a range of tissues, treatments and cultivars were mapped to cv. Morex pseudomolecules to maximise read coverage support for genes and splice junctions 19 . BaRTv1.0 is, therefore, a predominantly cv. Morex RTD. Nevertheless, transcripts that contain indels in other cultivars will be found in BaRTv1.0. Salmon quantifications of the 843 individual samples was able to identify and quantify cultivar specific transcripts. BaRT1_u-06868 showed a selection of different transcripts due to genotype differences. Alignment with genomic sequence and the most highly abundant transcripts shows a small run of 4 GCAG repeats in one genotype compared to a run of 3 GCAG repeats in a different genotype. These genotype specific variant transcripts were observed across the range of cultivars used in the RNA-seq experiments. For example, the experimental dataset E1 shows two different cultivars cvs. Clipper and Sahara with two different main transcript variants, which is the result of the 4 bp indel. Clipper shows use of the transcripts .001 and .002 while Sahara uses transcripts .005 and .006 (Fig. 6). The transcriptome assemblies and quantifications using BaRTv1.0 shows that cultivar specific transcripts can be easily distinguished.
Alternative splice site switching. Selection of alternative splice sites results in the formation of multiple alternative transcripts. The proportions of alternative transcripts may change in different tissues or as the result of a changing environment. Many of these changes require detailed analysis to determine significant changes in the amounts and proportions of the alternative transcripts. Nevertheless, the stacked bar graphs allow large changes in the abundance of alternative transcripts to be detected between samples. For example, BaRT1_u-00022 was expressed across all tissues but in some samples an alternative transcript, BaRT1_u-00022.001, shown in blue, predominated over BaRT1_u-00022.003 shown in green (Fig. 7a). The difference between the two transcripts  www.nature.com/scientificdata www.nature.com/scientificdata/ was an alternative intron in the 3′UTR, which was retained in transcript .001 and spliced out in transcript .003. Comparison with the meta-data (Supplementary Table 2) showed tissue specific abundance of transcript .001 in grain/caryopsis and germinating grain (coleoptiles) in the experimental datasets E8, E10, E17, I1 and I2. Comparison across the different experiments and replicates supports both the tissue and cultivar specific variation. For example, the alternative 0.001 transcript was also observed in Golden Promise in datasets E11 and I6. The plots also illustrate dynamic changes in alternative splicing in different tissues or because of different stresses. For example, BaRT1_u-40919, which has similarity to a cold inducible Zinc finger-containing glycine-rich  www.nature.com/scientificdata www.nature.com/scientificdata/ RNA-binding protein, shows switching of transcript .001 to .005 during cold stress, which is the result of the selection of an alternative intron (I2) (Fig. 7b). In both these cases, the reading frame of the protein is unaffected but extends the length of the 3′UTR in the transcripts where the intron is retained. These examples highlight transcript variation because of dynamic alternative splicing as a result of tissue/organ specific splicing or changing environmental conditions. Data validation. We did not carry out validation experiments using alternative methods, such as RT-PCR, as we do not have access to all the RNA samples used to produce the RNA-seq data. However, multiple RNA-seq samples consisted of similar tissues or conditions that showed similar gene expression responses. This was particularly noticeable in the genes that showed tissue or condition specific expression, such as those from developing grain tissue, low temperature stress and in response to pathogens (Fig. 2). In addition, we have previously performed RT-PCR alternative splicing validation experiments on 5 of the tissues in the I1 RNA-seq experiment and found a strong correlation (R-square = 0.83) with the alternatively spliced transcript proportions of RNA-seq, supporting the ability of the RNA-seq data to accurately detect changes in AS 19 . technical development. BaRT is under constant incremental improvement. The next release of BaRT is being developed by incorporating new short and, importantly, long-read RNA-seq datasets. The need to capture the diversity of different transcripts from a wider range of genotypes will further lead to the development of a pan-transcriptome barley RTD to match a barley pan-genome sequence 5,47,48 . This will ultimately result in recalculation of the EORNA TPM values. In addition, new RNA-seq experiments are constantly submitted to the sequence archives. We are currently developing a pipeline that allows automated addition of newly deposited RNA-seq datasets associated with subsequent quantification using the latest RTD and updated releases of EORNA. This will continually expand the utility of the interactive plots and provide straightforward and open access of RNA-seq data to researchers, adding considerable value to the stand-alone RNA-seq datasets. Access to TPM values will enable the construction of transcript/co-expression/regulatory networks and support the development of proteomic resources for barley.

Usage Notes
The expression data is easily accessible through an intuitive and easy to use Web interface: https://ics.hutton. ac.uk/eorna/index.html.
Gene and transcript sequence information and expression data can be accessed through Homology Searches, Annotation Searches or thorough BLAST nucleotide or protein sequences. Barley Pseudomolecule gene names (HORVU numbers) can be easily translated to BART identifiers.
The plots showing individual gene expression across all the samples has a link under the plot to a text delimited file with all the expression (TPMs), tissue, condition, cultivar and replicate. The whole dataset describing expression of all the BaRT genes can downloaded as a single txt delimited file. This is further stored at https://doi. org/10.5281/zenodo.4286079 42 .